Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics

  • Lisa M. Breckels,

    Affiliations Computational Proteomics Unit, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom, Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom

  • Sean B. Holden,

    Affiliation Computer Laboratory, University of Cambridge, Cambridge, United Kingdom

  • David Wojnar,

    Affiliation Quantitative Biology Center, Universität Tübingen, Tübingen, Germany

  • Claire M. Mulvey,

    Affiliation Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom

  • Andy Christoforou,

    Affiliation Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom

  • Arnoud Groen,

    Affiliation Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom

  • Matthew W. B. Trotter,

    Affiliation Celgene Institute for Translational Research Europe (CITRE), Sevilla, Spain

  • Oliver Kohlbacher,

    Affiliations Quantitative Biology Center, Universität Tübingen, Tübingen, Germany, Center for Bioinformatics, Universität Tübingen, Tübingen, Germany, Biomolecular Interactions, Max Planck Institute for Developmental Biology, Tübingen, Germany

  • Kathryn S. Lilley,

    Affiliation Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom

  • Laurent Gatto

    lg390@cam.ac.uk

    Affiliations Computational Proteomics Unit, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom, Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom

Abstract

Sub-cellular localisation of proteins is an essential post-translational regulatory mechanism that can be assayed using high-throughput mass spectrometry (MS). These MS-based spatial proteomics experiments enable us to pinpoint the sub-cellular distribution of thousands of proteins in a specific system under controlled conditions. Recent advances in high-throughput MS methods have yielded a plethora of experimental spatial proteomics data for the cell biology community. Yet, there are many third-party data sources, such as immunofluorescence microscopy or protein annotations and sequences, which represent a rich and vast source of complementary information. We present a unique transfer learning classification framework that utilises a nearest-neighbour or support vector machine system, to integrate heterogeneous data sources to considerably improve on the quantity and quality of sub-cellular protein assignment. We demonstrate the utility of our algorithms through evaluation of five experimental datasets, from four different species in conjunction with four different auxiliary data sources to classify proteins to tens of sub-cellular compartments with high generalisation accuracy. We further apply the method to an experiment on pluripotent mouse embryonic stem cells to classify a set of previously unknown proteins, and validate our findings against a recent high resolution map of the mouse stem cell proteome. The methodology is distributed as part of the open-source Bioconductor pRoloc suite for spatial proteomics data analysis.

Author Summary

Sub-cellular localisation of proteins is critical to their function in all cellular processes; proteins localising to their intended micro-environment, e.g. organelles, vesicles or macro-molecular complexes, will meet the interaction partners and biochemical conditions suitable to pursue their molecular function. Sound data and methods that the cell biology community can rely upon to study protein localisation reliably and systematically, and hence mis-localisation and the disruption of protein trafficking, are therefore essential. Here we present a method to infer protein localisation relying on the optimal integration of experimental mass spectrometry-based data and auxiliary sources, such as GO annotation, outputs from third-party software, protein-protein interactions or immunocytochemistry data. We found that the application of transfer learning algorithms across these diverse data sources considerably improves on the quantity and reliability of sub-cellular protein assignment, compared to single data classifiers previously applied to infer sub-cellular localisation using experimental data only. We show that our method does not compromise biologically relevant experiment-specific signal after integration with heterogeneous freely available third-party resources. The integration of different data sources is an important challenge in the data intensive world of biology and we anticipate the transfer learning methods presented here will prove useful to many areas of biology, to unify data obtained from different but complementary sources.

This is a PLoS Computational Biology methods paper.

Introduction

Cell biology is currently undergoing a data-driven paradigm shift [1]. Molecular biology tools, imaging, biochemical analyses and omics technologies enable cell biologists to track the complexity of many fundamental processes such as signal transduction, gene regulation, protein interactions and sub-cellular localisation [2]. This remarkable success has resulted in dramatic growth in data over the last decade, both in terms of size and heterogeneity. Coupled with this influx of experimental data, databases such as UniProt [3] and the Gene Ontology [4] have become more information rich, providing valuable resources for the community. The time is ripe to take advantage of complementary data sources in a systematic way to support hypothesis- and data-driven research. However, one of the biggest challenges in computational biology is how to meaningfully integrate heterogeneous data; transfer learning, a paradigm in machine learning, is ideally suited to this task.

Transfer learning has yet to be fully exploited in computational biology. To date, various data mining and machine learning (ML) tools, in particular classification algorithms, have been widely applied in many areas of biology [5]. A classifier is trained to learn a mapping between a set of observed instances and associated external attributes (class labels), which is subsequently used to predict the attributes of data with unknown class labels (unlabelled data). In transfer learning, there is a primary task to solve, and associated primary data which is typically expensive, of high quality and targeted to address a specific question about a specific biological system/condition of interest. While standard supervised learning algorithms seek to learn a classifier on this data alone, the general idea in transfer learning is to complement the primary data by drawing upon an auxiliary data source, from which one can extract complementary information to help solve the primary task. The auxiliary data typically contains information that is related to the primary learning objective, but was not primarily collected to tackle the specific primary research question at hand. These data can be heterogeneous to the primary data and are often, but not necessarily, cheaper to obtain and more plentiful but with lower signal-to-noise ratio.

There are several challenges associated with the integration of information from auxiliary sources. Firstly, if the primary and auxiliary sources are combined via straightforward concatenation, the signal in the primary can be lost through dilution with the auxiliary due to the latter being more plentiful and often having lower signal-to-noise ratio (see Fig H in S5 File for an illustration). Feature selection can be used to extract the attributes with the most distinct signals; however, the challenge remains of how to combine these data in a meaningful way. Secondly, combining data that exist in different data spaces is often not straightforward, and different data types can be sensitive to the classifier employed, in terms of classifier accuracy.

In one of the first applications of transfer learning, Wu and Dietterich [6] used a k-nearest neighbours (k-NN) and support vector machine (SVM) framework for plant image classification. Their primary data consisted of high-resolution images of isolated plant leaves and the primary task was to determine the tree species given an isolated leaf. An auxiliary data source was available in the form of dried leaf samples from a Herbarium. Using a kernel derived from the shapes of the leaves and applying the transfer learning (TL) framework [6], they showed that when primary data is small, training with auxiliary data improves classification accuracy considerably. There were several limitations in their methods: firstly, the data in the k-NN TL classifier were only weighted by data source and not on a class-by-class basis, and, secondly, in the SVM framework both data sources were expected to have the same dimensions and lie in the same space. We present an adaptation and significant improvement of this framework and extend the usability of the method by (i) incorporating a multi-class weighting schema in the k-NN TL classifier, and (ii) allowing the integration of primary and auxiliary data with different dimensions in the SVM schema to allow the integration of heterogeneous data types. We apply this framework to the task of protein sub-cellular localisation prediction from high resolution mass spectrometry (MS)-based data.

Spatial proteomics, the systematic large-scale analysis of a cell’s proteins and their assignment to distinct sub-cellular compartments, is vital for deciphering a protein’s function(s) and possible interaction partners. Knowledge of where a protein spatially resides within the cell is important, as it not only provides the physiological context for their function but also plays an important role in furthering our understanding of a protein’s complex molecular interactions e.g. signalling and transport mechanisms, by matching certain molecular functions to specific organelles.

There are a number of sources of information which can be utilised to assign a protein to a sub-cellular niche. These range from high quality data produced from experimental high-throughput quantitative MS-based methods (e.g. LOPIT [7] and PCP [8]) and imaging data (e.g. [9]), to freely available data from repositories and amino acid sequences. The former, in a nutshell, involves cell lysis followed by separation and fractionation of the subcellular structures as a function of their density, and then selecting a set of distinct fractions to quantify by mass spectrometry. These quantitative protein profiles are representative of organelle distribution and hence are indicative of their subcellular localisation [10]. Based on the distribution of a set of known genuine organelle marker proteins, pattern recognition and ML methods can be used to match and associate the distributions of unknown residents to that of one of the markers. There is thus a reliance on reliable organelle markers and statistical learning methods for robust proteome-wide localisation prediction [11]. These approaches have been utilised to gain information about the sub-cellular location of proteins in several biological examples, such as Arabidopsis [7, 12–16], Drosophila [17], yeast [18], human cell lines [19, 20], mouse [8, 21] and chicken [22], using a number of algorithms, such as SVMs [23], k-NN [15], random forest [24], naive Bayes [14], neural networks [25], and partial-least squares discriminant analysis [7, 17, 22].

Although the application of computational tools to spatial proteomics is a recent development, the determination of protein localisation using in silico data such as amino acid sequence features (e.g. [26–40]), functional domains (e.g. [41, 42]), protein-protein interactions (e.g. [43–45]) and the Gene Ontology (GO) [4] (e.g. [46–49]) is well-established (reviewed in [50–52]). One may question the biological relevance and ultimate utility to cell biology of such predictors as protein sequences and their annotation do not change according to cellular condition or cell type, whereas protein localisation can change in response to cellular perturbation. Notwithstanding the inherent limitations of using in silico data to predict dynamic cell- and condition-specific protein location, transfer learning [6, 47–49, 53] may allow the transfer of complementary information available from these data to classify proteins in experimental proteomics datasets.

Here, we present a new transfer learning framework for the integration of heterogeneous data sources, and apply it to the task of sub-cellular localisation prediction from experimental and condition-specific MS-based quantitative proteomics data. Using the k-NN and SVM algorithms in a transfer learning framework we find that when given data from a high quality MS experiment, integrating data from a second less information rich but more plentiful auxiliary data source directly into classifier training and classifier creation results in the assignment of proteins to organelles with high generalisation accuracy. Five experimental MS LOPIT datasets, from four different species, were employed in testing the classifiers. We show the flexibility of the pipeline through testing four auxiliary data sources: (1) Gene Ontology terms, (2) immunocytochemistry data [9], (3) sequence and annotation features, and (4) protein-protein interaction data [54]. The results obtained demonstrate that this transfer learning method outperforms a single classifier trained on each single data source alone and on a class-by-class basis, highlighting that the primary data is not diluted by the auxiliary data. This methodology forms part of the open-source open-development Bioconductor [55] pRoloc [56] suite of computational methods available for organelle proteomics data analysis.

Results

Here, we have adapted a classic application of inductive transfer learning (TL) [6] using experimental quantitative proteomics data as the primary source and Gene Ontology Cellular Compartment (GO CC) terms as the auxiliary source. Using this TL approach, we have exploited auxiliary data to improve upon the protein localisation prediction from quantitative MS-based spatial proteomics experiments using (1) a class-weighted k-NN classifier, and (2) an SVM classifier in a TL framework. We also show the flexibility of the framework by using data from the Human Protein Atlas [57] and input sequence and annotation features from the YLoc [58, 59] web server, and protein-protein interaction data from the STRING database [54] as auxiliary data sources.

To assess classifier performance we employed the classic machine learning schema of partitioning our labelled data into training and testing sets, and used the testing sets to assess the strength of our classifiers. Parameter optimisation was conducted on the labelled training data using 100 rounds of stratified 80/20 partitioning, in conjunction with 5-fold cross-validation in order to estimate the free parameters via a grid search, as implemented in the pRoloc package [56] (and described in the methods below). Here, for the k-NN TL algorithm these parameters are the weights assigned to each class for each data source, and for the SVM TL algorithm these are C, γP and γA for the two kernels, as described in the materials and methods. The testing set is then used to assess the generalisation accuracy of the classifier. By applying the best parameters found in the training phase to the test data, observed and expected classification results can be compared, giving an estimate of the classifier’s ability to generalise, that is, to predict the class label of an unknown example with high accuracy. This schema was repeated for all 5 datasets, and for the SVM and k-NN classifiers, trained on (i) LOPIT alone, and (ii) GO CC alone, for comparison with the TL algorithms.
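The partitioning step can be illustrated with a minimal sketch, assuming marker labels are held in a plain list. The function name `stratified_split` and the toy labels are hypothetical and are not part of pRoloc; the cross-validated grid search that runs inside each training partition is omitted.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test sets, sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = list(by_class[label])
        rng.shuffle(members)
        # Hold out roughly test_fraction of each class, but always keep
        # at least one example of every class on the training side.
        n_test = min(round(len(members) * test_fraction), len(members) - 1)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy marker labels (hypothetical): 10 ER and 5 mitochondrial markers.
labels = ["ER"] * 10 + ["Mitochondrion"] * 5

# 100 independent stratified 80/20 partitions, one per random seed.
partitions = [stratified_split(labels, seed=s) for s in range(100)]
train_idx, test_idx = partitions[0]
```

Sampling within each class keeps small sub-cellular classes represented on both sides of every partition, which a simple random split would not guarantee.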

For simplicity, throughout this manuscript we refer to the mouse pluripotent embryonic stem cell dataset as the ‘mouse dataset’, the human embryonic kidney fibroblast dataset as the ‘human dataset’, the Drosophila embryos dataset as the ‘fly dataset’, the Arabidopsis thaliana callus dataset as the ‘callus dataset’ and finally the second Arabidopsis thaliana roots dataset, as the ‘roots dataset’.

The k-NN transfer learning classifier

The median macro-F1 scores for the mouse, human, callus, roots and fly datasets were 0.879, 0.853, 0.863, 0.979, 0.965, respectively, for the combined k-NN transfer learning approach. A two sample t-test showed that over 100 test partitions, the mean estimated generalisation performance for the k-NN transfer learning approach was significantly higher than for classifiers trained on primary or auxiliary data alone for the mouse (p = 2e−21 for primary alone and p = 7e−78 for auxiliary alone), human (p = 1e−7 for primary alone and p = 8e−32 for auxiliary alone), plant roots (p = 4e−17 for primary alone and p = 4e−22 for auxiliary alone), and fly (p = 3e−5 for primary alone, p = 1e−112 for auxiliary alone) data (Fig 1). We found that the plant callus dataset was neither significantly improved nor detrimentally affected by the incorporation of auxiliary data. This was unsurprising as this dataset is extremely well-resolved in LOPIT (Fig A in S1 File, top right): 100 rounds of training and testing with a baseline k-NN classifier resulted in a median macro F1-score of 0.985 (the combined approach yielded a macro F1-score of 0.973).
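For readers unfamiliar with the macro-F1 score, the following sketch computes it under the common definition, the unweighted mean of the per-class F1 scores. This definition is an assumption here; the methods section should be consulted for the exact formula used in this work. The labels are toy values.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 scores (assumed definition)."""
    pairs = list(zip(true_labels, predicted_labels))
    f1_scores = []
    for c in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy test partition (hypothetical): one ER marker is misclassified.
true_labels = ["ER", "ER", "Mito", "Mito", "Mito"]
predicted_labels = ["ER", "Mito", "Mito", "Mito", "Mito"]
score = macro_f1(true_labels, predicted_labels)
```

Because every class contributes equally, a poorly resolved small class lowers the macro-F1 as much as a poorly resolved large one.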

Fig 1. Boxplots, displaying the estimated generalisation performance over 100 test partitions.

Results for the k-NN transfer learning algorithm applied with (i) optimised class-specific weights (combined), (ii) only primary data and (iii) only auxiliary data, for each dataset.

https://doi.org/10.1371/journal.pcbi.1004920.g001

The k-NN transfer learning classifier uses optimised class weights to control the proportion of primary to auxiliary neighbours to use in classification. One advantage of this approach is the ability for the user to set class weights manually, allowing complete control over the amount of auxiliary data to incorporate. As previously described, the class weights can be set through prior optimisation on the labelled training data. Fig 2 shows the detailed results for the mouse dataset; the distribution of the 100 best weights selected over 100 rounds of optimisation is shown on the top left. We found the distribution of weights for each dataset reflected closely the sub-cellular resolution in each experiment. For example, for the mouse dataset the distributions of best weights identified for the endoplasmic reticulum (ER), mitochondria and chromatin niches are heavily skewed towards 1, indicating that the neighbours used in classification should be predominantly primary. Note, as described in the methods, if the class weight is set to 1, then strictly only neighbours in the primary data are used in classification and, similarly, if the class weight is 0 then all weight is given to the auxiliary data. If the weight falls between these two limits, the neighbours in both the primary and auxiliary data sources are considered. From examining the principal components analysis (PCA) plot (Fig 2, top right) we indeed found that these organelles are well separated in the LOPIT experiment. Conversely, we found that the 40S ribosome overlaps somewhat with the nucleus (non-chromatin) cluster (Fig 2, top right), which is reflected in the best choice of class weights for these two niches; they were both assigned best weights of 1/3 and their weight distributions are skewed towards 0, indicating that more auxiliary data should be used to classify these sub-cellular classes. If we further examine the class-F1 scores for these two sub-cellular niches (Fig 2, bottom) we indeed find that including the auxiliary data in classification yields a significant improvement in generalisation accuracy (p = 1e−16 for 40S ribosome (red) and p = 1e−10 for the nucleus (non-chromatin) (pink)). We also found this to be the case for the proteasome, which overlaps with the cytosol. We found LOPIT alone did not distinguish between these two sub-cellular niches in this particular experiment; however, the addition of auxiliary data from the Gene Ontology resulted in a significant improvement in classifier prediction (p = 2e−16) as shown by the class-specific box plot in Fig 2, bottom (black).
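The role of the class weights can be made concrete with a small sketch. It assumes, as one plausible reading of the description above, that the vote for each class is a linear blend of the neighbour counts in the two data sources, with the class weight applied to the primary count and its complement to the auxiliary count. The function `weighted_vote` and the toy neighbour labels are hypothetical; the exact voting rule is given in the methods.

```python
from collections import Counter


def weighted_vote(primary_neighbours, auxiliary_neighbours, class_weights):
    """Blend per-class neighbour counts from both sources (assumed rule)."""
    n_primary = Counter(primary_neighbours)
    n_auxiliary = Counter(auxiliary_neighbours)
    votes = {}
    for label, theta in class_weights.items():
        # theta = 1: only primary neighbours count for this class;
        # theta = 0: only auxiliary neighbours count for this class.
        votes[label] = theta * n_primary[label] + (1 - theta) * n_auxiliary[label]
    # Iterate over sorted labels so that ties are broken deterministically.
    best = max(sorted(votes), key=votes.get)
    return best, votes


# Toy weights mirroring the text: ER fully primary, two overlapping
# niches weighted towards the auxiliary data (1/3).
class_weights = {"ER": 1.0, "40S ribosome": 1 / 3, "Nucleus": 1 / 3}
primary_neighbours = ["ER", "ER", "40S ribosome", "Nucleus", "Nucleus"]
auxiliary_neighbours = ["40S ribosome"] * 4 + ["Nucleus"]

best, votes = weighted_vote(primary_neighbours, auxiliary_neighbours, class_weights)
```

In this toy case the primary neighbours alone would leave ER and the nucleus tied, whereas the blended vote favours the 40S ribosome because most auxiliary neighbours support it.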

Fig 2. Visualisation of k-NN TL results.

Top left: Bubble plot, displaying the distribution of the optimised class weights over the 100 test partitions for the transfer learning algorithm applied to the mouse dataset. Top right: Principal components analysis plot (first and second components, of the possible eight) of the mouse dataset, showing the clustering of proteins according to their density gradient distributions. Bottom: Sub-cellular class-specific box plots, displaying the estimated generalisation performance over 100 test partitions for the transfer learning algorithm applied with (i) optimised class-specific weights (combined), (ii) only primary data and (iii) only auxiliary data, for each sub-cellular class.

https://doi.org/10.1371/journal.pcbi.1004920.g002

Many experiments are specifically targeted towards resolving a particular organelle of interest (e.g. the TGN in the roots dataset), which requires careful optimisation of the LOPIT gradient. In such a setup, sub-cellular niches other than the one of interest may not be well-resolved, simply because the gradient was not optimised for maximal separation of all sub-cellular niches, but only of one or a few particular organelles. Such experiments in particular may benefit from the incorporation of auxiliary data. We found that for the roots dataset all sub-cellular classes, except the TGN sub-compartment, benefitted from including auxiliary data (Fig C in S1 File, bottom), highlighting the advantage of using more than one source of information for sub-cellular protein classification. The best weight for the TGN was found to be 1 (Fig C in S1 File, top left), as expected, indicating high resolution in LOPIT for this class. In this framework we are able to resolve different niches in the data according to different data sources, further highlighted in the class-specific boxplots in Figs A-D in S1 File.

The SVM transfer learning classifier

Adapting Wu and Dietterich’s classic application of transfer learning [6] we have implemented an SVM transfer learning classifier that allows the incorporation of a second auxiliary data source to improve upon the classification of experimental and condition-specific sub-cellular localisation predictions. The method employs the use of two separate kernels, one for each data source. As previously described, to assess the generalisation accuracy of our classifier we employed the classic machine learning schema of partitioning our labelled data into training and testing sets, and used the testing sets to assess the strength of our classifiers. This was repeated on 100 independent partitions for (i) the SVM TL method, (ii) a standard SVM trained on LOPIT alone, and (iii) a standard SVM trained on GO CC alone.
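To illustrate how two kernels allow data sources of different dimensions to be used together, the sketch below builds a combined Gram matrix as the sum of two Gaussian (RBF) kernels, one per data source, each with its own width parameter. Summing the kernels and the RBF form are assumptions made for illustration; the function names, the toy profiles and the parameter values are hypothetical, and the precise formulation is given in the methods.

```python
import math


def rbf(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def combined_gram(primary, auxiliary, gamma_p, gamma_a):
    """Sum of a primary-data kernel and an auxiliary-data kernel (assumed)."""
    n = len(primary)
    return [
        [
            rbf(primary[i], primary[j], gamma_p)
            + rbf(auxiliary[i], auxiliary[j], gamma_a)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three toy proteins: 4 quantitative fractions (primary) and
# 2 binary annotation terms (auxiliary). Dimensions differ by design.
primary = [[0.1, 0.4, 0.3, 0.2], [0.1, 0.5, 0.2, 0.2], [0.6, 0.1, 0.1, 0.2]]
auxiliary = [[1, 0], [1, 0], [0, 1]]
gamma_p, gamma_a = 1.0, 0.5

K = combined_gram(primary, auxiliary, gamma_p, gamma_a)
```

Each kernel only ever compares vectors from its own data source, so the two sources never need to share a feature space; a precomputed matrix of this kind could then be passed to any SVM solver that accepts custom kernels.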

For the SVM TL experiments the resultant median macro-F1 scores for the mouse, human, callus, roots and fly datasets were 0.902, 0.868, 0.956, 0.875, 0.961, respectively, over the 100 partitions. As per the k-NN TL, we found the macro-F1 scores for the SVM TL (Fig A in S2 File) were significantly higher than for classifiers trained on primary or auxiliary data alone: mouse (p = 5e−56 for primary alone and p = 6e−37 for auxiliary alone), human (p = 7e−3 for primary alone and p = 1e−21 for auxiliary alone), callus (p = 4e−3 for primary alone and p = 1e−92 for auxiliary alone), roots (p = 2e−45 for primary alone and p = 7e−25 for auxiliary alone), and fly (p = 3e−3 for primary alone and p = 4e−105 for auxiliary alone) data. This was also evident on the organellar level as seen in the supporting figures in the S2 File.

Other auxiliary data sources

One of the advantages of the transfer learning framework is the flexibility to use different types of information for both the primary and auxiliary data source. We demonstrate the flexibility of this framework by testing other complementary sources of information as an auxiliary data source.

The Human Protein Atlas.

The sub-cellular Human Protein Atlas [57] provides protein expression patterns on a sub-cellular level using immunofluorescent staining of human U-2 OS cells. As described in the materials and methods, the hpar Bioconductor package [60] was used to query the sub-cellular Human Protein Atlas [57] (version 13, released on 11/06/2014). This auxiliary data, to be integrated with our human LOPIT experiment, was encoded as a binary matrix describing the localisation of 670 proteins in 18 sub-cellular localisations. Information for 192 of the 381 labelled marker proteins was available. These 192 proteins covered 8 of the 10 known localisations in the human LOPIT experiment and were used to estimate the classifier generalisation accuracy of (i) the transfer learning approach with both data sources, (ii) the LOPIT data alone and (iii) the HPA data alone, as described previously. As detailed in the supplementary information (Fig A in S3 File), we observed a statistically significant improvement in our overall classification accuracy as well as several positive organelle-specific results.
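The binary encoding can be sketched as follows, assuming each protein maps to the set of localisation terms reported for it. The protein identifiers, the terms and the function name `binary_matrix` are hypothetical and do not reproduce hpar output.

```python
def binary_matrix(annotations):
    """Encode protein -> set of localisation terms as a 0/1 matrix."""
    proteins = sorted(annotations)
    terms = sorted({term for found in annotations.values() for term in found})
    matrix = [
        [1 if term in annotations[protein] else 0 for term in terms]
        for protein in proteins
    ]
    return proteins, terms, matrix


# Toy annotations (hypothetical identifiers and terms).
annotations = {
    "P1": {"Nucleus"},
    "P2": {"Cytoplasm", "Nucleus"},
    "P3": {"Mitochondria"},
}
proteins, terms, matrix = binary_matrix(annotations)
```

Rows correspond to proteins and columns to localisation terms, so a protein observed in several locations simply has several entries set to 1.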

YLoc sequence and annotation features.

Sequence and annotation features used as input to the computational classifier YLoc [58, 59] (see materials and methods, Table 1) were selected as an auxiliary data source to complement the LOPIT mouse stem cell dataset. Thirty-four sequence and annotation features were selected using correlation feature selection, as described in the materials and methods. Using the LOPIT mouse dataset as our primary data, and the 34 YLoc features as our auxiliary data, we employed the standard protocol for testing classifier performance using (i) the k-NN transfer learning with both data sources, (ii) the primary data alone and (iii) the auxiliary data alone. Although we did not observe a statistically significant improvement using the auxiliary data in the transfer learning framework, we did not see any statistically significant disadvantage in combining information (Fig B in S3 File). Thus we found that incorporating data from auxiliary sources in this framework does not dilute any strong signals in the original experiment, demonstrating the flexibility of the classifier.

Table 1. A summary of the types of features considered in training and building Briesemeister et al’s YLoc classifier.

https://doi.org/10.1371/journal.pcbi.1004920.t001

Protein-protein interaction data.

Protein-protein interaction data for the human dataset was retrieved from the STRING database [54] (version 10). An interaction contingency matrix was constructed using the STRING combined scores (see methods). Interaction scores for 1109 possible interaction partners were available for 99 of the 381 markers. As described for the other sources above, using this protein-protein interaction information as an auxiliary data source we employed the standard protocol for testing classifier performance using (i) the k-NN transfer learning with both data sources, (ii) the primary data alone and (iii) the auxiliary data alone. As per the YLoc data, we did not observe a statistically significant improvement from combining auxiliary information with our primary data using transfer learning; however, we did not see any statistically significant disadvantage (Fig C in S3 File).

Biological application

We applied the two transfer learning classifiers to a real-life scenario, using the E14TG2a mouse stem cell dataset as our use-case to (i) demonstrate algorithm application, and (ii) highlight the applicability of the method for predicting protein localisation in MS-based spatial proteomics data over other single-source classifiers.

Sub-cellular protein localisation prediction in mouse pluripotent embryonic stem cells.

The E14TG2a mouse stem cell LOPIT dataset contained 387 labelled and 722 unlabelled protein profiles distributed among 10 sub-cellular niches (Table A in S6 File). Following extraction of the GO CC auxiliary data matrix for all proteins in the dataset, the following five classifiers were applied: (1) a k-NN (with LOPIT data only), (2) the k-NN TL (Breckels) (with LOPIT and GO CC data), (3) the k-NN TL (Wu) (with LOPIT and GO CC data), (4) an SVM (with LOPIT data only) and (5) the SVM TL (Breckels) (with LOPIT and GO CC data). The parameters for each were optimised (see methods) for the prediction of the sub-cellular localisation of the unlabelled proteins in the dataset.

In supervised machine learning the instances which one wishes to classify can only be associated to the classes that were used in training. Thus, it is common when applying a supervised classification algorithm, wherein the whole class diversity is not present in the training data, to set a specific score cutoff on which to define new assignments, below which classifications are set to unknown/unassigned. The pRoloc tutorial, which is found in the set of accompanying vignettes in the pRoloc package [56], describes this procedure and how to implement this in practice. Deciding on a threshold is not trivial as classifier scores are heavily dependent upon the classifier used and different sub-cellular niches can exhibit different score distributions.
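The cutoff procedure amounts to a simple filtering step, sketched below. The function `apply_cutoff`, the per-class thresholds and the scores are hypothetical illustrations and are not taken from pRoloc; classes without an explicit threshold are assumed to fall back on a strict default.

```python
def apply_cutoff(predictions, thresholds, default_threshold=1.0):
    """Keep a predicted class only if its score reaches the class cutoff."""
    assigned = []
    for label, score in predictions:
        cutoff = thresholds.get(label, default_threshold)
        assigned.append(label if score >= cutoff else "unknown")
    return assigned


# Toy classifier output: (predicted class, classifier score).
predictions = [
    ("ER", 0.92),
    ("ER", 0.60),
    ("Mitochondrion", 0.75),
    ("Proteasome", 0.99),
]
# Hypothetical class-specific cutoffs; "Proteasome" uses the default of 1.0.
thresholds = {"ER": 0.8, "Mitochondrion": 0.7}

assigned = apply_cutoff(predictions, thresholds)
```

Allowing one cutoff per class reflects the observation that different sub-cellular niches can exhibit different score distributions; a single global cutoff is the special case where every class shares the same value.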

To validate our results and calculate classification thresholds based on a 5% false discovery rate (FDR) for each of the five classifiers (i.e. k-NN, k-NN TL (Breckels), k-NN TL (Wu), SVM, SVM TL (Breckels)) we compared the predicted localisations with the localisation of the same proteins found in the highest resolution spatial map of mouse pluripotent embryonic stem cells to date [61]. From examining the overlap between our new classifications and the localisations in the high resolution mouse map we found 183 of our 722 unlabelled proteins matched a high confidence localisation in the new dataset. Of the remainder, 347 of our proteins were labelled as unknown in the mouse map (i.e. were assigned a low confidence localisation in the experiment), and 192 proteins did not appear in the map. We used the localisation of these 183 high confidence proteins as our gold standard on which to validate our findings and set an FDR for our predictions.

Increasing classifier discrimination power.

Fig A in S4 File shows the score distributions for correct and incorrect assignments of the unassigned proteins in the dataset (as validated through the hyperLOPIT mouse map [61]) and the distribution of the scores per classifier. Note, the scores are not a reflection of the classification power and the score distributions between the different methods are not comparable to one another as they are calculated using different techniques. For both of the single-source k-NN and SVM classifiers there is a large overlap in the distribution of scores for correct and incorrect assignments (Fig A in S4 File). It is desirable to have a distribution of scores that enables one to choose a cutoff that minimises the FDR. What is evident from examining the score distributions of incorrect and correct assignments is that by using transfer learning we have increased the discrimination power of the classifier and thus lowered our FDR.

This is further highlighted by receiver-operator characteristics (ROC) analysis (Fig 3) in which the performance of the five different classifiers is displayed for different scoring thresholds. For each score cutoff, the ROC curve plots the true-positive rate (TPR) versus the false-positive rate (FPR) for each classifier. We calculated the area under the ROC curve (AUC) for each classifier and found the AUC for the k-NN, k-NN TL (Breckels), k-NN TL (Wu), SVM and SVM TL (Breckels) classifiers were 0.693, 0.746, 0.711, 0.705 and 0.822, respectively.
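As a minimal sketch of the AUC calculation, the code below treats validated correct assignments as positives and incorrect assignments as negatives (an assumption about how the outcomes were encoded) and uses the rank-based equivalence between the AUC and the probability that a correct assignment scores higher than an incorrect one. The scores are toy values.

```python
def roc_auc(scores, is_correct):
    """AUC as the fraction of (correct, incorrect) pairs ranked properly."""
    positives = [s for s, ok in zip(scores, is_correct) if ok]
    negatives = [s for s, ok in zip(scores, is_correct) if not ok]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correct assignment ranked above incorrect
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy classifier scores and their validation outcome (hypothetical).
scores = [0.9, 0.8, 0.7, 0.6, 0.4]
is_correct = [True, True, False, True, False]
auc = roc_auc(scores, is_correct)
```

An AUC of 0.5 corresponds to scores that carry no information about correctness, and 1.0 to scores that separate correct from incorrect assignments perfectly.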

Fig 3. Receiver-operator characteristics (ROC) analysis.

The performance of the five classifiers for varying scoring thresholds. Each point on a ROC curve shows the true-positive rate (TPR) versus the false-positive rate (FPR) at one score cutoff. We calculated the area under the ROC curve (AUC) for each curve; the AUCs for the k-NN, k-NN TL (Breckels), k-NN TL (Wu), SVM and SVM TL (Breckels) classifiers were 0.693, 0.746, 0.711, 0.705 and 0.822, respectively.

https://doi.org/10.1371/journal.pcbi.1004920.g003
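As an illustration of the AUC calculation described above, the following minimal sketch computes the area under the ROC curve from classifier scores and correct/incorrect flags, using the rank-based identity that the AUC equals the probability that a randomly chosen correct assignment scores higher than a randomly chosen incorrect one. The function name `roc_auc` and the toy scores are assumptions made for illustration; they are not values or code from the study.

```python
def roc_auc(scores, is_correct):
    """Area under the ROC curve via pairwise comparison of scores.

    scores     -- classifier scores, one per protein
    is_correct -- True where the predicted class matched the gold standard
    """
    pos = [s for s, c in zip(scores, is_correct) if c]
    neg = [s for s, c in zip(scores, is_correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect assignment")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correct assignment ranked above incorrect one
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not data from the paper).
scores = [0.9, 0.8, 0.7, 0.6, 0.4]
is_correct = [True, True, False, True, False]
auc = roc_auc(scores, is_correct)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to scores that do not separate correct from incorrect assignments at all, and 1.0 to perfect separation.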

Using our knowledge of the correct/incorrect outcomes of these 183 previously unlabelled proteins we calculated an appropriate threshold at which to classify all unlabelled proteins. Using a FDR of 5% we found assignment thresholds for the SVM (0.85), SVM TL (0.785) and k-NN TL (0.805) to classify the remaining unlabelled proteins. A FDR of 5% was not possible with the k-NN classifier, and the lowest achievable FDR was 15%, which occurred using the strictest threshold of 1 i.e. only when all 5 nearest neighbours agreed. Comparing the classifications made from the single-source classifiers to those made with the transfer learning methods, we found in both cases we get many more assignments using the combined transfer learning approaches compared to the single-source methods using a fixed FDR of 5%, as discussed below.
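The threshold selection can be sketched as follows. This is a minimal illustration, assuming the FDR at a cutoff is the fraction of incorrect assignments among all validated proteins scoring at or above that cutoff, and that the chosen threshold is the lowest cutoff meeting the target FDR; the exact procedure used in the study is not spelled out here. The function name `fdr_threshold` and the toy data are hypothetical.

```python
def fdr_threshold(scores, is_correct, target_fdr=0.05):
    """Return (threshold, fdr) for the lowest cutoff with FDR <= target_fdr.

    Returns None when no cutoff reaches the target FDR.
    """
    best = None
    # Scan candidate cutoffs from strictest to most permissive.
    for t in sorted(set(scores), reverse=True):
        called = [c for s, c in zip(scores, is_correct) if s >= t]
        fdr = called.count(False) / len(called)
        if fdr <= target_fdr:
            best = (t, fdr)  # keep the most permissive cutoff seen so far
    return best


# Toy validated set (hypothetical scores, not data from the paper).
scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.5]
is_correct = [True, True, True, True, False, True]
print(fdr_threshold(scores, is_correct))        # cutoff for a 5% FDR
print(fdr_threshold([1.0, 1.0], [False, True])) # no cutoff reaches 5%
```

The second call mirrors the situation reported for the single-source k-NN classifier, where even the strictest threshold does not reach the target FDR.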

Fig 4 shows the SVM and SVM TL scores assigned to each of the 183 validated proteins. The sub-cellular class is highlighted by solid colours and an un-filled point on the plot represents the case where the two classifiers disagreed on the sub-cellular localisation. We found that the SVM TL classifier gave 70% more high-confidence classifications with the same 5% FDR threshold than the single-source SVM trained on primary data alone. All proteins that were assigned to a sub-cellular niche with a high confidence score in both the SVM and SVM TL (Fig 4, top right grid) were assigned to the same class. We also found that many proteins outside of the high confidence threshold were assigned the same sub-cellular class using both methods, as indicated by the abundance of solid points on the plot. Of the total 722 previously unlabelled proteins we assigned high confidence localisations for 204 proteins using the SVM TL, and 176 proteins using the k-NN TL method, based on a FDR of 5% (Tables A and B in S4 File).

Fig 4. Scatterplot displaying the scores for the SVM and SVM TL classifiers for the 183 proteins validated by the hyperLOPIT mouse map [61].

Each point represents one protein and its associated classifier scores. Filled circles highlight proteins that were assigned the same sub-cellular class with each classifier, empty circles represent the instance when the two classifiers gave different results. The solid lines show the classification boundaries for the two classifiers at a 5% FDR, above which proteins are classified to the highlighted class, below these boundaries proteins are deemed low confidence and thus left unassigned.

https://doi.org/10.1371/journal.pcbi.1004920.g004

New findings.

By way of biological validation we investigated the additional protein assignments made only by the SVM TL method (Fig 4, bottom right grid), focusing on novel assignments to one class, the plasma membrane, and searched the literature for supporting empirical evidence. For example, four proteins (GTR3_MOUSE, SNTB2_MOUSE, PAR6B_MOUSE and ADA17_MOUSE) were assigned to the plasma membrane by the SVM TL method alone (Fig 5) and were also assigned to the plasma membrane in the recent high resolution mouse map [61] (Fig B in S4 File). Dehydroascorbic acid transporter (GTR3_MOUSE) is a multi-pass membrane protein which has been previously shown to be a plasma membrane protein in studies isolating the cell surface glycoprotein in Jurkat cells [62]. Beta-2 syntrophin or syntrophin 3 (SNTB2_MOUSE) is a phosphoprotein with a PDZ domain through which it interacts with ion channels and receptors. There are conflicting reports of the sub-cellular location of this peripheral protein. It associates with dystrophins and has no signal sequence. It is found mostly in muscle fibres and brain [63], but to date its role has not been studied in mouse embryonic stem cells. Given its association with ion channels and receptors, it is perfectly feasible that the steady-state location of this protein in stem cells is the plasma membrane. Partitioning defective 6 homolog beta (PAR6B_MOUSE) is a peripheral membrane protein thought to be in complex with E-cadherin, aPKC, and Par3 at the plasma membrane [64], where it functions to guide GTP-bound Rho small GTPases to atypical protein kinase C proteins [65]. Disintegrin and metalloproteinase domain-containing protein 17 (ADA17_MOUSE) is a single pass plasma membrane protein which functions to cleave the intracellular domain of various plasma membrane proteins including notch and TNF-alpha [66]. It is therefore involved in the upstream events in several signalling pathways.
It has a 17 amino acid N-terminal signal sequence suggestive of its function as a membrane protein. The full list of localisation predictions for all proteins in the mouse dataset can be found in the R data package pRolocdata.

Fig 5. Principal components analysis plot (PCA) of the mouse stem cell dataset.

Proteins are clustered according to their density gradient distributions. Each point on the PCA plot represents one protein. Filled circles are the original protein markers used in classification, hollow circles show new locations as assigned by the SVM TL classifier. The 4 proteins GTR3_MOUSE, SNTB2_MOUSE, PAR6B_MOUSE and ADA17_MOUSE that were found in the SVM TL method and not in an SVM classification with LOPIT only are highlighted.

https://doi.org/10.1371/journal.pcbi.1004920.g005

A comparison of transfer learning algorithms

We compared the macro- and class-F1 scores from all experiments on all 5 datasets used to assess the classifier performance of the k-NN TL and SVM TL methods. We found that no single method systematically outperformed the other, as described in S5 File.

When applying the SVM TL and k-NN TL classifiers to the unlabelled proteins (see biological validation) an analysis of the final assignments (as classified based on a FDR of 5%) showed that the predicted protein localisations were in high agreement. Although there were no protein-organelle assignment mismatches between TL methods we did find a few cases where one TL method would assign a protein to one of the sub-cellular classes but the other TL method did not result in any organelle assignment, due to low classification scores (see Table C in S4 File). Overall, we did not find any contradicting sub-cellular class assignments.

We also compared Wu’s original k-NN algorithm against our TL methods. Wu’s method was significantly outperformed by our k-NN TL method for the mouse (p = 4e−4) and roots dataset (p = 4e−3) and by our SVM TL algorithm for the mouse (p = 7e−13), roots (p = 7e−8), and human (p = 0.004) datasets (see Figs F and G in S5 File).

Discussion

In this study we have presented a flexible transfer learning framework for the integration of heterogeneous data sources for robust supervised machine learning classification. We have demonstrated the biological usage of the framework by applying these methods to the task of protein localisation prediction from MS-based experiments. We further show the flexibility of the framework by applying these methods to five different spatial proteomics datasets, from four different species, in conjunction with four different auxiliary data sources to classify proteins to multiple sub-cellular compartments. We find that the two classifiers, the k-NN TL and SVM TL, perform equally well and, importantly, both outperform a single classifier trained on either data source alone. We further applied the algorithm to a real-life use case, classifying a set of previously unknown proteins in a spatial proteomics experiment on mouse embryonic stem cells, and validated the results using the highest resolution map of the mouse E14TG2a stem cell proteome produced to date [61]. We find that integrating data from a second data source directly into classifier training and classifier creation results in the assignment of proteins to organelles with high generalisation accuracy. Finally, we find that using freely available data from repositories we can improve upon the classification of experimental and condition-specific protein-organelle predictions in an organelle-specific manner.

To our knowledge, no other method has been developed to date that allows the incorporation of an auxiliary data source for the primary task of predicting sub-cellular localisation in spatial proteomics experiments. In this study we have developed methods that not only allow the inclusion of an auxiliary data source in localisation prediction, but we have created a flexible framework allowing the use of many different types of auxiliary information, and furthermore allowing the user complete control over the weighting between data sources and between specific classes. This is extremely important for the analysis of biological data in general, and spatial proteomics data in particular, as many experiments are targeted towards resolving specific biologically relevant aspects (sub-cellular niches in spatial proteomics) and thus users may wish to control the impact of auxiliary information for aspects that have been specially targeted for analysis by the primary experimental method. In this context the setting of weights manually in the k-NN transfer learning classifier allows users complete power to explicitly choose whether to call upon an auxiliary data source or simply use data from their own experiment, on an organelle-by-organelle basis.

The effectiveness of using databases as an auxiliary data source will depend greatly on the abundance and quality of annotation available for the species under investigation. For example, human is a well-studied species and there is a large amount of information available in the Gene Ontology and Human Protein Atlas. Furthermore, some organelles are easier to enrich for, and thus the amount of information available to use as an auxiliary source varies on an organelle-by-organelle basis. The transfer learning methods we present here allow the inclusion of any type of auxiliary data, provided of course there is information available for the proteins under investigation.

The integration of auxiliary data sources is a double-edged sword. On the one hand, it can benefit the primary classification task by (i) reinforcing weak patterns or (ii) complementing the signal in the primary data. On the other hand, when integration is not performed with care it is easy to dilute valuable signals in an expensive experiment by shadowing the uniqueness, and hence the biological relevance, of the experimental primary data, a phenomenon known as negative transfer (see Fig H in S5 File). Thus one needs to be cautious with data integration in general and not overlook the biological relevance of the primary data. Here, we provide a solution to this issue by using transfer learning: the k-NN transfer learning classifier uses optimised class-specific weights so as not to penalise any strong signals in the primary data if no signal is found in the auxiliary data; similarly, the SVM transfer learning method uses optimised data-specific gamma parameters for each data-specific kernel.

The transfer learning framework forms part of the open-source, open-development Bioconductor pRoloc suite of computational methods available for organelle proteomics data analysis. Moreover, as the pipeline utilises the formal Bioconductor classes, different data types, for example from gene expression technologies among others, can be easily used in this framework. The integration of different data sources is one of the major challenges in the data intensive world of computational biology, and here we offer a flexible and powerful solution to unify data obtained from different but complementary techniques.

Materials and Methods

Primary data

Five datasets, from studies on Arabidopsis thaliana [7, 15], Drosophila embryos [17], human embryonic kidney fibroblast cells [20], and mouse pluripotent embryonic stem cells (E14TG2a) [56] were collected using the standard LOPIT approach as described by Sadowski et al. [12]. In the LOPIT protocol, organelles and large protein complexes are separated by iodixanol density gradient ultracentrifugation. Proteins from a set of enriched sub-cellular fractions are then digested and labelled separately with iTRAQ or TMT reagents, pooled, and the relative abundance of the peptides in the different fractions is measured by tandem MS. The number of measurements obtained per gradient occupancy profile (which comprises a set of isotope abundance measurements) is thus dependent on the reagents and LOPIT methodology used.

The first Arabidopsis thaliana dataset [7] on callus cultures employed dual use of four isotopes across eight fractions and thus yielded 8 values per protein profile. The aim of this experiment was to resolve Golgi membrane proteins from other organelles. Gradient-based separation was used to facilitate this, including separating and discarding as much nuclear material as possible during a pre-centrifugation step, and carbonate washing of membrane fractions to remove peripherally associated proteins, thereby maximising the likelihood of assaying less abundant integral membrane proteins from organelles involved in the secretory pathway.

The second Arabidopsis thaliana dataset on whole roots is one of the replicates published by Groen et al. [15], which was set up to identify new markers of the trans-Golgi network (TGN). The TGN is an important protein trafficking hub where proteins from the Golgi are transported to and from the plasma membrane and the vacuole. The dynamics of this organelle are therefore complex which makes it a challenge to identify true residents of this organelle. For each replicate, sucrose gradient fractions were subjected to a carbonate wash to enrich for membrane proteins and four fractions were iTRAQ labelled. Following MS the resultant iTRAQ reporter ion intensities for the four fractions were normalised to six ratios and then each protein’s abundance was further normalised across its six ratios by sum. In Groen’s original experiment the iTRAQ quantitation information for common proteins between the three different gradients were concatenated to increase the resolution of the TGN [23].

The aim of the Drosophila experiment [17] was to apply LOPIT to an organism with heterogeneous cell types. Tan et al. [17] were particularly interested in capturing the plasma membrane proteome (personal communication). There was a pre-centrifugation step to deplete nuclei, but no carbonate washing, thus peripheral and luminal proteins were not removed. In this experiment four isotopes across four distinct fractions were implemented and thus yielded four measurements (features) per protein profile.

The human dataset [20, 67] was a proof-of-concept for the use of LOPIT with an adherent mammalian cell culture. Human embryonic kidney fibroblast cells (HEK293T) were used and LOPIT was employed with 8-plex iTRAQ reagents, thus returning eight values per protein profile within a single labelling experiment. As in the LOPIT experiments in Arabidopsis and Drosophila, the aim was to resolve the multiple sub-cellular niches of post-nuclear membranes, and also the soluble cytosolic protein pool. Nuclei were discarded at an early stage in the fractionation scheme as previously described, and membranes were not carbonate washed in order to retain peripheral membrane and lumenal proteins for analysis.

The E14TG2a embryonic mouse dataset [56] also employed iTRAQ 8-plex labelling, with the aim of cataloguing protein localisation in pluripotent stem cells cultured under conditions favouring self-renewal. In order to achieve maximal coverage of sub-cellular compartments, fractions enriched in nuclei and cytosol were included in the iTRAQ labelling scheme, along with other organelles and large protein complexes as for the previously described datasets. No carbonate wash was performed.

For validation of the predicted localisations made using the transfer learning classifiers on the E14TG2a dataset above, a new high resolution mouse map was used as a gold standard [61]. This high resolution map was generated using hyperplexed LOPIT (hyperLOPIT), a novel technique for robust classification of protein localisation across the whole cell. The method uses an elaborate sub-cellular fractionation scheme, enabled by the use of Tandem Mass Tag (TMT) 10-plex and application of a recently introduced MS data acquisition technique termed synchronous precursor selection MS3 (SPS)-MS3 [68], for high accuracy and precision of TMT quantification. The study used state-of-the-art data analysis techniques [56, 67] combined with stringent manual curation of the data to provide a robust map of the mouse pluripotent embryonic stem cell proteome. The authors also provide a web interface to the data for exploration by the community through a dedicated online R shiny [69] application (https://lgatto.shinyapps.io/christoforou2015).

All datasets are freely distributed as part of the Bioconductor [55] pRolocdata data package [56].

Auxiliary data

The Gene Ontology.

The Gene Ontology (GO) project provides controlled structured vocabulary for the description of biological processes, cellular compartments and molecular functions of gene and gene products across species [4]. For each protein seen in every LOPIT experiment the protein’s associated Gene Ontology (GO) cellular component (CC) namespace terms were retrieved using the pRoloc package [56]. Given all possible GO CC terms associated to the proteins in the experiment we constructed a binary matrix representing the presence/absence of a given term for each protein, for each experiment.
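The construction of the presence/absence matrix can be sketched as below. This is an illustrative Python version only: in the study the GO CC terms were retrieved with the pRoloc package, whereas here the annotations are assumed to be already available as a dictionary. The protein identifiers `protA`, `protB` and `protC` and the function name `presence_absence_matrix` are hypothetical; the GO identifiers are the cellular component terms for nucleus, mitochondrion and plasma membrane.

```python
def presence_absence_matrix(annotations):
    """Build a binary protein x GO-term matrix.

    annotations -- dict mapping protein id -> iterable of GO CC term ids
    Returns (proteins, terms, matrix) with rows ordered as `proteins`
    and columns ordered as `terms`.
    """
    proteins = sorted(annotations)
    terms = sorted({t for ts in annotations.values() for t in ts})
    matrix = []
    for p in proteins:
        present = set(annotations[p])
        matrix.append([1 if t in present else 0 for t in terms])
    return proteins, terms, matrix


# Hypothetical annotations for three proteins.
annotations = {
    "protA": ["GO:0005886"],                # plasma membrane
    "protB": ["GO:0005739", "GO:0005634"],  # mitochondrion, nucleus
    "protC": [],                            # no GO CC annotation
}
proteins, terms, matrix = presence_absence_matrix(annotations)
for p, row in zip(proteins, matrix):
    print(p, row)
```

Proteins without any annotation, such as `protC` here, yield an all-zero row and so carry no auxiliary signal.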

Human Protein Atlas.

The Human Protein Atlas (HPA) [57] (version 13, released on 11/06/2014) was used as an auxiliary source of information to complement the human LOPIT dataset. The sub-cellular HPA provides protein expression patterns on a sub-cellular level using immunofluorescent staining of human U-2 OS cells. We used the hpar Bioconductor package [60] to query the atlas. The data was encoded as a binary matrix describing the localisation of 670 proteins in 18 sub-cellular localisations. In the HPA the reliability of annotated protein expression data is given a status of supportive or uncertain, dependent on similarity to immunostaining patterns and consistency with available experimental gene/protein characterisation data in the UniProtKB database. Here, we only use localisations that have been supportively identified.

YLoc classifier features.

YLoc [58, 59] is an interpretable web server developed by Briesemeister and co-workers for the prediction of protein sub-cellular localisation. The YLoc classifier uses numerous features derived from sequence and annotation. A summary of the features included in the YLoc classifier is shown in Table 1. These features provide a source of complementary auxiliary data for the high quality MS based datasets described above. To use these features as an auxiliary source of information, a large-scale correlation-based feature selection (CFS) approach [70], as described in [58, 59], was used with the markers from the mouse dataset to find the set of the most important features.

Protein-protein interaction data.

The STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) database [54] contains known and predicted protein interactions and quantitatively integrates interaction data from direct (physical) and indirect (functional) associations for a large number of organisms, including human. We have queried the STRING database (version 10) with protein accessions and retrieved the interaction partners of proteins in the human LOPIT data. For each of these proteins, an interaction was recorded and scored using the STRING combined interaction score which was then used to construct an interaction contingency matrix to use as an auxiliary data source. For the 1371 proteins in our human dataset, 520 proteins (99 markers) displayed interactions, which were used in classifier testing.

The creation of the auxiliary datasets is documented and demonstrated using executable code in the pRoloc-transfer-learning vignette.

The designation of a data source as primary or auxiliary is not determined algorithmically by the quality or the size of the data, but rather by the data and question at hand. Here, LOPIT was considered the primary data because it represented the experiment of interest that was to be complemented by the auxiliary data. In fact, from an algorithmic point of view, primary and auxiliary are reciprocal.

Markers

Spatial proteomics relies extensively on reliable sub-cellular protein markers to infer proteome wide localisation. Markers are proteins that are defined as reliable residents and can be used as reference points to identify new members of that sub-cellular niche. Here, marker proteins are selected by domain experts through careful mining of the literature. Markers for each LOPIT experiment were specific to the system under study and conditions of interest and are distributed as part of the Bioconductor [55] pRoloc package [56].

Notation

The primary MS-based experimental datasets P consist of multivariate protein profiles. The auxiliary data A is a presence/absence binary matrix of Gene Ontology Cellular Compartment (GO CC) terms. Data are annotated to either (i) a single known organelle (labelled data), or (ii) have unknown localisation (unlabelled data). Thus we split P and A into labelled (L) and unlabelled (U) sections such that P = (LP, UP) and A = (LA, UA).

The labelled examples for P and A are represented by $L_P = \{(x_l, y_l) \mid l = 1, \ldots, |L_P|\}$ where $x_l \in \mathbb{R}^S$, and $L_A = \{(v_l, y_l) \mid l = 1, \ldots, |L_A|\}$ where $v_l \in \mathbb{R}^T$. Thus each $l$th protein is described by vectors of $S$ and $T$ features (generally, $S \ll T$) for P and A respectively. Each dataset shares a common set of proteins, each annotated to one of the same $y_l \in C = \{1, \ldots, |C|\}$ sub-cellular classes, where $|C|$ is the total number of sub-cellular classes. Unlabelled data, $U_P$ and $U_A$, are represented by $U_P = \{x_u \mid u = 1, \ldots, |U_P|\}$ where $x_u \in \mathbb{R}^S$ and $U_A = \{v_u \mid u = 1, \ldots, |U_A|\}$ where $v_u \in \mathbb{R}^T$, respectively.

The labelled data for the $i$th organelle class, with $N_i$ indicating the number of proteins for the $i$th organelle class, is given for P by $L_P^i = \{(x_l, y_l) \in L_P \mid y_l = i\}$ and for A by $L_A^i = \{(v_l, y_l) \in L_A \mid y_l = i\}$, so that $|L_P^i| = |L_A^i| = N_i$. The labelled dataset of all available proteins over the $|C|$ different sub-cellular classes is given for P by $L_P = \bigcup_{i=1}^{|C|} L_P^i$ and for A by $L_A = \bigcup_{i=1}^{|C|} L_A^i$.

Transfer learning using a k-nearest neighbours framework

We adapt Wu and Dietterich’s [6] classic application of inductive transfer using experimental quantitative proteomics data as the primary source (P) and GO CC terms as the auxiliary source (A). We aim to exploit auxiliary data to improve upon the sub-cellular classification of proteins found in MS-based LOPIT experiments in an organelle-specific way, using the baseline k-nearest neighbours (k-NN) algorithm in a transfer learning framework.

In k-NN classification, an unknown example is classified by a majority vote of its labelled neighbours, with the example being assigned to the class most common among its k nearest neighbours. Independent of the transfer learning classifier we compute the best k for each data source for values k ∈ {3, 5, 7, 9, 11, 13, 15} through an initial 100 rounds of 5-fold cross-validation using each set of labelled training data for P and then independently for A (as implemented in pRoloc). We denote by kP the best k for P, and by kA the best k for A.

Having obtained the best k for each data source, the transfer learning algorithm works as follows. For the $u$th protein $(x_u, v_u)$ we wish to classify in U, we start by finding the $k_P$ and $k_A$ labelled nearest neighbours for $x_u$ and $v_u$ in $L_P$ and $L_A$, respectively. Denote these sets $NN_P(x_u)$ and $NN_A(v_u)$. We then define the vectors $n_P$ and $n_A$ to contain counts for each class in the sets of nearest neighbours; that is, $(n_P)_i = |\{(x, y) \in NN_P(x_u) \mid y = i\}|$ and $(n_A)_i = |\{(v, y) \in NN_A(v_u) \mid y = i\}|$. For each protein, let $p_P$ and $p_A$ be normalized vectors with elements summing to 1 and representing the distribution of classes among the sets of nearest neighbours for each protein. Finally, let $p_P(i)$ and $p_A(i)$ denote the $i$th elements of $p_P$ and $p_A$.

To include both the primary and auxiliary data in the set of potential neighbours we took a weighted combination of the votes in $NN_P(x_u)$ and $NN_A(v_u)$ for each sub-cellular class. Class weights are defined by the parameter vector $\theta^T = (\theta_1, \ldots, \theta_{|C|})$ with values chosen by optimisation through a prior 100 independent rounds of 5-fold cross-validation on a separate training partition of the labelled data. For the $u$th unknown protein $(x_u, v_u)$ in U, the voting scores for each class $i \in C$ are calculated as
$$V(i) = \theta_i \, p_P(i) + (1 - \theta_i) \, p_A(i) \tag{1}$$
and the protein is assigned to the class $c \in C$ maximizing $V(i)$. The class weights $\theta_i$ in Eq 1 control the relative importance of the two types of neighbours for each class $i \in C$. This differs from Wu and Dietterich’s [6] original approach, which weights only the data sources rather than each class within each data source. In this paper we select each class weight $\theta_i$ from a finite set of candidate values in $[0, 1]$; however, the algorithm allows us to use any real-valued $\theta_i \in [0, 1]$. If $\theta_i = 1$, then all weight is given to the primary data in class $i$ and only primary nearest neighbours in class $i$ are considered. Similarly, if $\theta_i = 0$, then all weight is given to the auxiliary data in class $i$ and only auxiliary nearest neighbours in class $i$ are considered. If $0 < \theta_i < 1$ then a combination of neighbours in the primary and auxiliary data sources is considered.
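The class-weighted vote can be sketched in a few lines. This is a minimal illustration, not the pRoloc implementation: it assumes Euclidean distance for both data sources, assumes that the vote for class i is θ_i times the proportion of class i among the primary neighbours plus (1 − θ_i) times the proportion among the auxiliary neighbours, and uses hand-picked weights rather than weights optimised by cross-validation. All function names, class labels and data values are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_distribution(query, labelled, k, classes):
    """Proportion of each class among the k nearest labelled neighbours."""
    nearest = sorted(labelled, key=lambda item: euclidean(query, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {c: counts[c] / len(nearest) for c in classes}


def knn_tl_classify(x_u, v_u, L_P, L_A, k_P, k_A, theta):
    """Combine primary and auxiliary neighbour votes with class weights theta."""
    classes = sorted(theta)
    p_P = class_distribution(x_u, L_P, k_P, classes)  # primary neighbours
    p_A = class_distribution(v_u, L_A, k_A, classes)  # auxiliary neighbours
    votes = {c: theta[c] * p_P[c] + (1 - theta[c]) * p_A[c] for c in classes}
    return max(votes, key=votes.get), votes


# Toy labelled data: (feature vector, class) pairs for the same four markers.
L_P = [((0.0, 0.0), "PM"), ((0.1, 0.0), "PM"), ((1.0, 1.0), "ER"), ((0.9, 1.0), "ER")]
L_A = [((1, 0), "PM"), ((1, 0), "PM"), ((0, 1), "ER"), ((0, 1), "ER")]
x_u, v_u = (0.5, 0.6), (1, 0)  # profile looks ER-like, annotation looks PM-like

primary_only = knn_tl_classify(x_u, v_u, L_P, L_A, 3, 3, {"PM": 1.0, "ER": 1.0})
auxiliary_only = knn_tl_classify(x_u, v_u, L_P, L_A, 3, 3, {"PM": 0.0, "ER": 0.0})
print(primary_only[0], auxiliary_only[0])
```

With all weights at 1 the auxiliary neighbours are ignored and the toy protein follows its profile; with all weights at 0 it follows its annotation, which shows how the weights arbitrate between the two sources.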

Transfer learning using an SVM framework

Linear programming SVMs.

The method is based on the use of the linear programming formulation of the SVM (lpSVM). This formulation promotes classifiers that are sparse, in the sense that where possible only a few parameters obtained through training are non-zero; for a detailed introduction see Mangasarian [72].

We begin by describing the standard lpSVM used for classical two-class classification problems with a single labelled training set. We use the multiple-class version of this approach with the individual primary and auxiliary sets P and A as a comparison later in the paper; we present the method here assuming that the primary set P is being used and can be set up as a binary classification problem; for example, we might wish to predict whether or not a protein should be assigned to a single specified sub-cellular class. For binary classification problems with class labels $y \in \{+1, -1\}$, and given labelled data $L_P = \{(x_l, y_l) \mid l = 1, \ldots, m\}$ where $m = |L_P|$, the classifier takes the form
$$h(x) = \operatorname{sgn}(f(x)) \tag{2}$$
where $f$ is the latent function
$$f(x) = \sum_{l=1}^{m} y_l \, \alpha_l^P \, K_P(x_l, x) + b.$$
Here, $K_P$ is a kernel (Shawe-Taylor and Cristianini [73]) associated with the primary data, and $\alpha^P = (\alpha_1^P, \ldots, \alpha_m^P)^T$ and $b$ are parameters determined by training.

For any vector $x^T = (x_1, \ldots, x_n)$ let $|\cdot|_1$ denote the 1-norm $|x|_1 = \sum_{i=1}^{n} |x_i|$. The training algorithm requires that we solve the linear programme
$$\min_{\alpha^P, \xi, b} \; |\alpha^P|_1 + C |\xi|_1 \tag{3}$$
such that $y_i f(x_i) \geq 1 - \xi_i$ for each $i = 1, \ldots, m$ and $\alpha^P, \xi \geq 0$. The parameters $\xi$ and $C$ act in the same way as the corresponding parameters in the standard SVM: $\xi$ contains the slack variables allowing some examples to be misclassified, and $C$ controls the extent to which such misclassifications are penalized during training.

Note that it is possible for the linear programme to have no solution, although we found this to be extremely rare. When this was the case the classifier reverted to predicting the most common class in the labelled data.

Transfer learning for binary classification.

Once again we adapt the method of Wu and Dietterich [6] to our problem. The original method requires adaptation because it was designed for data that differ from ours in two important respects. First, it does not require examples in the labelled data sets LP and LA to be in correspondence, nor corresponding training examples to share the same label. Second, it assumes that P and A share the same number of features. While the first of these differences is easily dealt with, as our data is a special case that is already covered, the second is more problematic. If we now introduce the labelled auxiliary data LA = {(vl, yl)|l = 1, …, m}, a direct application of the approach in [6] requires us to evaluate kernels of the form K(x, v). As P and A contain data with different numbers of features this presents a problem for any SVM-type method, as kernels are usually required to satisfy the Mercer conditions (Mercer [74]), one of which is that they are symmetric, such that K(x, x′) = K(x′, x). Research on the use of asymmetric kernels has appeared (see for example [75]), but even if we relax this requirement a kernel is essentially a measure of the similarity of its arguments, and the question arises of how one might sensibly measure the similarity of a protein profile with a presence/absence vector of GO CC terms. This problem does not arise with Wu and Dietterich’s data as the two sets they use have the same dimension and are derived in a way that makes measuring similarity straightforward.

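We therefore simplify the original method as follows. We maintain the machinery employed above for the primary data, and introduce a separate kernel $K_A$ and parameter vector $\alpha^A$ for the auxiliary data. A vector to be classified now contains both a protein profile $x$ and a GO vector $v$. The latent function becomes
$$f(x, v) = \sum_{l=1}^{m} y_l \left[ \alpha_l^P K_P(x_l, x) + \alpha_l^A K_A(v_l, v) \right] + b$$
and training requires us to solve the linear programme
$$\min_{\alpha^P, \alpha^A, \xi, b} \; |\alpha^P|_1 + |\alpha^A|_1 + C |\xi|_1 \tag{4}$$
such that $y_i f(x_i, v_i) \geq 1 - \xi_i$ for each $i = 1, \ldots, m$ and $\alpha^P, \alpha^A, \xi \geq 0$.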

Note that this differs from the method of Multiple Kernel Learning (MKL) (Lanckriet et al. [76], Gönen and Alpaydin [77]) in that in MKL the single kernel $K$ is replaced in the usual SVM formulation by a weighted sum of kernels $K = \sum_i d_i K_i$ where $d_i \geq 0$ and $\sum_i d_i = 1$. The $d_i$ are then included with $\alpha$ and $b$ in a more involved constrained optimisation problem. Our approach has the advantages that it remains a straightforward linear programme and in fact introduces fewer constraints on the form of the latent function $f$.

Throughout our experiments we used for $K_P$ and $K_A$ the Gaussian kernel $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, where $\|\cdot\|$ denotes the 2-norm $\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2}$. We optimized over the value of $C$, and also separate values $\gamma_P$ and $\gamma_A$ for the two kernels as described below, with $C$ in the range {0.125, 0.25, 0.5, 1, 2, 4, 8, 16} and $\gamma_P$, $\gamma_A$ in the range {0.01, 0.1, 1, 10, 100, 1000}.
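To make the two-kernel construction concrete, the sketch below evaluates a latent function built from separate Gaussian kernels on the primary and auxiliary data. It assumes the common parameterisation K(x, x′) = exp(−γ‖x − x′‖²) and a latent function that sums, over the training examples, the label times the weighted primary and auxiliary kernel values, plus a bias. The coefficients are hand-picked for illustration and are not obtained by solving the linear programme; the function names and data are hypothetical.

```python
import math


def gaussian_kernel(a, b, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def latent(x, v, train, alpha_P, alpha_A, b, gamma_P, gamma_A):
    """f(x, v) = sum_l y_l [alpha_P[l] K_P(x_l, x) + alpha_A[l] K_A(v_l, v)] + b."""
    total = b
    for (x_l, v_l, y_l), a_p, a_a in zip(train, alpha_P, alpha_A):
        total += y_l * (a_p * gaussian_kernel(x_l, x, gamma_P)
                        + a_a * gaussian_kernel(v_l, v, gamma_A))
    return total


# Two toy training examples: (profile, GO vector, label in {+1, -1}).
train = [((0.0, 0.0), (1, 0), +1), ((1.0, 1.0), (0, 1), -1)]
alpha_P, alpha_A, b = [1.0, 1.0], [0.5, 0.5], 0.0  # illustrative, not trained

f_pos = latent((0.0, 0.0), (1, 0), train, alpha_P, alpha_A, b, 1.0, 1.0)
f_neg = latent((1.0, 1.0), (0, 1), train, alpha_P, alpha_A, b, 1.0, 1.0)
print(f_pos > 0, f_neg < 0)
```

Because each data source has its own kernel and its own gamma, the profile and the GO vector are never compared with each other, which is the point of introducing a separate auxiliary kernel.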

Multiple classes, class imbalance and probabilistic outputs.

As a baseline comparison in our experiments we used a standard SVM as implemented in the package LIBSVM (Chang and Lin [78]). In extending our transfer learning technique to deal with multiple classes and probabilistic outputs we therefore maintained as close a similarity as possible to the methods used by that library.

SVMs and lpSVMs are in their basic form inherently binary classifiers. In order to address multiple-class problems using non-probabilistic outputs such as the one presented here we use the method of Knerr et al. [79]. We train a binary classifier to separate each pair of classes. In order to classify a new example we then take a vote among these binary classifiers, assigning the example to the class with the most votes.
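The pairwise voting scheme can be illustrated as follows. This is a minimal sketch in which each binary classifier is stood in for by a simple threshold rule on a single number; in practice each would be a trained binary SVM or lpSVM. The class names, thresholds and the function name `one_vs_one_predict` are hypothetical, and ties between classes are resolved arbitrarily here (by first occurrence).

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(example, classes, binary_classifiers):
    """Classify by majority vote over one binary classifier per class pair.

    binary_classifiers maps a pair (a, b) to a function returning a or b.
    """
    votes = Counter(
        binary_classifiers[pair](example) for pair in combinations(classes, 2)
    )
    return votes.most_common(1)[0][0]


classes = ["ER", "Golgi", "PM"]
# Stand-in binary classifiers: threshold rules on a one-dimensional input.
binary_classifiers = {
    ("ER", "Golgi"): lambda x: "ER" if x < 1 else "Golgi",
    ("ER", "PM"): lambda x: "ER" if x < 2 else "PM",
    ("Golgi", "PM"): lambda x: "Golgi" if x < 3 else "PM",
}

print(one_vs_one_predict(1.5, classes, binary_classifiers))
print(one_vs_one_predict(5.0, classes, binary_classifiers))
```

For |C| classes this scheme trains |C|(|C| − 1)/2 binary classifiers, three in the toy example.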

As we typically have several sub-cellular classes the binary classification problems used in constructing the multiple-class classifier are inherently unbalanced. We adjust for this using the method of Morik et al. [80]. In each binary problem let $n_+$ denote the number of positive examples and $n_-$ the number of negative examples. In the linear programme objective functions (Eqs 3 and 4) we replace the single value for $C$ with adjusted values $C_+$ and $C_-$ for the positive and negative examples respectively. Let $S_+$ denote the set of indices of the positive examples and $S_-$ the set of indices for the negative examples. The term $C|\xi|_1$ in Eqs 3 and 4 becomes
$$C_+ \sum_{i \in S_+} \xi_i + C_- \sum_{i \in S_-} \xi_i.$$
Finally, we prefer to employ probabilistic outputs rather than simply thresholding as in Eq 2. Once again we employ the same techniques as LIBSVM. The method for binary classifiers is presented by Platt [81] and Lin et al. [82], and for multiple-class classifiers by Wu et al. [6].

Assessing classifier generalisation accuracy

In order to evaluate the generalisation accuracy of each transfer learning classifier we employed the following schema in all experiments. A set of LOPIT profiles labelled with known markers, and their counterpart auxiliary GO CC profiles, were separated at random into training (80%) and test (20%) partitions. The split was stratified, such that the relative proportions of each class in each of the two sets matched that of the complete set of data. The test profiles were withheld from classifier training and employed to test the generalisation accuracy of the trained classifiers. On each 80% training partition 5-fold stratified cross-validation was conducted to test all free parameters via a grid search and select the best set of parameters for each classifier. In each experiment, for each dataset, this process of 80/20% stratified splitting, training with 5-fold stratified cross-validation on the 80% and testing on the 20% was repeated 100 times in order to produce 100 sets of macro F1 scores and class-specific F1 scores.

The F1 score (He [83]) is a common measure used to assess classifier performance. It is the harmonic mean of precision and recall, where $\text{precision} = \frac{tp}{tp + fp}$ and $\text{recall} = \frac{tp}{tp + fn}$, and tp denotes the number of true positives, fp the number of false positives, and fn the number of false negatives. Thus $F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$. A high macro F1 score indicates that the marker proteins in the test data set are consistently correctly assigned by the algorithm.
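The evaluation schema above can be illustrated with a small, self-contained Python sketch of a stratified 80/20 split and the F1 computation. The helper names are hypothetical, the rounding of per-class test sizes is an assumption, and the macro F1 is taken here to be the unweighted mean of the class-specific F1 scores (the usual definition); none of this is the pRoloc code.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # assumed rounding rule
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

def f1_scores(y_true, y_pred):
    """Return ({class: F1}, macro F1) from true and predicted labels."""
    per_class = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_class[c] = 2 * precision * recall / denom if denom else 0.0
    return per_class, sum(per_class.values()) / len(per_class)

labels = ["a"] * 10 + ["b"] * 5
train_idx, test_idx = stratified_split(labels)
per_class, macro = f1_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Repeating the split with 100 different seeds, tuning parameters on each training partition and scoring on the matching test partition would yield the 100 sets of scores described in the text.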

To assess whether incorporating an auxiliary data source into classifier training and classifier creation was better than using primary or auxiliary data alone, we conducted three independent experiments for each data source and for each transfer learning method. We used the above schema to assess the generalisation accuracy of using (1) the transfer learning k-Nearest Neighbours (k-NN) classifier, (2) the primary LOPIT data alone, using a baseline k-NN, (3) the auxiliary GO CC data alone, using a baseline k-NN. We repeated this for the lpSVM transfer learning classifier and used a standard SVM with an RBF kernel for single data source experiments. Using these experiments we were able to compare using a simple k-NN versus the transfer learning k-NN, and also the use of a standard SVM versus the combined transfer learning lpSVM approach.

A two-sample two-tailed t-test, assuming unequal variance, was used to assess whether, over the 100 test partitions, the estimated generalisation performance using the optimised class-specific fusion approach was better than using either primary data alone, or auxiliary data alone. A threshold of 0.01 was used in all t-tests to determine significance.
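For readers unfamiliar with the unequal-variance (Welch) form of the t-test, the following Python sketch computes the test statistic and the Welch-Satterthwaite degrees of freedom from two lists of scores. The function name is hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable when the degrees of freedom are large (as with 100 partitions per group); a statistics library would normally be used for an exact p-value.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t_test(a, b):
    """Return (t, df, approx_p) for a two-sample, two-tailed Welch test."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    # Normal approximation to the t distribution (assumes large df).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p

# Toy macro F1 scores for two classifiers (illustrative values only).
fused = [0.90, 0.92, 0.91, 0.93, 0.89]
primary_only = [0.80, 0.82, 0.81, 0.83, 0.79]
t_stat, dof, p_value = welch_t_test(fused, primary_only)
significant = p_value < 0.01
```

The final comparison mirrors the 0.01 significance threshold quoted above.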

Optimised parameters for the mouse pluripotent embryonic stem cell data.

To classify the 722 unlabelled proteins in the E14TG2a mouse stem cell dataset we performed 100 rounds of stratified 5-fold cross validation on the training partition as detailed above. The best parameters were found to be k = 5 for the k-NN classifier and for the k-NN TL classifier kP = 5, kA = 5 and the best class weights were found to be for the 40S ribosome, 60S ribosome, cytosol, endoplasmic reticulum, lysosome, mitochondria, nucleus—chromatin, nucleus—non-chromatin, plasma membrane and proteasome, respectively. For the SVM classifier we found the best parameters to be C = 16 and γ = 10. For the SVM TL classifier we found C = 16, γP = 1 and γA = 0.1. Using these parameters with their associated algorithms we classified the 722 unlabelled proteins in the dataset and obtained a classifier score for each protein.

Supporting Information

S1 File. Supporting figures for the k-NN transfer learning experiments.

Visualisations of the k-NN transfer learning results for the human, plant callus, plant roots and fly datasets. These include bubble plots displaying the distribution of the optimised class weights, principal components analysis plots of the LOPIT primary datasets, and boxplots displaying the estimated generalisation performance of each classifier.

https://doi.org/10.1371/journal.pcbi.1004920.s001

(PDF)

S2 File. Supporting figures for the SVM transfer learning experiments.

Boxplots displaying the estimated generalisation performance of the SVM transfer learning algorithm applied to the mouse, human, plant callus, plant roots and fly datasets.

https://doi.org/10.1371/journal.pcbi.1004920.s002

(PDF)

S3 File. Supporting figures for other data sources.

Macro- and class-specific results for the k-NN transfer learning algorithm used with the auxiliary Human Protein Atlas dataset, a YLoc sequence and annotation features auxiliary dataset and a protein-protein interactions dataset.

https://doi.org/10.1371/journal.pcbi.1004920.s003

(PDF)

S4 File. Additional figure and tables for biological application.

Boxplots displaying the distribution of classification scores assigned to the unknown proteins in the mouse dataset for each of the 4 classifiers. Principal components analysis plot displaying the protein classification results from applying the k-NN transfer learning algorithm on the unlabelled data in the mouse dataset. Accompanying tables showing the number of sub-cellular assignments of the unlabelled proteins amongst the 10 known sub-cellular classes in the data from applying each transfer learning method. the mouse stem cell dataset highlighting the new localisations found by the k-NN transfer learning method.

https://doi.org/10.1371/journal.pcbi.1004920.s004

(PDF)

S5 File. A comparison of transfer learning methods.

A short comparison between the k-NN transfer learning (TL) and SVM TL classifiers, including boxplots displaying the macro- and class-specific F1 scores for the k-NN TL and SVM TL experiments over the 100 test partitions on each dataset. This supplementary file also includes a comparison with Wu’s original k-NN transfer learning classifier and a description of negative transfer effects.

https://doi.org/10.1371/journal.pcbi.1004920.s005

(PDF)

S6 File. Dataset summary statistics.

Tables displaying the dimensions for each primary and auxiliary dataset, including the total number of proteins identified in each LOPIT dataset and number of known markers of sub-cellular protein location.

https://doi.org/10.1371/journal.pcbi.1004920.s006

(PDF)

Acknowledgments

The authors would also like to thank Dr. Dureid El-Moghraby from the High Performance Computing Service and Dr. Jenny Barna from Bioinformatics and Computational Biology, University of Cambridge, for their support. This work was performed using the Darwin Supercomputer of the University of Cambridge High Performance Computing Service, provided by Dell Inc. using Strategic Research Infrastructure Funding from the Higher Education Funding Council for England and funding from the Science and Technology Facilities Council.

Author Contributions

Conceived and designed the experiments: LMB SBH LG. Performed the experiments: LMB SBH LG. Analyzed the data: LMB SBH LG KSL. Contributed reagents/materials/analysis tools: DW OK. Wrote the paper: LMB SBH DW CMM AC AG OK KSL LG. Contributed data: DW CMM AC MWBT AG OK.

References

  1. Hey T, Tansley S, Tolle K. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft; 2009.
  2. Spreafico R, Mitchell S, Hoffmann A. Training the 21st Century Immunologist. Trends in Immunology. 2015;36(5):283–5. Available from: http://www.sciencedirect.com/science/article/pii/S1471490615000770. pmid:25911462
  3. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015 Jan;43(Database issue):D204–12. pmid:25348405
  4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000 May;25(1):25–9. pmid:10802651
  5. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet. 2015 Jun;16(6):321–32. pmid:25948244
  6. Wu P, Dietterich TG. Improving SVM Accuracy by Training on Auxiliary Data Sources. In: Proceedings of the 21st International Conference on Machine Learning (ICML); 2004.
  7. Dunkley TPJ, Hester S, Shadforth IP, Runions J, Weimar T, Hanton SL, et al. Mapping the Arabidopsis organelle proteome. Proc Natl Acad Sci USA. 2006 Apr;103(17):6518–6523. Available from: http://dx.doi.org/10.1073/pnas.0506958103 pmid:16618929.
  8. Foster LJ, Hoog CLd, Zhang Y, Zhang Y, Xie X, Mootha VK, et al. A mammalian organelle map by protein correlation profiling. Cell. 2006 Apr;125(1):187–199. Available from: http://dx.doi.org/10.1016/j.cell.2006.03.022 pmid:16615899.
  9. Fagerberg L, Strömberg S, El-Obeid A, Gry M, Nilsson K, Uhlen M, et al. Large-scale protein profiling in human cell lines using antibody-based proteomics. J Proteome Res. 2011 Sep;10(9):4066–75. pmid:21726073
  10. De Duve C, Beaufay H. A short history of tissue fractionation. J Cell Biol. 1981 Dec;91(3 Pt 2):293s–299s pmid:7033241.
  11. Gatto L, Breckels LM, Burger T, Nightingale DJ, Groen AJ, Campbell C, et al. A foundation for reliable spatial proteomics data analysis. Mol Cell Proteomics. 2014;13(8):1937–52. pmid:24846987
  12. Sadowski PG, Dunkley TPJ, Shadforth IP, Dupree P, Bessant C, Griffin JL, et al. Quantitative proteomic approach to study subcellular localization of membrane proteins. Nat Protoc. 2006;1(4):1778–1789. Available from: http://dx.doi.org/10.1038/nprot.2006.254 pmid:17487160.
  13. Sadowski PG, Groen AJ, Dupree P, Lilley KS. Sub-cellular localization of membrane proteins. Proteomics. 2008 Oct;8(19):3991–4011. pmid:18780351
  14. Nikolovski N, Rubtsov D, Segura MP, Miles GP, Stevens TJ, Dunkley TP, et al. Putative glycosyltransferases and other plant golgi apparatus proteins are revealed by LOPIT proteomics. Plant Physiol. 2012 Oct;160(2):1037–51. pmid:22923678
  15. Groen AJ, Sancho-Andrés G, Breckels LM, Gatto L, Aniento F, Lilley KS. Identification of trans-golgi network proteins in Arabidopsis thaliana root tissue. J Proteome Res. 2014 Feb;13(2):763–76. pmid:24344820
  16. Tomizioli M, Lazar C, Brugière S, Burger T, Salvi D, Gatto L, et al. Deciphering thylakoid sub-compartments using a mass spectrometry-based approach. Mol Cell Proteomics. 2014 Aug;13(8):2147–67. pmid:24872594
  17. Tan DJ, Dvinge H, Christoforou A, Bertone P, Martinez AA, Lilley KS. Mapping organelle proteins and protein complexes in Drosophila melanogaster. J Proteome Res. 2009 Jun;8(6):2667–78. pmid:19317464
  18. Harner M, Körner C, Walther D, Mokranjac D, Kaesmacher J, Welsch U, et al. The mitochondrial contact site complex, a determinant of mitochondrial architecture. EMBO J. 2011 Nov;30(21):4356–70. pmid:22009199
  19. Andersen JS, Wilkinson CJ, Mayor T, Mortensen P, Nigg EA, Mann M. Proteomic characterization of the human centrosome by protein correlation profiling. Nature. 2003 Dec;426(6966):570–574. Available from: http://dx.doi.org/10.1038/nature02166 pmid:14654843.
  20. Christoforou A, Mulvey C, Breckels LM, Gatto L, Lilley KS. Spatial Proteomics: Practical Considerations for Data Acquisition and Analysis in Protein Subcellular Localisation Studies. Quantitative Proteomics. 2014;(1):187.
  21. Wiese S, Gronemeyer T, Ofman R, Kunze M, Grou CP, Almeida JA, et al. Proteomics characterization of mouse kidney peroxisomes by tandem mass spectrometry and protein correlation profiling. Mol Cell Proteomics. 2007 Dec;6(12):2045–57. pmid:17768142
  22. Hall SL, Hester S, Griffin JL, Lilley KS, Jackson AP. The organelle proteome of the DT40 lymphocyte cell line. Mol Cell Proteomics. 2009 Jun;8(6):1295–1305. Available from: http://dx.doi.org/10.1074/mcp.M800394-MCP200 pmid:19181659.
  23. Trotter MWB, Sadowski PG, Dunkley TPJ, Groen AJ, Lilley KS. Improved sub-cellular resolution via simultaneous analysis of organelle proteomics data across varied experimental conditions. Proteomics. 2010;10(23):4213–4219. Available from: http://dx.doi.org/10.1002/pmic.201000359 pmid:21058340.
  24. Ohta S, Bukowski-Wills JC, Sanchez-Pulido L, Alves FL, Wood L, Chen ZA, et al. The protein composition of mitotic chromosomes determined using multiclassifier combinatorial proteomics. Cell. 2010 Sep;142(5):810–21. pmid:20813266
  25. Tardif M, Atteia A, Specht M, Cogne G, Rolland N, Brugière S, et al. PredAlgo: a new subcellular localization prediction tool dedicated to green algae. Mol Biol Evol. 2012 Dec;29(12):3625–39. pmid:22826458
  26. Cai YD, Liu XJ, Xu XB, Chou KC. Support vector machines for prediction of protein subcellular location. Mol Cell Biol Res Commun. 2000 Oct;4(4):230–3. pmid:11409917
  27. Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001 May;43(3):246–55. pmid:11288174
  28. Li FM, Li QZ. Predicting protein subcellular location using Chou’s pseudo amino acid composition and improved hybrid approach. Protein Pept Lett. 2008;15(6):612–6. pmid:18680458
  29. Lin H, Ding H, Guo FB, Zhang AY, Huang J. Predicting subcellular localization of mycobacterial proteins by using Chou’s pseudo amino acid composition. Protein Pept Lett. 2008;15(7):739–44. pmid:18782071
  30. Nanni L, Brahnam S, Lumini A. High performance set of PseAAC and sequence based descriptors for protein classification. J Theor Biol. 2010 Sep;266(1):1–10. pmid:20558184
  31. Lin J, Wang Y. Using a novel AdaBoost algorithm and Chou’s Pseudo amino acid composition for predicting protein subcellular localization. Protein Pept Lett. 2011 Dec;18(12):1219–25. pmid:21728988
  32. Mer AS, Andrade-Navarro MA. A novel approach for protein subcellular location prediction using amino acid exposure. BMC Bioinformatics. 2013;14:342. pmid:24283794
  33. Nakai K, Horton P. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci. 1999 Jan;24(1):34–6. pmid:10087920
  34. Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001 Aug;17(8):721–8. pmid:11524373
  35. Gardy JL, Spencer C, Wang K, Ester M, Tusnády GE, Simon I, et al. PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res. 2003 Jul;31(13):3613–7. pmid:12824378
  36. Matsuda S, Vert JP, Saigo H, Ueda N, Toh H, Akutsu T. A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci. 2005 Nov;14(11):2804–13. pmid:16251364
  37. Pierleoni A, Martelli PL, Fariselli P, Casadio R. BaCelLo: a balanced subcellular localization predictor. Bioinformatics. 2006 Jul;22(14):e408–16. pmid:16873501
  38. Horton P, Park KJ, Obayashi T, Fujita N, Harada H, Adams-Collier CJ, et al. WoLF PSORT: protein localization predictor. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W585–7. pmid:17517783
  39. Emanuelsson O, Brunak S, von Heijne G, Nielsen H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc. 2007;2(4):953–71. pmid:17446895
  40. Rastogi S, Rost B. Bioinformatics predictions of localization and targeting. Methods Mol Biol. 2010;619:285–305. pmid:20419417
  41. Wang K, Hu LL, Shi XH, Dong YS, Li HP, Wen TQ. PSCL: predicting protein subcellular localization based on optimal functional domains. Protein Pept Lett. 2012 Jan;19(1):15–22. pmid:21919864
  42. Arango-Argoty GA, Ruiz-Muñoz JF, Jaramillo-Garzón JA, Castellanos-Domínguez CG. An adaptation of Pfam profiles to predict protein sub-cellular localization in Gram positive bacteria. Conf Proc IEEE Eng Med Biol Soc. 2012;2012:5554–7. pmid:23367187
  43. Hu LL, Feng KY, Cai YD, Chou KC. Using protein-protein interaction network information to predict the subcellular locations of proteins in budding yeast. Protein Pept Lett. 2012 Jun;19(6):644–51. pmid:22519536
  44. Jiang JQ, Wu M. Predicting multiplex subcellular localization of proteins using protein-protein interaction network: a comparative study. BMC Bioinformatics. 2012;13 Suppl 10:S20. pmid:22759426
  45. Du P, Wang L. Predicting human protein subcellular locations by the ensemble of multiple predictors via protein-protein interaction network with edge clustering coefficients. PLoS One. 2014;9(1):e86879. pmid:24466278
  46. Huang WL, Tung CW, Ho SW, Hwang SF, Ho SY. ProLoc-GO: utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization. BMC Bioinformatics. 2008;9:80. pmid:18241343
  47. Mei S, Fei W, Zhou S. Gene ontology based transfer learning for protein subcellular localization. BMC Bioinformatics. 2011;12:44. pmid:21284890
  48. Mei S. Multi-kernel transfer learning based on Chou’s PseAAC formulation for protein submitochondria localization. J Theor Biol. 2012 Jan;293:121–30. pmid:22037046
  49. Mei S. Multi-label multi-kernel transfer learning for human protein subcellular localization. PLoS One. 2012;7(6):e37716. pmid:22719847
  50. Du P, Li T, Wang X. Recent progress in predicting protein sub-subcellular locations. Expert Rev Proteomics. 2011 Jun;8(3):391–404. pmid:21679119
  51. Xiao X, Lin WZ, Chou KC. Recent advances in predicting protein classification and their applications to drug development. Curr Top Med Chem. 2013;13(14):1622–35. pmid:23889055
  52. Tiwari AK, Srivastava R. A survey of computational intelligence techniques in protein function prediction. Int J Proteomics. 2014;2014:845479. pmid:25574395
  53. Rosenstein MT, Marx Z, Kaelbling LP, Dietterich TG. Transfer or Not To Transfer. In: NIPS-05 Workshop on Inductive Transfer: 10 Years Later; 2005.
  54. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015 Jan;43(Database issue):D447–52. pmid:25352553
  55. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5(10):R80. Available from: http://dx.doi.org/10.1186/gb-2004-5-10-r80 pmid:15461798.
  56. Gatto L, Breckels LM, Wieczorek S, Burger M, Lilley KS. Mass-spectrometry based spatial proteomics data analysis using pRoloc and pRolocdata. Bioinformatics. 2014;30(9):1322–1324.
  57. Uhlen M, Oksvold P, Fagerberg L, Lundberg E, Jonasson K, Forsberg M, et al. Towards a knowledge-based Human Protein Atlas. Nat Biotechnol. 2010 Dec;28(12):1248–50. pmid:21139605
  58. Briesemeister S, Rahnenführer J, Kohlbacher O. YLoc–an interpretable web server for predicting subcellular localization. Nucleic Acids Res. 2010 Jul;38(Web Server issue):W497–502. pmid:20507917
  59. Briesemeister S, Rahnenführer J, Kohlbacher O. Going from where to why–interpretable prediction of protein subcellular localization. Bioinformatics. 2010 May;26(9):1232–8. pmid:20299325
  60. Gatto L. hpar: Human Protein Atlas in R. http://www.bioconductor.org/packages/release/bioc/html/hpar.html. R package version 1.4.0.
  61. Christoforou A, Mulvey CM, Breckels LM, Hayward PC, Geladaki E, Hurrell T, et al. A draft map of the mouse pluripotent stem cell spatial proteome. Nat Commun. 2016 Jan;7:9992. Available from: http://dx.doi.org/10.1038/ncomms9992 pmid:26754106.
  62. Wollscheid B, Bausch-Fluck D, Henderson C, O’Brien R, Bibel M, Schiess R, et al. Mass-spectrometric identification and relative quantification of N-linked cell surface glycoproteins. Nat Biotechnol. 2009 Apr;27(4):378–86. pmid:19349973
  63. Gee SH, Madhavan R, Levinson SR, Caldwell JH, Sealock R, Froehner SC. Interaction of muscle and brain sodium channels with multiple members of the syntrophin family of dystrophin-associated proteins. J Neurosci. 1998 Jan;18(1):128–37. pmid:9412493
  64. Joberty G, Petersen C, Gao L, Macara IG. The cell-polarity protein Par6 links Par3 and atypical protein kinase C to Cdc42. Nat Cell Biol. 2000 Aug;2(8):531–9. pmid:10934474
  65. Garrard SM, Capaldo CT, Gao L, Rosen MK, Macara IG, Tomchick DR. Structure of Cdc42 in a complex with the GTPase-binding domain of the cell polarity protein, Par6. EMBO J. 2003 Mar;22(5):1125–33. pmid:12606577
  66. Brou C, Logeat F, Gupta N, Bessia C, LeBail O, Doedens JR, et al. A novel proteolytic cleavage involved in Notch signaling: the role of the disintegrin-metalloprotease TACE. Mol Cell. 2000 Feb;5(2):207–16. pmid:10882063
  67. Breckels LM, Gatto L, Christoforou A, Groen AJ, Lilley KS, Trotter MW. The effect of organelle discovery upon sub-cellular protein localisation. J Proteomics. 2013 Aug;88:129–40. pmid:23523639
  68. McAlister GC, Nusinow DP, Jedrychowski MP, Wühr M, Huttlin EL, Erickson BK, et al. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Anal Chem. 2014 Jul;86(14):7150–8. pmid:24927332
  69. Chang W, Cheng J, Allaire J, Xie Y, McPherson J. shiny: Web Application Framework for R; 2015. R package version 0.11.1. Available from: http://CRAN.R-project.org/package=shiny.
  70. Hall MA. Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29–July 2, 2000; 2000. p. 359–366.
  71. Sigrist CJA, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, et al. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 2010 Jan;38(Database issue):D161–6. pmid:19858104
  72. Mangasarian OL. Generalized Support Vector Machines. In: Smola AJ, Bartlett P, Schölkopf B, Schuurmans D, editors. Advances in Large Margin Classifiers. MIT Press; 2000. p. 135–146.
  73. Shawe-Taylor J, Cristianini N. Kernel Methods for Pattern Analysis. Cambridge University Press; 2004.
  74. Mercer J. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A. 1909;209:441–458.
  75. Wu W, Xu J, Li H, Oyama S. Asymmetric Kernel Learning. Microsoft Research; 2010. MSR-TR-2010-85.
  76. Lanckriet GRG, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004;20(16):2626–2635. pmid:15130933
  77. Gönen M, Alpaydin E. Multiple kernel learning algorithms. Journal of Machine Learning Research. 2011;12:2211–2268.
  78. Chang CC, Lin CJ. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology. 2011 Apr;2(3).
  79. Knerr S, Personnaz L, Dreyfus G. Single-layer learning revisited: a stepwise procedure for building and training a neural network. Neurocomputing: Algorithms, Architectures and Applications. 1990.
  80. Morik K, Brockhausen P, Joachims T. Combining statistical learning with a knowledge-based approach: a case study in intensive care monitoring. In: Proceedings of the International Conference on Machine Learning (ICML); 1999. p. 268–277.
  81. Platt JC. Probabilities for SV Machines. In: Smola AJ, Bartlett P, Schölkopf B, Schuurmans D, editors. Advances in Large Margin Classifiers. MIT Press; 2000. p. 61–74.
  82. Lin HT, Lin CJ, Weng RC. A note on Platt’s probabilistic outputs for support vector machines. Machine Learning. 2007;68:267–276.
  83. He H. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering. 2009;21(9):1263–1284.