A Multi-Label Classifier for Predicting the Subcellular Localization of Gram-Negative Bacterial Proteins with Both Single and Multiple Sites

Xuan Xiao; Zhi-Cheng Wu; Kuo-Chen Chou

doi:10.1371/journal.pone.0020592

Abstract

Prediction of protein subcellular localization is a challenging problem, particularly when the system concerned contains both singleplex and multiplex proteins. In this paper, by introducing the “multi-label scale” and hybridizing the information of gene ontology with the sequential evolution information, a novel predictor called iLoc-Gneg is developed for predicting the subcellular localization of Gram-negative bacterial proteins with both single-location and multiple-location sites. To facilitate comparison, the same stringent benchmark dataset used to estimate the accuracy of Gneg-mPLoc was adopted to demonstrate the power of iLoc-Gneg. The dataset contains 1,392 Gram-negative bacterial proteins classified into the following eight locations: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. Of the 1,392 proteins, 1,328 each have only one subcellular location and the other 64 each have two subcellular locations, but none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset (subcellular location). The overall success rate achieved by iLoc-Gneg in the jackknife test on this stringent benchmark dataset was over 91%, which is about 6% higher than that achieved by Gneg-mPLoc. As a user-friendly web-server, iLoc-Gneg is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iLoc-Gneg. A step-by-step guide is also provided on how to use the web-server to get the desired results. Furthermore, for the user's convenience, the iLoc-Gneg web-server accepts batch job submissions, a function that is not available in the existing version of the Gneg-mPLoc web-server. It is anticipated that iLoc-Gneg may become a useful high-throughput tool for Molecular Cell Biology, Proteomics, System Biology, and Drug Development.

Citation: Xiao X, Wu Z-C, Chou K-C (2011) A Multi-Label Classifier for Predicting the Subcellular Localization of Gram-Negative Bacterial Proteins with Both Single and Multiple Sites. PLoS ONE 6(6): e20592. https://doi.org/10.1371/journal.pone.0020592

Editor: Franca Fraternali, King's College London, United Kingdom

Received: February 26, 2011; Accepted: May 4, 2011; Published: June 17, 2011

Copyright: © 2011 Xiao et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by grants from the National Natural Science Foundation of China (No. 60961003), the Key Project of Chinese Ministry of Education (No. 210116), the Province National Natural Science Foundation of JiangXi (2009GZS0064 and 2010GZS0122), the Department of Education of Jiang-Xi Province (No. GJJ09271), and the plan for training youth scientists (stars of Jing-Gang) of Jiangxi Province. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Bacteria can be divided into two groups: Gram-positive and Gram-negative. Gram-positive bacteria are those that are stained dark blue or violet by Gram staining; while Gram-negative bacteria cannot retain the stain, instead taking up the counter-stain and appearing red or pink.

Studying bacteria is important for both basic research and drug design because (1) they are the workhorses of molecular biology, biochemistry, and genetics, owing to their ability to grow quickly and the relative ease with which they can be manipulated, and (2) they are both harmful and useful. With the explosion of protein sequences generated in the post-genomic era, we are challenged to develop computational methods for timely and accurately identifying the subcellular locations of newly discovered bacterial proteins based on their sequence information alone, because this kind of knowledge will be very useful for selecting proper bacterial proteins for a special target, or for screening and prioritizing candidates in drug design.

Numerous predictors have been developed for identifying the subcellular localization of proteins in various organisms (see [1], [2] as well as the long list of references cited in the two review papers). However, only a few are specialized for dealing with Gram-negative proteins. They are called “PSORT” [1], [3], [4], “PSORT-B” [5], and PSORTb v.2.0 [6]. All these methods have played important roles in stimulating the development of this area. To improve the prediction coverage scope and the quality of benchmark datasets, the predictor called Gneg-PLoc [7] was developed. Compared with the previous methods, Gneg-PLoc extended the coverage scope from five to eight subcellular location sites. The benchmark datasets used to train and test the predictor were also significantly refined. For instance, the benchmark datasets used in PSORT-B [5] contain many proteins with pairwise sequence identity higher than 90%, while in the benchmark datasets of Gneg-PLoc [7] none of the proteins included has ≥25% pairwise sequence identity to any other in the same subcellular location; i.e., the latter is much more stringent and rigorous than the former in excluding homology bias and redundancy. In addition, Gneg-PLoc was able to yield higher success rates.

However, none of the aforementioned predictors can be used to deal with multiplex proteins that may simultaneously exist at, or move between, two or more different subcellular locations. Proteins with multiple locations or dynamic features of this kind are particularly interesting because they may have some very special biological functions intriguing to investigators in both basic research and drug discovery [8], [9]. In particular, as pointed out by Millar et al. [10], recent evidence has indicated that an increasing number of proteins have multiple locations in the cell.

To enable Gneg-PLoc [7] to deal with multiplex Gram-negative proteins as well, a predictor called Gneg-mPLoc [11] was developed recently, where the character “m” in front of “PLoc” stands for “multiple”, meaning that it can also be used to deal with Gram-negative bacterial proteins with multiple locations.

However, Gneg-mPLoc has the following shortcomings. (1) In predicting the number of subcellular location sites for a query Gram-negative protein, an optimal threshold factor (see Eq.48 of [2]) was adopted without providing its statistical implication and detailed learning process. It would be more instructive to determine this number with a more intuitive and natural approach. (2) In formulating the protein samples, only the integer numbers 0 and 1 were used to reflect the GO (gene ontology) information [12], [13]. Such an over-simplified formulation might lose some useful information and thereby limit the prediction quality. (3) Although a web-server for Gneg-mPLoc has been established at http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/, only one query protein sequence at a time is allowed when using the web-server to conduct prediction. For the convenience of users handling many query Gram-negative protein sequences, such a rigid limit should be relaxed.

The present study was dedicated to developing a new and more powerful predictor, called iLoc-Gneg, for predicting Gram-negative bacterial protein subcellular localization by addressing the above three problems.

To establish a really useful statistical predictor for a protein system, we usually need to consider the following procedures [14]: (1) select or construct a valid benchmark dataset to train and test the predictor; (2) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (5) establish a user-friendly web-server [15] for the predictor that is accessible to the public. Below, let us describe how to realize these steps one by one.

Materials and Methods

Here, we choose to use the same dataset $\mathbb{S}$ used in establishing Gneg-mPLoc [11] as the benchmark dataset for the current study. The reasons for doing so are as follows. (1) The dataset was constructed specifically for Gram-negative bacterial proteins and covers 8 subcellular location sites; compared with other datasets such as the one in PSORTb v.2.0 [6], which covered only 5 subcellular locations, the coverage scope of the dataset from [11] is much wider. (2) None of the proteins included in $\mathbb{S}$ has ≥25% pairwise sequence identity to any other in the same subcellular location; compared with most of the other benchmark datasets in this area, the dataset is much more rigorous in excluding homology bias and redundancy. (3) It contains both singleplex and multiplex proteins and hence can be used to train and test a predictor aimed at dealing with proteins with both single and multiple location sites. (4) Using the dataset will also make it easier to compare the new predictor with the existing one because the results obtained by Gneg-mPLoc on $\mathbb{S}$ have been well documented and reported [11].

The dataset $\mathbb{S}$ contains 1,392 Gram-negative bacterial protein sequences, of which 1,328 belong to one subcellular location, 64 to two locations, and none to three or more locations. The dataset covers 8 subcellular locations (Fig. 1), as can be formulated by

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \mathbb{S}_3 \cup \cdots \cup \mathbb{S}_8 \tag{1}$$

where $\mathbb{S}_1$ represents the subset for the subcellular location of cell inner membrane, $\mathbb{S}_2$ for cell outer membrane, $\mathbb{S}_3$ for cytoplasm, $\mathbb{S}_4$ for extracellular, and so forth (Table 1); while $\cup$ represents the symbol for “union” in the set theory. To avoid homology bias and redundancy, none of the proteins in $\mathbb{S}$ has ≥25% pairwise sequence identity to any other in the same subset. For convenience, hereafter let us just use the subscripts of Eq.1 as the codes of the 8 location sites; i.e., “1” for “cell inner membrane”, “2” for “cell outer membrane”, “3” for “cytoplasm”, and so forth (Table 1).

Figure 1. Illustration to show the 8 subcellular locations of Gram-negative bacterial proteins.

The 8 locations are: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. Note that in prokaryotic life forms, the nucleoid region is the part of the cell that contains the DNA molecule; unlike the true nucleus of eukaryotes, it is not delimited by a membrane.

https://doi.org/10.1371/journal.pone.0020592.g001

Table 1. Breakdown of the Gram-negative bacterial protein benchmark dataset $\mathbb{S}$ taken from [11].

https://doi.org/10.1371/journal.pone.0020592.t001

Table 2. A comparison of the jackknife success rates by Gneg-mPLoc [11] and the current iLoc-Gneg on the benchmark dataset $\mathbb{S}$ (cf. Supporting Information S1) that covers 8 location sites of Gram-negative bacterial proteins in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same location.

https://doi.org/10.1371/journal.pone.0020592.t002

For readers' convenience, the corresponding accession numbers and protein sequences in $\mathbb{S}$ are given in Supporting Information S1.

Note that because some proteins may occur in two or more locations, the 1,392 Gram-negative proteins actually correspond to 1,456 locative proteins. The concept of “locative proteins” was introduced for studying proteins with multiple subcellular location sites, as elaborated in [2].
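The count of locative proteins follows directly from the figures quoted above: each protein contributes one locative protein per location it occupies. The short check below is only an illustrative sketch; the function name and dictionary layout are hypothetical and not part of the original work.

```python
# Proteins in the benchmark dataset, grouped by how many locations each occupies
# (figures quoted in the text: 1,328 single-location and 64 two-location proteins).
proteins_by_location_count = {1: 1328, 2: 64}


def count_locative(by_count):
    """Return (number of distinct proteins, number of locative proteins)."""
    n_proteins = sum(by_count.values())
    # A protein with k locations is counted k times as a locative protein.
    n_locative = sum(k * n for k, n in by_count.items())
    return n_proteins, n_locative


n_proteins, n_locative = count_locative(proteins_by_location_count)
print(n_proteins, n_locative)  # 1392 1456
```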

To develop a powerful method for statistically predicting protein subcellular localization according to the sequence information, one of the most important things is to formulate the protein sequences with an effective mathematical expression that can truly reflect the intrinsic correlation with their subcellular localization [14]. However, it is by no means an easy job to realize this because this kind of correlation is usually deeply “buried” or hidden in piles of complicated sequences.

The most straightforward method to formulate the sample of a query protein $\mathbf{P}$ was just using its entire amino acid sequence, as can be generally written by

$$\mathbf{P} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \mathrm{R}_4 \mathrm{R}_5 \cdots \mathrm{R}_L \tag{2}$$

where $\mathrm{R}_1$ represents the 1st residue of the protein $\mathbf{P}$, $\mathrm{R}_2$ the 2nd residue, …, $\mathrm{R}_L$ the $L$-th residue, and they each belong to one of the 20 native amino acids. In order to identify its subcellular location(s), sequence-similarity-search-based tools, such as BLAST [16], [17], were utilized to search the protein database for those proteins that have high sequence similarity to the query protein $\mathbf{P}$. Subsequently, the subcellular location annotations of the proteins thus found were used to deduce the subcellular location(s) for $\mathbf{P}$. Unfortunately, although it was quite intuitive and able to contain the entire information of a protein sequence, this kind of straightforward sequential model failed to work when the query protein $\mathbf{P}$ did not have significant sequence similarity to any location-known proteins.

Thus, various non-sequential or discrete models for formulating protein samples were proposed in the hope of establishing some sort of correlation or clustering pattern by which the prediction quality could be improved.

Among the discrete models for a protein sample, the simplest one is its amino acid (AA) composition or AAC [18]. According to the AAC-discrete model, the protein $\mathbf{P}$ of Eq.2 can be formulated by [19], [20]

$$\mathbf{P} = [f_1 \ \ f_2 \ \ \cdots \ \ f_{20}]^{\mathbf{T}} \tag{3}$$

where $f_i$ $(i = 1, 2, \ldots, 20)$ are the normalized occurrence frequencies of the 20 native amino acids in protein $\mathbf{P}$, and $\mathbf{T}$ the transposing operator. Many methods for predicting protein subcellular localization were based on the AAC-discrete model (see, e.g., [19], [21], [22], [23], [24]). However, as we can see from Eq.3, if using the AAC model to represent the protein $\mathbf{P}$, all its sequence-order effects would be lost, and hence the prediction quality might be limited.
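As an illustration of the AAC-discrete model, the following minimal sketch computes the 20 normalized occurrence frequencies of a sequence. The alphabetical ordering of the one-letter codes and the function name `aac` are assumptions made here for illustration; the source does not fix an ordering. Characters outside the 20 native amino acids are ignored in this sketch.

```python
from collections import Counter

# The 20 native amino acids as one-letter codes (alphabetical order is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20 normalized occurrence frequencies of a protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


# Two sequences with the same composition but a different residue order
# map to the same vector, which is the loss of sequence-order information
# noted in the text.
print(aac("MKKLA") == aac("ALKKM"))  # True
```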

To avoid completely losing the sequence-order information, the pseudo amino acid composition (PseAAC) was proposed to represent the sample of a protein, as formulated by [25]

$$\mathbf{P} = [p_1, \ p_2, \ \ldots, \ p_{20}, \ p_{20+1}, \ \ldots, \ p_{20+\lambda}]^{\mathbf{T}} \tag{4}$$

where the first 20 elements are associated with the 20 elements in Eq.3 or the 20 amino acid components of the protein $\mathbf{P}$, while the additional $\lambda$ factors are used to incorporate some sequence-order information via a series of rank-different correlation factors along a protein chain. For a brief introduction to PseAAC, please see the Wikipedia article at http://en.wikipedia.org/wiki/Pseudo_amino_acid_composition.

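According to [14], the PseAAC for a protein $\mathbf{P}$ can be generally formulated as

$$\mathbf{P} = [\psi_1 \ \ \psi_2 \ \ \cdots \ \ \psi_u \ \ \cdots \ \ \psi_\Omega]^{\mathbf{T}} \tag{5}$$

where the subscript $\Omega$ is an integer, and its value as well as the components $\psi_1$, $\psi_2$, … will depend on how to extract the desired information from the amino acid sequence of $\mathbf{P}$ (cf. Eq.2). As a general form, Eq.5 can cover various different modes of PseAAC. For example, when its elements are given by

$$\psi_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (20+1 \le u \le 20+\lambda) \end{cases} \tag{6}$$

we immediately obtain the formulation of PseAAC as originally introduced in [25], where the meanings for $f_i$, $w$, and $\theta_j$ were clearly elaborated and hence there is no need to repeat them here.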
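To make the normalization in Eq.6 concrete, here is a minimal sketch assuming Eq.6 takes the form originally introduced in [25]: the 20 amino acid frequencies and the sequence-order correlation factors, scaled by a weight factor, are divided by one common denominator. The correlation factors themselves are defined in [25] and are simply passed in here as precomputed numbers; the function name `pseaac` and the example values (including the weight) are hypothetical.

```python
def pseaac(freqs, thetas, w):
    """Combine 20 amino acid frequencies and the correlation factors into one vector.

    freqs  : the 20 normalized amino acid frequencies (cf. Eq.3)
    thetas : precomputed sequence-order correlation factors (defined in [25])
    w      : weight factor for the sequence-order part
    """
    if len(freqs) != 20:
        raise ValueError("expected exactly 20 amino acid frequencies")
    # Common denominator shared by all components.
    denom = sum(freqs) + w * sum(thetas)
    composition_part = [f / denom for f in freqs]
    order_part = [w * t / denom for t in thetas]
    return composition_part + order_part


# Illustrative values only: a uniform composition and two correlation factors.
psi = pseaac([0.05] * 20, [1.0, 2.0], w=0.5)
print(len(psi))  # 22
```

With this normalization the components always sum to 1, so the composition part and the sequence-order part are placed on a common scale controlled by the weight factor.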

Below, let us use the general form of PseAAC (Eq.5) to find the formulations to reflect the core and essential features of protein samples that are closely correlated with their subcellular localization.

1. GO (Gene Ontology) Formulation

GO database [12] was established according to the molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space would be clustered in a way better reflecting their subcellular locations [2], [26]. However, in order to incorporate more information, instead of only using 0 and 1 elements as done in [11], here let us use a different approach as described below.

Step 1.

Compression and reorganization of the existing GO numbers. The GO database (version 74.0 released 30 July 2009) contains many GO numbers. However, these numbers do not increase successively and orderly. For easier handling, a reorganization and compression procedure was applied to renumber them. For example, after such a procedure, the original GO numbers GO:0000001, GO:0000002, GO:0000003, GO:0000009, GO:0000011, GO:0000012, GO:0000015, …, GO:0090204 would become GO_compress: 00001, GO_compress: 00002, GO_compress: 00003, GO_compress: 00004, GO_compress: 00005, GO_compress: 00006, GO_compress: 00007, …, GO_compress: 11118, respectively. The GO database obtained through such a treatment is called the GO_compress database, which contains 11,118 numbers increasing successively from 1 to the last one.
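The renumbering described in Step 1 can be sketched as a mapping from the sparse GO numbers to consecutive indices. This is only an illustration under the assumption that the compressed index follows the ascending numeric order of the original GO numbers, as the example in the text suggests; the function name `build_go_compress` is hypothetical.

```python
def build_go_compress(go_ids):
    """Map sparse GO identifiers to consecutive 1-based indices in ascending order."""
    unique = sorted(set(go_ids), key=lambda g: int(g.split(":")[1]))
    return {go: index for index, go in enumerate(unique, start=1)}


# The first seven GO numbers quoted in the text.
go_ids = ["GO:0000001", "GO:0000002", "GO:0000003", "GO:0000009",
          "GO:0000011", "GO:0000012", "GO:0000015"]
mapping = build_go_compress(go_ids)
print(f"GO_compress: {mapping['GO:0000009']:05d}")  # GO_compress: 00004
```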

Step 2.

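Using Eq.5 with $\Omega = 11118$, the protein $\mathbf{P}$ can be formulated as

$$\mathbf{P}_{\mathrm{GO}} = [\psi_1^{G} \ \ \psi_2^{G} \ \ \cdots \ \ \psi_u^{G} \ \ \cdots \ \ \psi_{11118}^{G}]^{\mathbf{T}} \tag{7}$$

where $\psi_u^{G}$ $(u = 1, 2, \ldots, 11118)$ are defined via the following steps.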

Step 3.

Use BLAST [27] to search for the homologous proteins of the protein $\mathbf{P}$ in the Swiss-Prot database (version 55.3), with a cutoff on the expect value ($E$-value) as the BLAST parameter.

Step 4.

Those proteins whose pairwise sequence identity with the protein $\mathbf{P}$ meets the identity cutoff are collected into a set, $\mathbb{S}_{\mathbf{P}}^{\mathrm{homo}}$, called the “homology set” of $\mathbf{P}$. All the elements in $\mathbb{S}_{\mathbf{P}}^{\mathrm{homo}}$ can be deemed the “representative proteins” of $\mathbf{P}$, sharing some similar attributes such as structural conformations and biological functions [28], [29], [30]. Because they were retrieved from the Swiss-Prot database, these representative proteins must each have their own accession numbers.

Step 5.

Search each of these accession numbers collected in Step 4 against the GO database at http://www.ebi.ac.uk/GOA/ to find the corresponding GO numbers [31].

Step 6.

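Based on the results obtained in Step 5, the elements in Eq.7 can be written as

$$\psi_u^{G} = \frac{\sum_{k=1}^{N_{\mathbf{P}}^{\mathrm{homo}}} \delta(u, k)}{N_{\mathbf{P}}^{\mathrm{homo}}} \qquad (u = 1, 2, \ldots, 11118) \tag{8}$$

where $N_{\mathbf{P}}^{\mathrm{homo}}$ is the number of representative proteins in $\mathbb{S}_{\mathbf{P}}^{\mathrm{homo}}$, and

$$\delta(u, k) = \begin{cases} 1, & \text{if the } k\text{-th representative protein hits the } u\text{-th GO\_compress number} \\ 0, & \text{otherwise} \end{cases} \tag{9}$$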

As we can see from Eq.7, the GO formulation derived from the above steps consists of 11,118 real numbers rather than only the elements 0 and 1 as in the GO formulation adopted in [11].
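A minimal sketch of how such a real-valued GO vector could be assembled is given below. It assumes that each element is the fraction of representative proteins in the homology set that are annotated with the corresponding GO_compress number, which is consistent with the description of real numbers rather than 0/1 flags. The function name `go_vector` and the toy input are hypothetical, and a short vector of length 5 stands in for the full 11,118 dimensions.

```python
def go_vector(homolog_go_hits, n_go):
    """Build the GO formulation of a query protein.

    homolog_go_hits : one set of 1-based GO_compress indices per representative protein
    n_go            : total number of GO_compress numbers (11,118 in the text)
    """
    n_homologs = len(homolog_go_hits)
    vector = [0.0] * n_go
    if n_homologs == 0:
        # Empty homology set: the GO formulation degenerates to a naught vector.
        return vector
    for hits in homolog_go_hits:
        for u in hits:
            vector[u - 1] += 1.0
    # Each element becomes the fraction of representative proteins hitting that GO number.
    return [value / n_homologs for value in vector]


# Toy example: four representative proteins and five GO_compress numbers.
print(go_vector([{1, 3}, {3}, {2, 3}, {3}], n_go=5))  # [0.25, 0.25, 1.0, 0.0, 0.0]
```

Under the 0/1 scheme of [11], the first three elements of this toy vector would all be 1; the fractional form keeps the information that only the third GO number is shared by every representative protein.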

Note that the GO formulation of Eq.7 may become a naught vector or meaningless under either of the following situations: (1) the protein $\mathbf{P}$ does not have significant homology to any protein in the Swiss-Prot database, i.e., its homology set is empty; (2) its representative proteins do not contain any useful GO information for statistical prediction based on a given training dataset.

Under such a circumstance, let us consider using the sequential evolution formulation to represent the protein $\mathbf{P}$, as described below.

2. SeqEvo (Sequential Evolution) Formulation

Biology is a natural science with a historic dimension. All biological species have developed continuously, starting out from a very limited number of ancestral species. The same is true for protein sequences [30]. Their evolution involves changes of single residues, insertions and deletions of several residues [32], gene doubling, and gene fusion. As these changes accumulate over a long period of time, many similarities between the initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as having basically the same biological function and residing in the same subcellular location.

To incorporate the sequential evolution information into the PseAAC of Eq.4, here let us use the information of the PSSM (Position-Specific Scoring Matrix) [27], as described below.