iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model

Wei-Zhong Lin; Jian-An Fang; Xuan Xiao; Kuo-Chen Chou

doi:10.1371/journal.pone.0024756

Abstract

DNA-binding proteins play crucial roles in various cellular processes. Developing high throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power.

By incorporating features extracted from protein sequences via the “grey model” into the general form of pseudo amino acid composition, and by adopting the random forest operation engine, we propose a new predictor, called iDNA-Prot, for identifying uncharacterized proteins as DNA-binding or non-DNA-binding proteins based on their amino acid sequence information alone. The overall success rate of iDNA-Prot was 83.96%, obtained via jackknife tests on a newly constructed stringent benchmark dataset in which none of the proteins has more than 25% pairwise sequence identity to any other in the same subset. In addition to achieving a high success rate, iDNA-Prot requires remarkably less computational time than the relevant existing predictors. Hence it is anticipated that iDNA-Prot may become a useful high throughput tool for large-scale analysis of DNA-binding proteins.

As a user-friendly web-server, iDNA-Prot is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iDNA-Prot or http://www.jci-bioinfo.cn/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results.

Citation: Lin W-Z, Fang J-A, Xiao X, Chou K-C (2011) iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model. PLoS ONE 6(9): e24756. https://doi.org/10.1371/journal.pone.0024756

Editor: Vladimir N. Uversky, University of South Florida College of Medicine, United States of America

Received: July 24, 2011; Accepted: August 16, 2011; Published: September 15, 2011

Copyright: © 2011 Lin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the grants from the National Natural Science Foundation of China (No. 60961003), the Key Project of Chinese Ministry of Education (No. 210116), the Province National Natural Science Foundation of JiangXi (2009GZS0064 and 2010GZS0122), and the department of education of Jiang-Xi Province (No. GJJ09271). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

DNA-binding proteins play a vitally important role in many biological processes, such as recognition of specific nucleotide sequences, regulation of transcription, and regulation of gene expression. At present, several experimental techniques (such as filter binding assays, genetic analysis, chromatin immunoprecipitation on microarrays, and X-ray crystallography) are used for identifying DNA-binding proteins. Although these techniques can provide a detailed picture of the binding, they are both time-consuming and expensive [1]. Moreover, the number of newly discovered protein sequences has been increasing extremely fast. For example, in 1986 the Swiss-Prot [2] database contained only 3,939 protein sequence entries, but the number has now jumped to 530,264 according to release 2011_07 of UniProtKB/Swiss-Prot on 28-Jun-2011 (http://web.expasy.org/docs/relnotes/relstat.html), meaning that the number of protein sequence entries is now more than 134 times what it was about 25 years ago. Facing the avalanche of new protein sequences generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying and characterizing DNA-binding proteins based on the protein sequence information alone.

Numerous predictors have in fact been developed in this regard. For instance: Shanahan et al. [3] demonstrated how structural features could be employed to determine whether a protein of known structure and unknown function was a DNA-binding protein or not; Ahmad et al. [4] showed how to distinguish DNA-binding from non-DNA-binding proteins using net charge, net dipole moment and quadrupole moment, respectively; Nordhoff et al. [5] introduced identification and characterization of DNA-binding proteins by mass spectrometry, which was regarded as the most sensitive and specific analytical technique available for protein identification [6]. All the aforementioned methods relied considerably on the results of biochemical experiments. Among the existing methods, those based purely on theoretical approaches use various classification engines, such as support vector machine (SVM) [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], artificial neural network (ANN) [16], [17], [18], [19], [20], [21], random forest [22], [23], [24], nearest neighbor [25], and boosted decision trees [26].

In addition to using different prediction engines, the existing methods differ remarkably in the features extracted to represent protein samples. For instance, Bhardwaj [13] used a 40-D (dimensional) feature vector to formulate a protein sample, containing positive potential surface patches, overall charge of the protein, and overall surface composition. Yu et al. [10] constructed a 132-D feature vector to represent a protein sequence by using the information of hydrophobicity, predicted secondary structure, solvent accessibility, normalized van der Waals volume, polarity, and polarizability. Bhardwaj and Lu [9] represented a protein sample by harnessing 70 features of the DNA-binding residues, including the residue's identity, charge, solvent accessibility, average potential, secondary structure, neighboring residues, and location in a cationic patch. Kumar et al. [14] extracted features from the PSSM (Position-Specific Scoring Matrix) profile obtained from PSI-BLAST [27] to represent the protein sample. Subsequently, a different approach was proposed [22] to encode each protein sequence with 116 features by incorporating various physicochemical properties of amino acids. Meanwhile, Nanni and Lumini [15] proposed a method to represent a protein sample by combining ontologies and dipeptide composition. Later, the same authors [6] introduced the grouped weight to represent protein samples for predicting DNA-binding proteins. Langlois and Lu [1] represented a protein sample with 472 features, of which 240 were secondary structure features, 231 were dipeptide composition features, and one was the total charge over its amino acid sequence.

However, the existing predictors have the following shortcomings. (1) The extracted features are very complicated and their dimensions are too large. In particular, during the prediction process, some of the existing predictors even need information about the query proteins that was obtained from other experiments, such as their three-dimensional (3D) structures, functions, and other relevant knowledge. (2) The computational time needed by these predictors is usually very long; for instance, the predictor iDBPs [23] would usually take about 30 minutes to predict one query protein. (3) Most predictors did not provide a web-server for public use, while the others claimed they did but their servers are currently not in working condition; hence their practical application value is quite limited.

In view of this, the present study was initiated in an attempt to develop a new and more powerful predictor by addressing the aforementioned three problems.

According to a recent comprehensive review [28], to establish a really useful statistical predictor for a protein system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to deal with these steps.

Materials and Methods

1. Benchmark datasets

DNA-binding protein sequences were collected from the Protein Data Bank (PDB) release 03-May-2011 at http://www.pdb.org/, which contains 72,838 structures. By searching the keywords “Protein-DNA complex” and “DNA binding” through the “Advanced Search Interface”, we extracted 3,689 structures from the PDB.

To construct a high quality benchmark dataset with a wider coverage scope and lower homology bias, the data obtained above were screened strictly according to the following criteria. (1) Sequences with fewer than 50 amino acid (AA) residues were removed because they might just be fragments [29]. (2) Sequences with more than 10 consecutive “X” characters were removed because they contained too many unknown amino acids. (3) To reduce redundancy and homology bias, PISCES [30], [31] was used to remove those sequences that have more than 25% pairwise sequence identity to any other in the dataset. Finally, we obtained 212 DNA-binding proteins. Similarly, 212 non-DNA-binding protein domains were randomly picked from the data bank. Accordingly, the benchmark dataset thus obtained consists of 424 protein sequences, of which half are DNA-binding protein sequences and the other half non-binding protein sequences. Their accession codes and sequences are given in Information S1.
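The first two screening criteria can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' code: the function name `passes_screen` and the toy sequences are hypothetical, and criterion (3) is not reproduced because the culling by pairwise identity is performed by the external PISCES server.

```python
import re

MIN_LENGTH = 50   # criterion (1): drop sequences shorter than 50 residues
MAX_X_RUN = 10    # criterion (2): drop sequences with more than 10 consecutive "X"

# Matches any run of MAX_X_RUN + 1 or more consecutive "X" characters.
_LONG_X_RUN = re.compile("X{%d,}" % (MAX_X_RUN + 1))


def passes_screen(sequence: str) -> bool:
    """Return True if the sequence survives criteria (1) and (2)."""
    seq = sequence.strip().upper()
    if len(seq) < MIN_LENGTH:
        return False
    if _LONG_X_RUN.search(seq):
        return False
    return True


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    candidates = {
        "too_short": "MKV" * 10,                          # 30 residues
        "many_unknown": "A" * 40 + "X" * 11 + "A" * 40,   # run of 11 "X"
        "kept": "ACDEFGHIKLMNPQRSTVWY" * 3,               # 60 residues
    }
    kept = [name for name, seq in candidates.items() if passes_screen(seq)]
    print(kept)
```

A run of exactly 10 “X” characters is retained here, since the criterion removes only sequences with more than 10 consecutive unknown residues.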

2. A novel pseudo amino acid composition of grey model

To develop a powerful predictor for a protein system, one of the keys is to formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted [28]. To realize this, the concept of pseudo amino acid composition (PseAAC) was proposed [32] to replace the simple amino acid composition (AAC) for representing the sample of a protein. According to Eq. 6 of [28], the general form of PseAAC for a protein $\mathbf{P}$ can be formulated as

$$\mathbf{P} = \left[ \psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_\Omega \right]^{\mathbf{T}} \tag{1}$$

where $\mathbf{T}$ is the transpose operator, while the subscript $\Omega$ is an integer whose value, as well as the components $\psi_1$, $\psi_2$, …, will depend on how the desired information is extracted from the amino acid sequence of $\mathbf{P}$.

In this study, we use the grey model parameters to define the elements in Eq. 1. In 1989, Deng [33] proposed the grey system theory to investigate the uncertainty of a system. According to this theory, if the information of the system investigated is fully known, it is called a “white system”; if completely unknown, a “black system”; if partially known, a “grey system”. A model developed on the basis of this theory is called a “grey model”, which is a kind of nonlinear and dynamic model formulated by a differential equation. The grey model is particularly useful for solving complicated problems that lack sufficient information, or that need to process uncertain information and reduce random effects in the acquired data.

In the grey system theory, an important and generally used model is called GM(1,1). It is quite effective for monotonic series, with good simulating effect and small error, as reflected by the fact that using the GM(1,1) model has remarkably improved the success rates in predicting protein structural classes [34]. However, if the series concerned are not monotonic, the simulating effect of GM(1,1) would not be good and its error might be quite large.

To overcome the above problem, the current study uses a special grey dynamic model called GM(2,1), which can handle oscillating series. In GM(2,1) the least-squares strategy is adopted to determine the uncertain parameters, as briefly described below.

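One of the most commonly used approaches in the grey system theory is the accumulative generation operation (AGO), which can convert a series without any obvious regularity into a strictly monotonically increasing series so as to reduce the randomness, enhance the smoothness of the series, and minimize interference from random information. Let us assume that

$$X^{(0)} = \left( x^{(0)}(1),\; x^{(0)}(2),\; \ldots,\; x^{(0)}(n) \right)$$

is the original series of real numbers with an irregular distribution, and that it is a non-negative data sequence. Then $X^{(1)} = \left( x^{(1)}(1),\; x^{(1)}(2),\; \ldots,\; x^{(1)}(n) \right)$ is viewed as the first-order accumulative generation operation (1-AGO) series for $X^{(0)}$; i.e., the components of $X^{(1)}$ are given by

$$x^{(1)}(k) = \sum_{i=1}^{k} x^{(0)}(i), \qquad k = 1, 2, \ldots, n \tag{2}$$

The GM(2,1) model can be expressed by the following second-order grey differential equation with one variable:

$$\alpha^{(1)} x^{(0)}(k) + a_1 x^{(0)}(k) + a_2 z^{(1)}(k) = b \tag{3}$$

where:

$$\alpha^{(1)} x^{(0)}(k) = x^{(0)}(k) - x^{(0)}(k-1) \tag{4}$$

$$z^{(1)}(k) = \tfrac{1}{2} \left[ x^{(1)}(k) + x^{(1)}(k-1) \right] \tag{5}$$

In Eq. 3, the coefficients $a_1$ and $a_2$ are the developing coefficients, and $b$ is the influence coefficient. Then we have

$$x^{(0)}(k) - x^{(0)}(k-1) = -a_1 x^{(0)}(k) - a_2 z^{(1)}(k) + b, \qquad k = 2, 3, \ldots, n \tag{6}$$

Thus, it follows by the least-squares method that

$$\hat{\mathbf{a}} = \left[ a_1,\; a_2,\; b \right]^{\mathbf{T}} = \left( \mathbf{B}^{\mathbf{T}} \mathbf{B} \right)^{-1} \mathbf{B}^{\mathbf{T}} \mathbf{Y} \tag{7}$$

where

$$\mathbf{B} = \begin{bmatrix} -x^{(0)}(2) & -z^{(1)}(2) & 1 \\ -x^{(0)}(3) & -z^{(1)}(3) & 1 \\ \vdots & \vdots & \vdots \\ -x^{(0)}(n) & -z^{(1)}(n) & 1 \end{bmatrix} \tag{8}$$

$$\mathbf{Y} = \begin{bmatrix} x^{(0)}(2) - x^{(0)}(1) \\ x^{(0)}(3) - x^{(0)}(2) \\ \vdots \\ x^{(0)}(n) - x^{(0)}(n-1) \end{bmatrix} \tag{9}$$

The least-squares estimates of the coefficients $a_1$, $a_2$ and $b$ should carry some intrinsic information contained in the discrete data sequence sampled from the system investigated. In view of this, incorporating these coefficients into the general form of PseAAC (Eq. 1) will make the formulation of a protein sample better reflect its intrinsic correlation with the attribute to be predicted. This is the key of the novel approach. The concrete procedures are as follows.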
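The least-squares step of Eqs. 2–9 can be illustrated with a short, dependency-free Python sketch. It assumes the standard GM(2,1) form, in which each k = 2, …, n contributes one equation x0(k) − x0(k−1) + a1·x0(k) + a2·z1(k) = b with z1(k) the mean of two consecutive 1-AGO values; the function names (`ago`, `solve_3x3`, `gm21_coefficients`) are hypothetical and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def ago(x0: Sequence[float]) -> List[float]:
    """First-order accumulative generation operation (running sum)."""
    total, x1 = 0.0, []
    for value in x0:
        total += value
        x1.append(total)
    return x1


def solve_3x3(m: List[List[float]], v: List[float]) -> List[float]:
    """Solve m @ p = v by Gaussian elimination with partial pivoting."""
    n = 3
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    p = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(a[r][c] * p[c] for c in range(r + 1, n))
        p[r] = (a[r][n] - s) / a[r][r]
    return p


def gm21_coefficients(x0: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares estimate of (a1, a2, b) from a numeric series."""
    if len(x0) < 4:
        raise ValueError("need at least 4 points to fit 3 coefficients")
    x1 = ago(x0)
    rows, y = [], []
    for k in range(1, len(x0)):               # 0-based index; k + 1 in the text
        z1 = 0.5 * (x1[k] + x1[k - 1])        # background value z1(k)
        rows.append([-x0[k], -z1, 1.0])       # one row of B
        y.append(x0[k] - x0[k - 1])           # one entry of Y
    # Normal equations (B^T B) p = B^T Y, solved directly for p = (a1, a2, b).
    btb = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    bty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    a1, a2, b = solve_3x3(btb, bty)
    return a1, a2, b
```

Solving the 3×3 normal equations directly avoids forming an explicit matrix inverse; for long or nearly collinear series, a QR-based least-squares routine from a numerical library would likely be more robust.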

A protein sequence is composed of 20 different types of native amino acids denoted by A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. Before using the grey dynamic model GM(2,1), we need to represent the protein sequence by a series of real numbers. Listed in Table 1 are the numerical codes used in this study to represent the 20 amino acids.

Table 1. The numerical codes of 20 native amino acids.

https://doi.org/10.1371/journal.pone.0024756.t001

The factor scores for molecular volume [35], which are related to the molecular size or volume as well as the side-chain weight, are adopted. Because only non-negative sequences are considered in the current study, we adopt the function of Eq. 10 for modeling. Through this function, each of the 20 amino acids can be translated into a numerical variable within the interval (0, 1) (Table 1). With the numerical codes thus obtained, we can convert a protein sequence to a series of real numbers. Thus, the three coefficients for any protein sequence can be derived with the grey model GM(2,1) by following Eqs. 2–9.

3. Predicting algorithm

The three coefficients obtained in the above section, in addition to the 20 components of the classical amino acid composition [36], can be used to form a new mode of PseAAC with $\Omega = 20 + 3 = 23$ components. Thus, according to Eq. 1, the protein $\mathbf{P}$ can be formulated with a new mode of PseAAC as given by

$$\mathbf{P} = \left[ f_1,\; f_2,\; \ldots,\; f_{20},\; |a_1|,\; |a_2|,\; |b| \right]^{\mathbf{T}} \tag{11}$$

where $f_1, f_2, \ldots, f_{20}$ are the occurrence frequencies of the 20 different types of amino acids in the protein concerned, while the last three components are the absolute values of the coefficients $a_1$, $a_2$ and $b$, respectively.
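As a minimal sketch of how such a 23-component vector could be assembled, the Python snippet below computes the 20 occurrence frequencies and appends the absolute values of the three grey-model coefficients, which are passed in as plain numbers. The function name `pseaac_vector` is hypothetical, and skipping non-standard residue characters is an assumption of this sketch rather than something stated in the paper.

```python
from typing import List

# The 20 native amino acids, in the order listed in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseaac_vector(sequence: str, a1: float, a2: float, b: float) -> List[float]:
    """Return 20 occurrence frequencies followed by |a1|, |a2|, |b|."""
    # Assumption: characters outside the 20 standard codes are ignored.
    seq = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not seq:
        raise ValueError("sequence contains no standard amino acids")
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    return freqs + [abs(a1), abs(a2), abs(b)]


if __name__ == "__main__":
    # Toy sequence and made-up coefficients, for illustration only.
    vec = pseaac_vector("AACD", a1=-0.5, a2=0.25, b=-1.0)
    print(len(vec), vec[:3], vec[20:])
```

The frequencies always sum to one, so the first 20 components describe composition only; the three appended coefficients are what carries sequence-order information in this representation.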

The Random Forest (RF) algorithm was then adopted to perform the prediction. RF is a popular machine learning algorithm that has recently been successfully employed in dealing with various biological prediction problems [37], [38], [39], [40]. RF is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. It has been shown that combining multiple trees produced in randomly selected subspaces can significantly improve the prediction accuracy. RF performs a type of cross-validation by using out-of-bag samples: during the training process, each tree is constructed using a different bootstrap sample from the original data. For a detailed description of the RF algorithm, refer to [41], [42], [43].
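The bootstrap and out-of-bag idea can be illustrated with a small Python sketch. It shows only how the samples seen by one tree are drawn, not a full random forest; the function name `bootstrap_split` and the fixed seed are hypothetical choices made for this illustration.

```python
import random
from typing import List, Tuple


def bootstrap_split(n_samples: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so the run is repeatable
    n = 424                  # size of the benchmark dataset in this paper
    for tree in range(3):
        in_bag, oob = bootstrap_split(n, rng)
        print(f"tree {tree}: {len(set(in_bag))} distinct in-bag, {len(oob)} out-of-bag")
```

For large datasets, about 1/e (roughly 37%) of the samples are expected to be left out of each bootstrap draw, so every tree can be tested on data it never saw during training.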

The RF algorithm is available via the link at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. An RF tool for MATLAB has also recently become available at http://code.google.com/p/randomforest-matlab/, which has two important functions: one is “classRF_train” for training on given data and returning the prediction model, and the other is “classRF_predict” for predicting a query input with the prediction model. The classifier in this paper was developed based on this RF tool for MATLAB.

The classifier thus established is called iDNA-Prot, which can be used to predict whether a protein can bind with DNA according to its sequence information alone.

For practical applications, a web-server for iDNA-Prot was established at http://icpr.jci.edu.cn/bioinfo/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide on how to use the web-server is given in Information S3, by which users can easily get their desired results without the need to follow the complicated mathematical equations involved in developing the iDNA-Prot predictor.

Results and Discussion

In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test, and jackknife test [44]. Of the three, the jackknife test is deemed the most objective [45], for the following reasons. (1) For the independent dataset test, although all the proteins used to test the predictor are outside the training dataset, so as to exclude the “memory” effect or bias, the selection of independent proteins for testing can be quite arbitrary unless the number of independent proteins is sufficiently large. This kind of arbitrariness might result in completely different conclusions. For instance, a predictor achieving a higher success rate than another predictor on a given independent testing dataset might fail to do so when tested on another independent testing dataset [44]. (2) For the subsampling test, the concrete procedure usually used in the literature is 5-fold, 7-fold or 10-fold cross-validation. The problem with this kind of subsampling test is that the number of possible selections in dividing a benchmark dataset is astronomical even for a very simple dataset, as demonstrated by Eqs. 28–30 in [28]. Therefore, in any actual subsampling cross-validation test, only an extremely small fraction of the possible selections is taken into account. Since different selections can lead to different results even for the same benchmark dataset and the same predictor, the subsampling test cannot avoid arbitrariness either. A test method unable to yield a unique outcome cannot be deemed a good one. (3) In the jackknife test, all the proteins in the benchmark dataset are singled out one by one and tested by the predictor trained on the remaining protein samples. During the process of jackknifing, both the training dataset and the testing dataset are actually open, and each protein sample is in turn moved between the two. The jackknife test can exclude the “memory” effect. Also, the arbitrariness problem mentioned above for the independent dataset test and subsampling test can be avoided because the outcome obtained by jackknife cross-validation is always unique for a given benchmark dataset. Accordingly, the jackknife test has been increasingly and widely used by investigators with a strong mathematical background to examine the quality of various predictors (see, e.g., [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57]). In view of this, the jackknife cross-validation was also used here to examine the prediction quality of the current predictor.
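The jackknife procedure itself is independent of the classifier, as the following Python sketch suggests. Note the assumptions: `jackknife_success_rate` and `train_nearest_centroid` are hypothetical names, the nearest-centroid classifier is only a simple stand-in for the random forest engine used in the paper, and the one-dimensional toy data are invented for illustration.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Predictor = Callable[[Vector], int]
Trainer = Callable[[List[Vector], List[int]], Predictor]


def jackknife_success_rate(X: List[Vector], y: List[int], train: Trainer) -> float:
    """Single out each sample once, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predict = train(X_train, y_train)
        if predict(X[i]) == y[i]:
            correct += 1
    return correct / len(X)


def train_nearest_centroid(X: List[Vector], y: List[int]) -> Predictor:
    """Stand-in classifier (NOT random forest): label of the closest class mean."""
    centroids: Dict[int, List[float]] = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(x: Vector) -> int:
        def sq_dist(c: List[float]) -> float:
            return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

    return predict


if __name__ == "__main__":
    # Invented toy data: two well-separated classes on a line.
    X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
    y = [0, 0, 0, 1, 1, 1]
    print(jackknife_success_rate(X, y, train_nearest_centroid))
```

With a deterministic training routine such as this stand-in, repeating the jackknife on the same dataset always gives the same success rate, which is the uniqueness property discussed above.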

The results thus obtained with iDNA-Prot on the benchmark dataset of Information S1 are given in Table 2, from which we can see that the overall success rate was 83.96% in identifying proteins as DNA-binding proteins or non-DNA-binding proteins.

Table 2. Results obtained by iDNA-Prot on the benchmark dataset of Information S1 through the jackknife test^a.

https://doi.org/10.1371/journal.pone.0024756.t002

Furthermore, to demonstrate that the current predictor iDNA-Prot is superior to the existing ones, let us compare iDNA-Prot with DNA-Prot [22]. We chose DNA-Prot for comparison because, among the existing methods for predicting DNA-binding proteins, the reported success rate achieved by DNA-Prot [22] is the highest. The datasets used to train and test DNA-Prot [22], as well as its standalone version, can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/dnaprot.htm.

The training dataset used for DNA-Prot [22] contains 146 DNA-binding proteins and 250 non-DNA-binding proteins.

The data used to test DNA-Prot [22] contain the following three sets: (i) testing dataset-1, consisting of 92 DNA-binding proteins and 100 non-DNA-binding proteins; (ii) testing dataset-2, consisting of 823 DNA-binding proteins and 823 non-DNA-binding proteins; and (iii) testing dataset-3, consisting of 88 DNA-binding proteins and 233 non-DNA-binding proteins. All these datasets were elaborated in [22].

However, it was found (see Information S2) that there are 10 identical protein sequences between the 146 DNA-binding proteins in the training dataset and the 92 DNA-binding proteins in the 1st testing dataset, 19 identical protein sequences between the 146 DNA-binding proteins in the training dataset and the 88 DNA-binding proteins in the 3rd testing dataset, and 94 identical protein sequences between the 250 non-DNA-binding proteins in the training dataset and the 233 non-DNA-binding proteins in the 3rd testing dataset. In other words, the so-called independent datasets used by DNA-Prot [22] were actually not independent and hence would lead to over-estimated success rates.
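A check of this kind amounts to looking for test sequences that also occur verbatim in the training set. The Python sketch below shows one way it might be done; the function name `overlapping_test_ids`, the identifiers and the short sequences are all hypothetical, and the normalization (trimming whitespace, ignoring case) is an assumption of this sketch.

```python
from typing import Dict, List


def overlapping_test_ids(train: Dict[str, str], test: Dict[str, str]) -> List[str]:
    """Return the IDs of test entries whose sequence also occurs in the training set."""
    train_seqs = {seq.strip().upper() for seq in train.values()}
    return sorted(tid for tid, seq in test.items() if seq.strip().upper() in train_seqs)


if __name__ == "__main__":
    # Invented identifiers and sequences, for illustration only.
    train = {"train_1": "ACDEFGHIK", "train_2": "KLMNPQRST"}
    test = {"test_1": "acdefghik", "test_2": "WWWWYYYY"}
    print(overlapping_test_ids(train, test))
```

Exact matching only detects identical sequences; near-duplicates with high sequence identity would need an alignment-based tool, such as the PISCES server mentioned earlier in the paper.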

Accordingly, to perform an objective and fair comparison of iDNA-Prot with DNA-Prot [22], let us construct a truly independent dataset by randomly picking some DNA-binding proteins and non-DNA-binding proteins from the PDB (Protein Data Bank) according to the following criteria: these proteins must not occur in the training dataset of iDNA-Prot nor in the training dataset of DNA-Prot [22], and none of the proteins included may have more than 40% sequence identity to any other in the same subset. By doing so, we obtained a truly independent dataset in which 122 proteins are DNA-binding proteins and 122 are non-DNA-binding proteins. The sequences and accession numbers of these 244 independent proteins are given in Information S4.

Listed in Table 3 are the test results obtained by DNA-Prot [22] and iDNA-Prot, respectively, on the 244 independent proteins (Information S4). From the table we can see that the success rates of iDNA-Prot in identifying both DNA-binding and non-DNA-binding proteins are remarkably higher than those of DNA-Prot [22], and that the overall success rate achieved by iDNA-Prot is about 13% higher than that of DNA-Prot [22].

Table 3. A comparison of the predicted results by DNA-Prot [22] and iDNA-Prot on the independent dataset in Information S4.

https://doi.org/10.1371/journal.pone.0024756.t003

In addition to yielding higher success rates than those by the relevant existing predictors, the computational time needed by iDNA-Prot to complete a prediction is also significantly shorter than any of its counterparts, and hence iDNA-Prot may become a useful high throughput tool for large-scale investigation of DNA-binding proteins.

Moreover, to demonstrate the efficiency of the current method, hypothetical proteins that are annotated as DNA-binding proteins were used to test iDNA-Prot. Information about this kind of hypothetical protein can be obtained at http://www.ncbi.nlm.nih.gov/protein/?term=DNAbindinghypothetical, from which we randomly picked 100 DNA-binding hypothetical proteins for testing. The results predicted by iDNA-Prot on these proteins are given in Table 4, from which we can see that the overall success rate is 90%.

Table 4. The predicted results by iDNA-Prot on the 100 DNA-binding hypothetical proteins from http://www.ncbi.nlm.nih.gov/protein/?term=DNAbindinghypothetical.

https://doi.org/10.1371/journal.pone.0024756.t004

Supporting Information

Information S1.

The benchmark dataset includes 424 proteins, classified into 212 DNA-binding proteins and 212 non-DNA-binding proteins. Both the PDB (Protein Data Bank) accession identifiers and the sequences are given. None of the proteins has more than 25% sequence identity to any other in the same subset. See the text of the paper for further explanation.

https://doi.org/10.1371/journal.pone.0024756.s001

(PDF)

Information S2.

List of protein codes that occur in both the training and testing datasets for DNA-Prot (Kumar et al., 2009). See the main paper for further explanation.

https://doi.org/10.1371/journal.pone.0024756.s002

(PDF)

Information S3.

A step-by-step guide on how to use the web-server of iDNA-Prot to get the desired results.

https://doi.org/10.1371/journal.pone.0024756.s003

(PDF)

Information S4.

The independent dataset includes 244 proteins, classified into 122 DNA-binding proteins and 122 non-DNA-binding proteins. Both the PDB (Protein Data Bank) accession identifiers and the sequences are given. None of the proteins has more than 40% sequence identity to any other in the same subset. See the text of the paper for further explanation.

https://doi.org/10.1371/journal.pone.0024756.s004

(PDF)

Acknowledgments

The authors wish to thank the editor and the anonymous reviewer for their constructive comments, which were very helpful for strengthening the presentation of this paper.

Author Contributions

Conceived and designed the experiments: WZL XX KCC. Performed the experiments: WZL JAF. Analyzed the data: WZL KCC. Contributed reagents/materials/analysis tools: XX. Wrote the paper: WZL KCC.

References

1. Langlois RE, Lu H (2010) Boosting the prediction and understanding of DNA-binding domains from sequence. Nucleic Acids Res 38: 3149–3158.
2. Bairoch A, Apweiler R (2000) The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Research 25: 31–36.
3. Shanahan HP, Garcia MA, Jones S, Thornton JM (2004) Identifying DNA-binding proteins using structural motifs and the electrostatic potential. Nucleic Acids Research 32: 4732–4741.
4. Ahmad S, Gromiha MM, Sarai A (2004) Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics 20: 477–486.
5. Nordhoff E, Krogsdam AM, Jorgensen HF, Kallipolitis BH, Clark BF, et al. (1999) Rapid identification of DNA-binding proteins by mass spectrometry. Nat Biotechnol 17: 884–888.
6. Nanni L, Lumini A (2009) An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins. Amino Acids 36: 167–175.
7. Brown JB, Akutsu T (2009) Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology. BMC Bioinformatics 10: 25.
8. Cai YD, Lin SL (2003) Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim Biophys Acta 1648: 127–133.
9. Bhardwaj N, Lu H (2007) Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions. FEBS Lett 581: 1058–1066.
10. Yu X, Cao J, Cai Y, Shi T, Li Y (2006) Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J Theor Biol 240: 175–184.
11. Fang Y, Guo Y, Feng Y, Li M (2008) Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features. Amino Acids 34: 103–109.
12. Shao X, Tian Y, Wu L, Wang Y, Jing L, et al. (2009) Predicting DNA- and RNA-binding proteins from sequences with kernel methods. J Theor Biol 258: 289–293.
13. Bhardwaj N, Langlois RE, Zhao G, Lu H (2005) Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res 33: 6486–6493.
14. Kumar M, Gromiha MM, Raghava GP (2007) Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics 8: 463.
15. Nanni L, Lumini A (2008) Combing ontologies and dipeptide composition for predicting DNA-binding proteins. Amino Acids 34: 635–641.
16. Patel AK, Patel S, Naik PK (2010) Prediction and Classification of DNA Binding Proteins into Four Major Classes Based on Simple Sequence Derived Features Using Ann. Digest Journal of Nanomaterials and Biostructures 5: 191–200.
17. Patel AK, Patel S, Naik PK (2009) Binary Classification of Uncharacterized Proteins into DNA Binding/Non-DNA Binding Proteins from Sequence Derived Features Using Ann. Digest Journal of Nanomaterials and Biostructures 4: 775–782.
18. Molparia B, Goyal K, Sarkar A, Kumar S, Sundar D (2010) ZiF-Predict: a web tool for predicting DNA-binding specificity in C2H2 zinc finger proteins. Genomics Proteomics Bioinformatics 8: 122–126.
19. Ahmad S, Sarai A (2004) Moment-based prediction of DNA-binding proteins. Journal of Molecular Biology 341: 65–71.
20. Keil M, Exner TE, Brickmann J (2004) Pattern recognition strategies for molecular surfaces: III. Binding site prediction with a neural network. J Comput Chem 25: 779–789.
21. Stawiski EW, Gregoret LM, Mandel-Gutfreund Y (2003) Annotating nucleic acid-binding function based on protein structure. Journal of Molecular Biology 326: 1065–1079.
22. Kumar KK, Pugalenthi G, Suganthan PN (2009) DNA-Prot: identification of DNA binding proteins from protein sequence information using random forest. J Biomol Struct Dyn 26: 679–686.
23. Nimrod G, Schushan M, Szilagyi A, Leslie C, Ben-Tal N (2010) iDBPs: a web server for the identification of DNA binding proteins. Bioinformatics 26: 692–693.
24. Nimrod G, Szilagyi A, Leslie C, Ben-Tal N (2009) Identification of DNA-binding proteins using structural, electrostatic and evolutionary features. J Mol Biol 387: 1040–1053.
25. Cai Y, He J, Li X, Lu L, Yang X, et al. (2009) A novel computational approach to predict transcription factor DNA binding preference. J Proteome Res 8: 999–1003.
26. Neumann A, Holstein J, Le Gall JR, Lepage E (2004) Measuring performance in health care: case-mix adjustment by boosted decision trees. Artif Intell Med 32: 97–113.
27. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29: 2994–3005.
28. Chou KC (2011) Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). Journal of Theoretical Biology 273: 236–247.
29. Chou K-C, Shen H-B (2007) Recent progress in protein subcellular location prediction. Analytical Biochemistry 370: 1–16.
30. Wang G, Dunbrack RL Jr (2005) PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res 33: W94–98.
31. Wang G, Dunbrack RL Jr (2003) PISCES: a protein sequence culling server. Bioinformatics 19: 1589–1591.
32. Chou KC (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 43: 246–255.
33. Deng JL (1989) Introduction to Grey System Theory. The Journal of Grey System 1–24.
34. Xiao X, Lin WZ, Chou KC (2008) Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J Comput Chem 29: 2018–2024.
35. Atchley WR, Zhao JP, Fernandes AD, Druke T (2005) Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences of the United States of America 102: 6395–6400.
36. Chou KC (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Structure, Function & Genetics 21: 319–344.
37. Wu JS, Liu HD, Duan XY, Ding Y, Wu HT, et al. (2009) Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics 25: 30–35.
38. Dehzangi A, Phon-Amnuaisuk S, Dehzangi O (2010) Using Random Forest for Protein Fold Prediction Problem: An Empirical Study. Journal of Information Science and Engineering 26: 1941–1956.
39. Liu ZP, Wu LY, Wang Y, Zhang XS, Chen LN (2010) Prediction of protein-RNA binding sites by a random forest method with combined features. Bioinformatics 26: 1616–1622.
40. Kandaswamy KK, Chou KC, Martinetz T, Moller S, Suganthan PN, et al. (2011) AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties. Journal of Theoretical Biology 270: 56–62.
41. Breiman L (2000) Randomizing outputs to increase prediction accuracy. Machine Learning 40: 229–242.
42. Breiman L (2001) Random forests. Machine Learning 45: 5–32.
43. Rogers J, Gunn S (2006) Identifying feature relevance using a random forest. Subspace, Latent Structure and Feature Selection 3940: 173–184.
44. Chou KC, Zhang CT (1995) Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology 30: 275–349.
45. Chou KC, Shen HB (2008) Cell-PLoc: A package of Web servers for predicting subcellular localization of proteins in various organisms (updated version: Cell-PLoc 2.0: An improved package of web-servers for predicting subcellular localization of proteins in various organisms, Natural Science, 2010, 2, 1090–1103). Nature Protocols 3: 153–162.
46. Esmaeili M, Mohabatkar H, Mohsenzadeh S (2010) Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology 263: 203–209.
47. Chen C, Chen L, Zou X, Cai P (2009) Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. Protein & Peptide Letters 16: 27–31.
48. Georgiou DN, Karakasidis TE, Nieto JJ, Torres A (2009) Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition. Journal of Theoretical Biology 257: 17–26.
49. Chou KC, Wu ZC, Xiao X (2011) iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins. PLoS One 6: e18258.
50. Gu Q, Ding YS, Zhang TL (2010) Prediction of G-Protein-Coupled Receptor Classes in Low Homology Using Chou's Pseudo Amino Acid Composition with Approximate Entropy and Hydrophobicity Patterns. Protein & Peptide Letters 17: 559–567.
51. Mohabatkar H, Mohammad Beigi M, Esmaeili A (2011) Prediction of GABA(A) receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine. Journal of Theoretical Biology 281: 18–23.
52. Xiao X, Wu ZC, Chou KC (2011) A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PLoS One 6: e20592.
53. Mohabatkar H (2010) Prediction of cyclin proteins using Chou's pseudo amino acid composition. Protein & Peptide Letters 17: 1207–1214.
54. Yu L, Guo Y, Li Y, Li G, Li M, et al. (2010) SecretP: Identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition. Journal of Theoretical Biology 267: 1–6.
55. Zeng YH, Guo YZ, Xiao RQ, Yang L, Yu LZ, et al. (2009) Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. Journal of Theoretical Biology 259: 366–372.
56. Qiu JD, Huang JH, Shi SP, Liang RP (2010) Using the concept of Chou's pseudo amino acid composition to predict enzyme family classes: an approach with support vector machine based on discrete wavelet transform. Protein & Peptide Letters 17: 715–722.
57. Zhou XB, Chen C, Li ZC, Zou XY (2007) Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. Journal of Theoretical Biology 248: 546–551.

