
Metabolic pathway inference using multi-label classification with rich pathway features

  • Abdur Rahman M. A. Basher,

    Roles Conceptualization, Formal analysis, Investigation, Software, Validation, Visualization, Writing – original draft

    Affiliation Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia, Canada

  • Ryan J. McLaughlin,

    Roles Formal analysis, Investigation, Validation, Visualization, Writing – review & editing

    Affiliation Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia, Canada

  • Steven J. Hallam

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Visualization, Writing – original draft, Writing – review & editing

    shallam@mail.ubc.ca

    Affiliations Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia, Canada, Department of Microbiology & Immunology, University of British Columbia, 2552-2350 Health Sciences Mall, Vancouver, British Columbia, Canada, Genome Science and Technology Program, University of British Columbia, 2329 West Mall, Vancouver, BC, Canada, Life Sciences Institute, University of British Columbia, Vancouver, British Columbia, Canada, ECOSCOPE Training Program, University of British Columbia, Vancouver, British Columbia, Canada

Abstract

Metabolic inference from genomic sequence information is a necessary step in determining the capacity of cells to make a living in the world at different levels of biological organization. A common method for determining the metabolic potential encoded in genomes is to map conceptually translated open reading frames onto a database containing known product descriptions. Such gene-centric methods are limited in their capacity to predict pathway presence or absence and do not support standardized rule sets for automated and reproducible research. Pathway-centric methods based on defined rule sets or machine learning algorithms provide an adjunct or alternative inference method that supports hypothesis generation and testing of metabolic relationships within and between cells. Here, we present mlLGPR, multi-label based on logistic regression for pathway prediction, a software package that uses supervised multi-label classification and rich pathway features to infer metabolic networks in organismal and multi-organismal datasets. We evaluated mlLGPR performance using a corpus of 12 experimental datasets manifesting diverse multi-label properties, including manually curated organismal genomes, synthetic microbial communities and low complexity microbial communities. Resulting performance metrics equaled or exceeded previous reports for organismal genomes and identified specific challenges associated with features engineering and training data for community-level metabolic inference.

Author summary

Predicting the complex series of metabolic interactions, e.g. pathways, within and between cells from genomic sequence information is an integral problem in biology linking genotype to phenotype. This is a prerequisite to both understanding fundamental life processes and ultimately engineering these processes for specific biotechnological applications. A pathway prediction problem exists because we have limited knowledge of the reactions and pathways operating in cells, even in model organisms like Escherichia coli where the majority of protein functions are determined. To improve pathway prediction outcomes for genomes at different levels of complexity and completion we have developed mlLGPR, multi-label based on logistic regression for pathway prediction, a scalable open source software package that uses supervised multi-label classification and rich pathway features to infer metabolic networks. We benchmark mlLGPR performance against other inference methods, providing a code base and metrics for continued application of machine learning methods to the pathway prediction problem.

Introduction

Metabolic inference from genomic sequence information is a fundamental problem in biology with far-reaching implications for our capacity to perceive, evaluate and engineer cells at the individual, population and community levels of organization [1, 2]. Metabolic interactions can be described in terms of molecular events, or reactions, coordinated within a series or cycle. The set of reactions within and between cells defines a reactome, while the set of linked reactions defines pathways within and between cells. Reactomes and pathways can be predicted from primary sequence information and refined using mass spectrometry to both validate known and uncover novel pathways.

The development of reliable and flexible rule sets for metabolic inference is a non-trivial step that requires manual curation to add accurate taxonomic or pathway labels [3]. This problem is compounded by the ever increasing abundance of different data structures sourced from organismal genomes, single-cell amplified genomes (SAGs) and metagenome assembled genomes (MAGs) (Fig 1). Under ideal circumstances, pathways are inferred from a bounded reactome that has been manually curated to reflect detailed biochemical knowledge from a closed reference genome, e.g. T1 in the information hierarchy in (Fig 1). While this is possible for a subset of model organisms, it becomes increasingly difficult to realize when dealing with the broader range of organismal diversity found in natural and engineered environments. At the same time, advances in sequencing and mass spectrometry platforms continue to lower the cost of data generation, resulting in exponential increases in the volume and complexity of multi-omic information (DNA, RNA, protein and metabolite) available for metabolic inference [4].

Fig 1. Genomic information hierarchy encompassing individual, population and community levels of cellular organization.

(a) Building on the BioCyc curation-tiered structure of Pathway/Genome Databases (PGDBs) constructed from organismal genomes, two additional data structures are resolved from single-cell and plurality sequencing methods to define a 4 tiered hierarchy (T1-4) in descending order of manual curation and functional validation. (b) Completion scales for organismal genomes, single-cell amplified genomes (SAGs) and metagenome assembled genomes (MAGs) within the 4 tiered information hierarchy. Genome completion will have a direct effect on metabolic inference outcomes, with incomplete organismal genomes, SAGs or MAGs resolving fewer metabolic interactions.

https://doi.org/10.1371/journal.pcbi.1008174.g001

Over the past three decades, several trusted sources have emerged to collect and curate reactomes and pathways based on biochemical knowledge including the Kyoto Encyclopedia of Genes and Genomes (KEGG) [5], Reactome [6], and MetaCyc [7]. MetaCyc is a multi-organism member of the BioCyc collection of Pathway/Genome Databases (PGDBs) [8] that contains only experimentally validated metabolic pathways across all domains of life (currently over 2766 pathways from 3067 different organisms). Pathway/Genome Databases can be constructed in Pathway Tools, a production-quality software environment developed at SRI that supports metabolic inference based on the MetaCyc database [9]. Navigable and extensively commented pathway descriptions, literature citations, and enzyme properties combined within a PGDB provide a coherent structure for exploring and interpreting pathways in genomes to biomes. Metabolic inference in Pathway Tools is based on the use of a rule-based algorithm called PathoLogic [10] producing organismal PGDBs, e.g. EcoCyc [11], stored in repositories, e.g. BioCyc [12], that can be refined based on experimental validation. In addition to organismal PGDBs, PathoLogic can be used to produce microbiome or environmental Pathway/Genome Databases (ePGDBs) representing community level metabolic models, e.g. T4 on the information hierarchy in (Fig 1) [13–15], that can also be stored in open source repositories, e.g. EngCyc or GutCyc [14, 16].

While PathoLogic provides a powerful engine for pathway-centric inference, it is a hard coded and relatively inflexible application that does not scale efficiently for community sequencing projects. Moreover, PathoLogic does not provide probability scores associated with inferred pathways, further limiting its statistical power with respect to false discovery. An alternative inference method called MinPath uses integer programming to identify the minimum number of pathways that can be described given a set of defined input sequences, e.g. KO family annotations in KEGG [17]. However, such a parsimony approach is prone to false negatives and can be difficult to scale. Issues of probability and scale have led to the consideration of machine learning (ML) approaches for pathway prediction based on rich feature information. Dale and colleagues conducted a comprehensive comparison of PathoLogic to different types of supervised ML algorithms including naive Bayes, k-nearest neighbors, decision trees and logistic regression, converting PathoLogic rules into features and defining new features for pathway inference [18]. They evaluated these algorithms on experimentally validated pathways from six T1 PGDBs in the BioCyc collection randomly divided into training and test sets. Resulting performance metrics indicated that generic ML methods equaled or marginally exceeded the performance of PathoLogic with the benefit of probability estimation for pathway presence and increased flexibility and transparency of use.

Despite the potential benefits of adopting ML methods for pathway prediction from genomic sequence information, PathoLogic remains the primary inference engine of Pathway Tools [9], and alternative methods for pathway-centric inference expanding on the algorithms evaluated by Dale and colleagues remain nascent. Several recent efforts incorporate metabolite information to improve pathway inference and reaction rules to infer metabolic pathways [3, 19–21]. Others, including BiomeNet [22] and MetaNetSim [23], omit pathways and model reaction networks based on enzyme abundance information. Here we describe a multi-label classification approach to metabolic pathway inference using rich pathway feature information called mlLGPR, multi-label based on logistic regression for pathway prediction. mlLGPR uses logistic regression and feature vectors inspired by the work of Dale and colleagues to predict metabolic pathways for individual genomes as well as more complex cellular communities, e.g. microbiomes. We evaluate mlLGPR performance in relation to other inference methods including PathoLogic and MinPath on a set of T1 PGDBs alone and in combination from the BioCyc collection, symbiont genomes encoding distributed metabolic pathways for amino acid biosynthesis [24], genomes used in the Critical Assessment of Metagenome Interpretation (CAMI) initiative [25], and whole genome shotgun sequences from the Hawaii Ocean Time Series (HOTS) [26].

The mlLGPR method

In this section, we provide a series of definitions and the problem formulation followed by a description of mlLGPR components (Fig 2) including: i)- features representation, ii)- the prediction model, and iii)- the multi-label learning process. mlLGPR was written in Python v3 and depends on scikit-learn v0.20 [27], NumPy v1.16 [28], NetworkX v2.3 [29], and SciPy v1.4 [30]. The mlLGPR workflow is presented in (Fig 2).

Fig 2. mlLGPR workflow.

Datasets spanning the information hierarchy are used in feature engineering. The Synthetic dataset with features is split into training and test sets and used to train mlLGPR. Test data from the Gold Standard dataset (T1) with features and Synthetic dataset (T1-3) with features is used to evaluate mlLGPR performance prior to the application of mlLGPR on experimental datasets (T4) from different sources.

https://doi.org/10.1371/journal.pcbi.1008174.g002

Definitions and problem formulation

Here, a vector is taken by default to be a column vector and is represented by a boldface lowercase letter (e.g., x), while a matrix is denoted by a boldface uppercase letter (e.g., X). Unless otherwise mentioned, a subscript i attached to a matrix, such as Xi, indicates the i-th row of X, which is a row vector, while a subscript attached to a vector, xi, represents the i-th cell of x. An occasional superscript, x(i), indexes a sample or the current epoch during the learning period. With these notations in mind, we introduce the metabolic pathway inference problem by first defining the pathway dataset.

Metabolic pathway inference can be formulated as a supervised multi-label prediction problem, because a genome encodes multiple pathway labels per instance. Formally, let S = {(x(i), y(i)): 1 ≤ i ≤ n} be a pathway dataset consisting of n examples, where x(i) is a vector indicating abundance information for the corresponding enzymatic reactions. An enzymatic reaction is denoted by e, an element of a set E containing r possible enzymatic reactions; hence, the size of the vector x(i) is r. The abundance of an enzymatic reaction e for an example i, denoted xe(i), is defined as the number of occurrences of e in that example. The class label vector y(i) ∈ {0, 1}t is a pathway label vector of size t, the total number of pathways, which are derived from a set Y of universal metabolic pathways. The matrix forms of x(i) and y(i) are X and Y, respectively.

We further denote the d-dimensional input space of x(i) and transform each sample into an m-dimensional vector based on a transformation function Φ, where m > d. The transformation of each sample i is written Φ(x(i)), which can be described as a feature extraction and transformation process (see Section Features engineering). Given the above notation and a multi-label dataset S, we want to learn a hypothesis function from S, such that it predicts metabolic pathways in new samples as accurately as possible.
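To make this notation concrete, the short sketch below represents a toy pathway dataset as an enzymatic reaction abundance matrix X and a binary pathway label matrix Y. The dimensions and values are hypothetical and only illustrate the data layout, not the mlLGPR data loader.

    import numpy as np

    # Toy dimensions; in the paper r = 3650 enzymatic reactions and t = 2526 MetaCyc pathways.
    n, r, t = 4, 6, 3
    rng = np.random.default_rng(0)

    # X[i, e]: abundance (number of occurrences) of enzymatic reaction e in example i.
    X = rng.poisson(lam=1.0, size=(n, r)).astype(float)

    # Y[i, j]: 1 if pathway j is present in example i, 0 otherwise (multi-label targets).
    Y = (rng.random((n, t)) < 0.4).astype(int)

    print(X.shape, Y.shape)  # (4, 6) (4, 3)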

Features engineering

The design of feature vectors is critical for accurate classification and pathway inference. We consider five types of feature vectors inspired by the work of Dale and colleagues [18]: i)- the enzymatic reaction abundance vector (ϕa), ii)- the reaction evidence vector (ϕf), iii)- the pathway evidence vector (ϕy), iv)- the pathway common vector (ϕc), and v)- the possible pathways vector (ϕd). The transformation ϕa is an r-dimensional frequency vector, corresponding to the number of occurrences of each enzymatic reaction, ϕa = [a1, a2, …, ar]. An enzymatic reaction is characterized by an enzyme commission (EC) classification number [31]. The reaction evidence vector ϕf indicates the properties of the enzymatic reactions for each sample. The pathway evidence features ϕy include a modified subset of features developed by Dale and colleagues expanding on core PathoLogic rule sets to include additional information related to enzyme presence, gaps in pathways, network connectivity, taxonomic range, etc [18]. The pathway common feature vector ϕc for a sample x(i) is an r-dimensional binary vector, and the possible pathways vector ϕd is a t-dimensional binary vector. Each transformation function maps x to a vector of different dimension, and the concatenated feature vector Φ = [ϕa(x(i)), ϕf(x(i)), ϕy(x(i)), ϕc(x(i)), ϕd(x(i))] has a total of m dimensions for each sample. For a more in-depth description of the feature engineering process please refer to S2 Appendix.
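As a rough illustration of the transformation Φ, the sketch below concatenates simplified stand-ins for the five feature blocks into a single feature vector per sample. The helper computations are hypothetical simplifications; the actual evidence features are described in S2 Appendix.

    import numpy as np

    def transform(x, pathway_to_reactions):
        """Concatenate simplified versions of the five feature blocks for one sample x."""
        observed = x > 0
        phi_a = x                                              # abundance features (r-dim)
        phi_f = np.array([observed.sum(), x.sum()])            # stand-in reaction evidence features
        phi_y = np.array([observed[rxns].mean()                # stand-in pathway evidence features
                          for rxns in pathway_to_reactions])
        phi_c = observed.astype(float)                         # pathway common features (binary, r-dim)
        phi_d = np.array([float(observed[rxns].any())          # possible pathway features (binary, t-dim)
                          for rxns in pathway_to_reactions])
        return np.concatenate([phi_a, phi_f, phi_y, phi_c, phi_d])

    # Toy usage: 6 reactions and 3 pathways defined by hypothetical reaction indices.
    x = np.array([2.0, 0.0, 1.0, 0.0, 3.0, 0.0])
    pathways = [np.array([0, 1]), np.array([2, 4]), np.array([3, 5])]
    print(transform(x, pathways))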

Prediction model

We use the logistic regression (LR) model to infer a set of pathways given an instance feature vector Φ(x(i)). LR was selected because of its proven power in discriminative classification across a variety of supervised machine learning problems [32]. In addition to the direct probabilistic interpretation integrated into the model, LR can handle high-dimensional data efficiently. The LR model represents conditional probabilities through a non-linear logistic function f(.) defined as

f(θj, Φ(x(i))) = 1 / (1 + exp(−θj ⋅ Φ(x(i))))    (1)

where yj(i) is the j-th element of the label vector y(i) ∈ {0, 1}t and θj is an m-dimensional weight vector for the j-th pathway. Each element of Φ(x(i)) corresponds to an element of θj for the j-th class; therefore, we can retrieve important features that contribute to the prediction of j by sorting the elements of Φ(x(i)) according to the corresponding values of the weight vector θj. Eq 1 is evaluated for all t classes for an instance i, hence multi-labeling, and the per-pathway results are stored in a vector. Predicted pathways are reported based on a cut-off threshold τ, which is set to 0.5 by default:

ŷ(i) = vec(1(f(θj, Φ(x(i))) ≥ τ): ∀j ≤ t)    (2)

where vec denotes a vectorized operation over all pathways. Given that Eq 1 produces a conditional probability for each pathway, and the j-th class label will be included in y(i) only if f(θj, Φ(x(i))) ≥ τ, we also adopt a soft decision boundary using the T-criterion rule [33]:

ŷ(i) = vec(1(f(θj, Φ(x(i))) ≥ fmax): ∀j ≤ t)    (3)

where fmax = β ⋅ max({f(θj, Φ(x(i))): ∀j ≤ t}) is the scaled maximum predictive probability score. The hyper-parameter β ∈ (0, 1] must be tuned based on empirical information, and it cannot be set to 0, which would imply retrieving all t pathways. The set of pathways predicted using Eq 3 is referred to as an adaptive prediction because the decision boundary, and its corresponding threshold, are tuned to the test data [34].
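A minimal sketch of the decision rules in Eqs 1-3 is given below, assuming a weight matrix with one row per pathway. It illustrates the fixed threshold τ and the adaptive β threshold and is not the packaged mlLGPR predictor.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_pathways(theta, phi_x, tau=0.5, beta=None):
        """Per-pathway probabilities and binary labels for a single sample.

        theta: (t, m) weight matrix, one row per pathway (theta_j in Eq 1).
        phi_x: (m,) transformed feature vector Phi(x).
        tau:   fixed cut-off threshold (Eq 2).
        beta:  if given, use the adaptive threshold beta * max probability (Eq 3).
        """
        probs = sigmoid(theta @ phi_x)            # f(theta_j, Phi(x)) for every pathway j
        if beta is None:
            labels = (probs >= tau).astype(int)   # fixed decision boundary (Eq 2)
        else:
            f_max = beta * probs.max()            # adaptive decision boundary (Eq 3)
            labels = (probs >= f_max).astype(int)
        return probs, labels

    # Toy usage with hypothetical weights: t = 3 pathways, m = 4 features.
    rng = np.random.default_rng(1)
    theta = rng.normal(size=(3, 4))
    phi_x = rng.normal(size=4)
    print(predict_pathways(theta, phi_x, tau=0.5))
    print(predict_pathways(theta, phi_x, beta=0.9))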

Multi-label learning process

The process is decomposed into t independent binary classification problems, where each binary classification problem corresponds to a possible pathway in the label space. Then, LR is used to define a binary classifier f(.), such that for a training example (Φ(x(i)), y(i)), an instance Φ(x(i)) will be involved in the learning process of t binary classifiers. Given n training samples, we attempt to estimate each of the weight vectors θ1, θ2, …, θt individually by maximizing the logistic log-likelihood function:

ll(θj) = ∑i=1…n [yj(i) log f(θj, Φ(x(i))) + (1 − yj(i)) log(1 − f(θj, Φ(x(i))))]    (4)

Usually, a penalty or regularization term Ω(θj) is inserted into the loss function to enhance the generalization properties to unseen data, particularly if the dimension m of features is high. Thus, the overall objective cost function (after dropping the maximized term for brevity) is defined as:

C(θj) = ll(θj) − λ Ω(θj)    (5)

where λ > 0 is a hyper-parameter that controls the trade-off between ll(θj) and Ω(θj). Here, the regularization term Ω(θj) is chosen to be the elastic net:

Ω(θj) = ((1 − α)/2) ‖θj‖2² + α ‖θj‖1    (6)

The elastic net penalty of Eq 6 is a compromise between the L1 penalty of LASSO (by setting α = 1) and the L2 penalty of ridge regression (by setting α = 0) [35]. While the L1 term of the elastic net aims to remove irrelevant variables by forcing some coefficients of θj to 0, leading to a sparse vector θj, the L2 penalty ensures that highly correlated variables have similar regression coefficients. Substituting Eq 6 into Eq 5 yields the following objective function:

C(θj) = ll(θj) − λ [((1 − α)/2) ‖θj‖2² + α ‖θj‖1]    (7)

During learning, the aim is to estimate parameters θj so as to maximize C(θj), which is convex; however, the last term of Eq 7 is non-differentiable, making the equation non-smooth. For the rightmost term, we apply the sub-gradient [36] method, allowing the optimization problem to be solved using mini-batch gradient descent (GD) [37]. We initialize θj with random values, followed by iterations that maximize the cost function C(θj) with the following derivative:

∂C(θj)/∂θj = ∑i=1…n (yj(i) − f(θj, Φ(x(i)))) Φ(x(i)) − λ [(1 − α) θj + α sign(θj)]    (8)

Finally, the update algorithm for θj at each iteration is obtained as:

θj(u+1) = θj(u) + η(u) ∂C(θj(u))/∂θj    (9)

where u is the current step and η(u) is the learning rate at that step. The mathematical derivation of the algorithm can be found in S1 Appendix.
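The sketch below illustrates the per-pathway mini-batch sub-gradient update of Eqs 7-9 for a single binary classifier. The learning-rate schedule and default hyper-parameter values here are assumptions chosen for illustration, not the released mlLGPR optimizer.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_one_pathway(Phi, y, lam=1.0, alpha=0.65, eta0=0.01,
                        epochs=3, batch_size=2, seed=0):
        """Estimate theta_j for one pathway by mini-batch sub-gradient ascent
        on the elastic-net penalized log-likelihood (Eq 7)."""
        rng = np.random.default_rng(seed)
        n, m = Phi.shape
        theta = rng.uniform(0.0, 1.0, size=m)    # random initialization in [0, 1]
        step = 0
        for _ in range(epochs):
            order = rng.permutation(n)
            for start in range(0, n, batch_size):
                batch = order[start:start + batch_size]
                step += 1
                eta = eta0 / np.sqrt(step)       # assumed decaying learning rate eta(u)
                p = sigmoid(Phi[batch] @ theta)
                # Gradient of the log-likelihood term (data part of Eq 8).
                grad_ll = Phi[batch].T @ (y[batch] - p)
                # Sub-gradient of the elastic-net penalty (Eq 6); sign(0) = 0 is a valid sub-gradient.
                grad_pen = lam * ((1.0 - alpha) * theta + alpha * np.sign(theta))
                theta += eta * (grad_ll - grad_pen)  # ascent step (Eq 9)
        return theta

    # Toy usage: 8 samples, 5 features, binary labels for one pathway.
    rng = np.random.default_rng(2)
    Phi = rng.normal(size=(8, 5))
    y = (rng.random(8) < 0.5).astype(float)
    print(fit_one_pathway(Phi, y))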

Experimental setup

In this section, we describe the experimental framework used to demonstrate mlLGPR pathway prediction performance across multiple datasets spanning the genomic information hierarchy (Fig 1). MetaCyc version 21, containing 2526 base pathways and 3650 enzymatic reactions, was used as a trusted source to generate samples, build features, and validate results from the prediction algorithms, as outlined in Section Results. For training, we constructed two synthetic datasets, Synset-1 and Synset-2, using a Poisson distribution to subsample pathways from a list of MetaCyc pathways, following previous work [22].
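One plausible reading of this construction is sketched below: draw the number of pathways per synthetic sample from a Poisson distribution, sample that many pathways from the MetaCyc list, and emit the reactions of the selected pathways as the sample's enzymatic reaction content. The exact scheme used to build Synset-1 and Synset-2 may differ, so treat this purely as an assumption-laden illustration.

    import numpy as np

    def synthesize_sample(pathway_to_reactions, num_reactions, lam=20, rng=None):
        """Assumed construction of one synthetic sample: Poisson-subsample pathways,
        then record the abundance of the reactions contributed by the chosen pathways."""
        rng = rng or np.random.default_rng()
        num_pathways = max(1, rng.poisson(lam))
        chosen = rng.choice(len(pathway_to_reactions),
                            size=min(num_pathways, len(pathway_to_reactions)),
                            replace=False)
        x = np.zeros(num_reactions)
        y = np.zeros(len(pathway_to_reactions), dtype=int)
        for j in chosen:
            y[j] = 1
            for e in pathway_to_reactions[j]:
                x[e] += 1                        # count each reaction contributed by pathway j
        return x, y

    # Toy usage with 3 hypothetical pathways over 6 reactions.
    pathways = [np.array([0, 1]), np.array([2, 4]), np.array([3, 5])]
    x, y = synthesize_sample(pathways, num_reactions=6, lam=2, rng=np.random.default_rng(3))
    print(x, y)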

We evaluated mlLGPR performance using a corpus of 12 experimental datasets manifesting diverse multi-label properties, including manually curated organismal genomes, synthetic microbial communities and low complexity microbial communities. The T1 golden dataset consisted of six PGDBs: AraCyc, EcoCyc, HumanCyc, LeishCyc, TrypanoCyc, and YeastCyc. A composite golden dataset, referred to as SixDB, consisted of 63 permuted combinations of T1 PGDBs. In addition to datasets derived from the BioCyc collection, we evaluated performance using low complexity data from Moranella (GenBank NC-015735) and Tremblaya (GenBank NC-015736) symbiont genomes encoding distributed metabolic pathways for amino acid biosynthesis [24], the Critical Assessment of Metagenome Interpretation (CAMI) initiative low complexity dataset [25], and whole genome shotgun sequences from the Hawaii Ocean Time Series (HOTS) at 25 m, 75 m, 110 m (sunlit) and 500 m (dark) ocean depth intervals [26]. More information about the datasets is summarized in S3 Appendix.

mlLGPR performance was compared to four additional prediction methods: BASELINE, Naïve v1.2 [17], MinPath v1.2 [17] and PathoLogic v21 [10]. In the BASELINE method, the enzymatic reactions of an example x(i) are mapped directly onto the true representations of all known pathways, and a cutoff threshold (0.5) is then applied to retrieve a list of pathways for that example. In the Naïve method, reactions are randomly predicted from MetaCyc and linked together to construct pathways that are accepted or rejected based on a specified cut-off threshold, typically set to 0.5. If one or more enzymatic reactions are assigned to a pathway then that pathway is identified as present; otherwise, it is rejected. MinPath recovers the minimal set of pathways that can explain the observed enzymatic reactions through an iterative constrained optimization process using an integer programming algorithm [38]. PathoLogic is a rule-based metabolic inference method incorporating manually curated biochemical information in a two step process that first produces a reactome that is in turn used to predict metabolic pathways within a PGDB [10].
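For the BASELINE method, one simple coverage-style interpretation is sketched below: observed reactions are mapped onto each pathway's reaction set and the 0.5 cut-off is applied to the resulting coverage. This is our reading of the description above, with hypothetical pathway definitions, not the authors' exact implementation.

    import numpy as np

    def baseline_predict(x, pathway_to_reactions, threshold=0.5):
        """Map observed enzymatic reactions onto the true representation of each pathway
        and call a pathway present when its reaction coverage meets the cut-off."""
        observed = x > 0
        coverage = np.array([observed[rxns].mean() for rxns in pathway_to_reactions])
        return (coverage >= threshold).astype(int), coverage

    # Toy usage: 6 reactions and 3 hypothetical pathways.
    x = np.array([2, 0, 1, 0, 3, 0])
    pathways = [np.array([0, 1]), np.array([2, 4]), np.array([3, 5])]
    print(baseline_predict(x, pathways))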

For training purposes, Synset-1 and Synset-2 were each subdivided into three subsets (training, validation, and test) using a stratified sampling approach [39], resulting in 10,869 training, 1,938 validation and 2,193 test samples for Synset-1 and 10,813 training, 1,930 validation, and 2,257 test instances for Synset-2. Feature extraction was performed for each dataset in Table 1, resulting in a total feature vector size of 12,452 for each instance, where |ϕa| = 3650, |ϕf| = 68, |ϕy| = 32, |ϕc| = 3650, and |ϕd| = 5052. Key hyper-parameter settings included Θ initialized to uniform random values in the range [0, 1], batch size set to 500, epoch number set to 3, the adaptive prediction hyper-parameter β in the range (0, 1], and regularization hyper-parameters λ and α set to 10000 and 0.65, respectively. The learning rate η was decayed as a function of the current step u. The validation set was used to determine critical values of λ and α. Default parameter settings were used for MinPath and PathoLogic. All tests were conducted on a Linux server using 10 cores on an Intel Xeon CPU E5-2650.

Table 1. Experimental dataset properties.

The notations |S|, L(S), LCard(S), LDen(S), DL(S), and PDL(S) represent the number of instances, the number of pathway labels, the pathway label cardinality, the pathway label density, the distinct pathway label sets, and the proportion of distinct pathway label sets for a dataset S, respectively. The notations R(S), RCard(S), RDen(S), DR(S), and PDR(S) have analogous meanings for the enzymatic reactions in S. PLR(S) represents the ratio of L(S) to R(S). The last column denotes the domain of S.

https://doi.org/10.1371/journal.pcbi.1008174.t001

Performance metrics

The following metrics were used to report the performance of the prediction algorithms used in the experimental framework outlined above: average precision, average recall, average F1 score (F1), and Hamming loss [40].

Formally, let y(i) and ŷ(i) denote the true and predicted pathway sets for the i-th sample, respectively. Then, the four measurements are defined as:

Average Precision = (1/n) ∑i=1…n |y(i) ∩ ŷ(i)| / |ŷ(i)|    (10)

Average Recall = (1/n) ∑i=1…n |y(i) ∩ ŷ(i)| / |y(i)|    (11)

Average F1 = (1/n) ∑i=1…n 2 |y(i) ∩ ŷ(i)| / (|y(i)| + |ŷ(i)|)    (12)

Hamming loss (hloss) = (1/(n t)) ∑i=1…n ∑j=1…t 1(yj(i) ≠ ŷj(i))    (13)

where 1(.) denotes the indicator function. Each metric is averaged over the sample size.

The values of average precision, average recall, and average F1 vary between 0 and 1, with 1 being the optimal score. Average precision relates the number of true pathways to the number of predicted pathways including false positives, while average recall relates the number of true pathways to the total number of expected pathways including false negatives. While recall tells us about the ability of each prediction method to find relevant pathways, precision tells us about the accuracy of those predictions. Average F1 represents the harmonic mean of average precision and average recall, taking the trade-off between the two metrics into account. The Hamming loss (hloss) is the fraction of pathways that are incorrectly predicted, providing a useful performance indicator. From Eq 13, we observe that when all of the pathways are correctly predicted, hloss = 0, whereas the other metrics will be equal to 1. On the other hand, when the predictions of all pathways are completely incorrect, hloss = 1, whereas the other metrics will be equal to 0.
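A compact sketch of the example-based metrics in Eqs 10-13, computed from binary label matrices, is given below. It is an independent illustration rather than the evaluation code shipped with mlLGPR.

    import numpy as np

    def multilabel_metrics(Y_true, Y_pred):
        """Example-based average precision, recall, F1 and Hamming loss (Eqs 10-13)
        for binary label matrices of shape (n samples, t pathways)."""
        Y_true = np.asarray(Y_true, dtype=bool)
        Y_pred = np.asarray(Y_pred, dtype=bool)
        tp = (Y_true & Y_pred).sum(axis=1)            # true pathways that were also predicted
        pred = Y_pred.sum(axis=1)
        true = Y_true.sum(axis=1)
        precision = np.where(pred > 0, tp / np.maximum(pred, 1), 0.0).mean()
        recall = np.where(true > 0, tp / np.maximum(true, 1), 0.0).mean()
        f1 = np.where(pred + true > 0, 2 * tp / np.maximum(pred + true, 1), 0.0).mean()
        hloss = (Y_true != Y_pred).mean()             # fraction of mispredicted pathway labels
        return precision, recall, f1, hloss

    # Toy usage with 2 samples and 4 pathway labels.
    Y_true = [[1, 0, 1, 1], [0, 1, 0, 0]]
    Y_pred = [[1, 1, 1, 0], [0, 1, 0, 1]]
    print(multilabel_metrics(Y_true, Y_pred))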

Results

Four types of analysis including parameter sensitivity, features selection, robustness, and pathway prediction potential were used to tune and evaluate mlLGPR performance in relation to other pathway prediction methods.

Parameter sensitivity

Experimental setup.

Three consecutive tests were performed to ascertain: 1)- the impact of L1, L2, and elastic-net (EN) regularizers on mlLGPR performance using T1 golden datasets, 2)- the impact of changing hyper-parameter λ ∈ {1, 10, 100, 1000, 10000} using T1 golden datasets, and 3)- the impact of adaptive beta β ∈ (0, 1] using Synset-2 and the SixDB golden dataset.

Experimental results.

Table 2 indicates test results across different mlLGPR configurations. Although the F1 scores of mlLGPR-L1, mlLGPR-L2 and mlLGPR-EN were comparable, precision and recall scores were inconsistent across the T1 golden datasets. For example, high precision scores were observed for mlLGPR-L2 on AraCyc (0.8418) and YeastCyc (0.7934) with low recall scores of 0.5529 and 0.7380, respectively. In contrast, high recall scores were observed for mlLGPR-L1 on AraCyc (0.7275) and YeastCyc (0.8690) with low precision scores of 0.7390 and 0.6815, respectively. The increased recall with reduced precision scores by mlLGPR-L1 indicates a low variance model that may eliminate many relevant coefficients. The impact is especially observed for datasets encoding a small number of pathways as is the case for LeishCyc (87 pathways) and TrypanoCyc (175 pathways). Similarly, the increased precision with reduced recall scores by mlLGPR-L2 is a consequence of the existence of highly correlated features present in the test datasets [41], resulting in a high variance model. The impact is especially observed for LeishCyc and TrypanoCyc suggesting that mlLGPR-L2 performance declines with increasing pathway number. mlLGPR-EN tended to even out the scores relative to mlLGPR-L1 and mlLGPR-L2 providing more balanced performance outcomes.

Table 2. Predictive performance of mlLGPR on T1 golden datasets.

mlLGPR-L1: mlLGPR with the L1 regularizer; mlLGPR-L2: mlLGPR with the L2 regularizer; mlLGPR-EN: mlLGPR with the elastic net penalty. AB: abundance features, RE: reaction evidence features, and PE: pathway evidence features. For each performance metric, ‘↓’ indicates that a lower score is better while ‘↑’ indicates that a higher score is better.

https://doi.org/10.1371/journal.pcbi.1008174.t002

Based on these results, hyper-parameters λ and β were tested to tune mlLGPR-EN performance. Fig 3 indicates that the F1 score increases monotonically with the regularization hyper-parameter λ for the T1 golden datasets, peaking at λ = 10000 (with an F1 score > 0.6 for all datasets). For the adaptive β test, Fig 4 shows the performance of mlLGPR-EN on Synset-2 test samples across a range of β ∈ (0, 1] values, indicating that this hyper-parameter has minimal impact on performance.

Fig 3. Average F1 score of mlLGPR-EN on a range of regularization hyper-parameter λ ∈ {1, 10, 100, 1000, 10000} values on EcoCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, TrypanoCyc, and SixDB dataset.

The x-axis is log scaled.

https://doi.org/10.1371/journal.pcbi.1008174.g003

Fig 4. Performance of mlLGPR-EN according to the β adaptive decision hyper-parameter on datasets.

(a)- Synset-2 test dataset. (b)- SixDB dataset.

https://doi.org/10.1371/journal.pcbi.1008174.g004

Taken together, parameter testing results indicated that mlLGPR-EN provided the most balanced implementation of mlLGPR, and the regularization hyper-parameter λ at 10000 resulted in the best performance for T1 golden datasets. This hyper-parameter should be tuned when applied to new datasets to reduce false positive pathway discovery. Minimal effects on prediction performance were observed when testing the adaptive hyper-parameter β.

Features selection

Experimental setup.

A series of feature set “ablation” tests was conducted using Synset-2 as a training set in a reverse manner, starting with only the reaction abundance features (AB), a fundamental feature set consisting of 3650 features, and then successively aggregating additional feature sets while recording predictive performance on golden T1 datasets using the settings and metrics described in Section Experimental setup. Because testing individual features is not practical, this form of aggregate testing provides a tractable method to identify the relative contribution of feature sets to pathway prediction performance.

Experimental results.

Table 3 indicates ablation test results. The AB feature set promotes the highest average recall on EcoCyc (0.9511) and a comparable F1 score of 0.6952. This is not unexpected given that the ratio of pathways to enzymatic reactions (PLR), indicated by EC numbers, is high for EcoCyc. However, although functional annotations with EC numbers increase the probability of predicting a given pathway, pathways with few or no EC numbers, such as pregnenolone biosynthesis, require additional feature sets to avoid false negatives. As additional feature sets are aggregated, mlLGPR-EN performance tends to improve unevenly for different T1 organismal genomes. For example, adding the enzymatic reaction evidence (RE) feature set, consisting of 68 features, to the AB feature set improves F1 scores for YeastCyc (0.7394), LeishCyc (0.5830), and TrypanoCyc (0.6753). Aggregating the pathway evidence (PE) feature set, consisting of 32 features, with the AB feature set improves the F1 score for AraCyc (0.7532) but reduces the F1 score for the remaining T1 organismal genomes. Aggregating the AB, RE and PE feature sets resulted in the highest F1 scores for HumanCyc (0.7468), LeishCyc (0.6220), TrypanoCyc (0.6768), and SixDB (0.7078) with only marginal differences from the highest F1 scores for EcoCyc (0.7275) and AraCyc (0.7343). Additional combinations of features did not improve overall performance across the T1 golden datasets.

Table 3. Ablation tests of mlLGPR-EN trained using Synset-2 on T1 golden datasets.

AB: abundance features, RE: reaction evidence features, PP: possible pathway features, PE: pathway evidence features, and PC: pathway common features. mlLGPR is trained using a combination of features, represented by mlLGPR-*, on the Synset-2 training set. For each performance metric, ‘↓’ indicates that a lower score is better while ‘↑’ indicates that a higher score is better.

https://doi.org/10.1371/journal.pcbi.1008174.t003

Taken together, ablation testing results indicated that mlLGPR-EN in combination with AB, RE and PE feature sets result in the most even pathway prediction performance for golden T1 datasets.

Robustness

Experimental setup.

Robustness, also known as accuracy loss rate, was determined for mlLGPR-EN with AB, RE and PE feature sets using the intact Synset-1 dataset and a “corrupted” or noisy version of the Synset-2 dataset. Relative loss of accuracy (RLA) and equalized loss of accuracy (ELA) scores [42] were used to describe the expected behavior of mlLGPR-EN in relation to introduced noise. The ELA score, explained in Section 2 in S3 Appendix, encompasses i)- the robustness of a model determined at a controlled noise threshold ρ, and ii)- the performance of a model without noise, i.e., s(M0), where s represents the F1 score for a model M0 without noise (any performance metric can be employed). A low robustness score indicates that the model continues to exhibit good performance with increasing background noise.
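As a reference for how these scores behave, the sketch below computes RLA and ELA from a noise-free score s(M0) and a noisy score s(Mρ). The formulas follow our reading of Sáez and colleagues [42] with scores on a 0-1 scale, so consult S3 Appendix for the exact definitions used in this work.

    def rla(score_clean, score_noisy):
        # Relative loss of accuracy at noise level rho: the fraction of the
        # noise-free score s(M0) lost when noise is introduced.
        return (score_clean - score_noisy) / score_clean

    def ela(score_clean, score_noisy):
        # Equalized loss of accuracy: combines robustness with noise-free
        # performance; assumes scores (e.g. F1) expressed on a 0-1 scale.
        return (1.0 - score_noisy) / score_clean

    # Toy usage: F1 of 0.73 without noise and 0.70 at rho-level noise.
    print(rla(0.73, 0.70), ela(0.73, 0.70))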

Experimental results.

Table 4 shows robustness test scores. mlLGPR-EN with introduced noise performed better for HumanCyc (−0.0502), YeastCyc (−0.0301), LeishCyc (−0.1189), and TrypanoCyc (−0.0151), but was less robust for AraCyc (0.0416) and SixDB (0.0470) based on RLAρ scores. This suggests that robustness to noise inversely correlates with the number of pathways in a dataset: the more pathways present within a dataset, the more the correlations among features can be upset, whereas the impact of these negative correlations is minimized when a dataset contains fewer pathways. Note that the average number of ECs associated with pathways has little or negligible effect on robustness.

Table 4. Performance and robustness scores for mlLGPR-EN with AB, RE and PE feature sets trained on both Synset-1 and Synset-2 training sets at 0 and ρ noise.

The best performance scores are highlighted in bold. The ‘↓’ indicates the lower score is better while ‘↑’ indicates the higher score is better.

https://doi.org/10.1371/journal.pcbi.1008174.t004

Taken together, the RLA and ELA results for T1 golden datasets indicate that mlLGPR-EN trained on noisy datasets is robust to perturbation. This is a prerequisite for developing supervised ML methods tuned for community-level pathway prediction.

Pathway prediction potential

Experimental setup.

Pathway prediction potential of mlLGPR-EN with AB, RE and PE feature sets trained on the Synset-2 training set was compared to four additional prediction methods (BASELINE, Naïve v1.2 [17], MinPath v1.2 [17] and PathoLogic v21 [10]) on T1 golden datasets using the settings and metrics described above. For community-level pathway prediction on the T4 datasets, including the symbiont, CAMI low complexity, and HOTS datasets, mlLGPR-EN and PathoLogic (without taxonomic pruning) results were compared.

Experimental results.

Table 5 shows performance scores for each pathway prediction method tested. The BASELINE, Naïve, and MinPath methods infer many false positive pathways across the T1 golden datasets, indicated by high recall with low precision and F1 scores. In contrast, high precision and F1 scores were observed for PathoLogic and mlLGPR-EN across the T1 golden datasets. Although both methods gave similar results, PathoLogic F1 scores for EcoCyc (0.7631), YeastCyc (0.7890) and SixDB (0.7479) exceeded those for mlLGPR-EN. Conversely, mlLGPR-EN F1 scores for HumanCyc (0.7468), AraCyc (0.7343), LeishCyc (0.6220) and TrypanoCyc (0.6768) exceeded those for PathoLogic.

Table 5. Pathway prediction performance between methods using T1 golden datasets.

mlLGPR-EN: mlLGPR with the elastic net penalty. AB: abundance features, RE: reaction evidence features, and PE: pathway evidence features. For each performance metric, ‘↓’ indicates that a lower score is better while ‘↑’ indicates that a higher score is better.

https://doi.org/10.1371/journal.pcbi.1008174.t005

To evaluate mlLGPR-EN performance on distributed metabolic pathway prediction between two or more interacting organismal genomes, a symbiotic system consisting of the reduced genomes of Candidatus Moranella endobia and Candidatus Tremblaya princeps, encoding a previously identified set of distributed amino acid biosynthetic pathways [24], was selected. mlLGPR-EN and PathoLogic were used to predict pathways on the individual symbiont genomes and on a composite genome consisting of both, and the resulting amino acid biosynthetic pathway distributions were determined (Fig 5). mlLGPR-EN predicted 8 out of 9 expected amino acid biosynthetic pathways on the composite genome, while PathoLogic recovered 6. The missing phenylalanine biosynthesis pathway (L-phenylalanine biosynthesis I) was excluded from the analysis because the associated genes were reported to be missing during the ORF prediction process. False positives were predicted for the individual Moranella and Tremblaya genomes using both methods, although pathway coverage was low compared to the composite genome. Additional feature information restricting the taxonomic range of certain pathways, or more restrictive pathway coverage requirements, could reduce false discovery on individual organismal genomes.

Fig 5. Predicted pathways for symbiont datasets between mlLGPR-EN with AB, RE and PE feature sets and PathoLogic.

Red circles indicate that neither method predicted a specific pathway while green circles indicate that both methods predicted a specific pathway. Blue circles indicate pathways predicted solely by mlLGPR. The size of circles scales with reaction abundance information.

https://doi.org/10.1371/journal.pcbi.1008174.g005

To evaluate pathway prediction performance of mlLGPR-EN on more complex community-level genomes, the CAMI low complexity and HOTS datasets were selected. Table G in S3 Appendix shows performance scores for mlLGPR-EN on the CAMI dataset. Although recall was high (0.7827), precision and F1 scores were low when compared to the T1 golden datasets. Similar results were obtained for the HOTS dataset. In both cases it is difficult to validate most pathway prediction results without individual organismal genomes that can be replicated in culture. Moreover, the total number of expected pathways per dataset is relatively large, encompassing metabolic interactions at different levels of biological organization. On the one hand, these open conditions confound interpretation of performance metrics, while on the other they present numerous opportunities for hypothesis generation and testing. To better constrain this tension, mlLGPR-EN and PathoLogic prediction results were compared for a subset of 45 pathways previously reported in the HOTS dataset [14]. Fig 6 shows pathway distributions spanning sunlit and dark ocean waters predicted by PathoLogic and mlLGPR-EN, grouped according to higher order functions within the MetaCyc classification hierarchy. Between the 25 and 500 m depth intervals, 7 pathways were exclusively predicted by PathoLogic and 6 were exclusively predicted by mlLGPR-EN. Another 20 pathways were predicted by both methods, while 6 pathways were not predicted by either method, including glycine biosynthesis IV, thiamine diphosphate biosynthesis II and IV, flavonoid biosynthesis, 2-methylcitrate cycle II and L-methionine degradation III. In several instances, the depth distributions of predicted pathways also differed from those described in [14], including L-selenocysteine biosynthesis II and acetate formation from acetyl-CoA II. It remains uncertain why the current implementation of PathoLogic produced inconsistent pathway prediction results, although changes have accrued in the PathoLogic rules and the structure of the MetaCyc classification hierarchy in the intervening time interval.

Fig 6. Comparison of predicted pathways for HOTS datasets between mlLGPR-EN with AB, RE and PE feature sets and PathoLogic.

Red circles indicate that neither method predicted a specific pathway while green circles indicate that both methods predicted a specific pathway. Blue circles indicate pathways predicted solely by mlLGPR and gray circles indicate pathways solely predicted by PathoLogic. The size of circles scales with reaction abundance information.

https://doi.org/10.1371/journal.pcbi.1008174.g006

Taken together, the comparative pathway prediction results indicate that mlLGPR-EN performance equals or exceeds other methods including PathoLogic on organismal genomes but diminishes with dataset complexity.

Discussion

We have developed mlLGPR, a new method using multi-label classification and logistic regression to predict metabolic pathways at different levels in the genomic information hierarchy (Fig 1). mlLGPR effectively maps annotated enzymatic reactions, using EC numbers, onto reference metabolic pathways sourced from the MetaCyc database. We provide a detailed open source process from features engineering and the construction of synthetic samples, on which mlLGPR is trained, to performance testing on increasingly complex real world datasets including organismal genomes, nested symbionts, CAMI low complexity and HOTS. With respect to features engineering, five feature sets were re-designed from Dale and colleagues [18] to guide the learning process. Feature ablation studies demonstrated the usefulness of aggregating different combinations of feature sets using the elastic-net (EN) regularizer to improve mlLGPR prediction performance on golden datasets. Using this process we determined that the abundance (AB), enzymatic reaction evidence (RE) and pathway evidence (PE) feature sets contribute disproportionately to mlLGPR-EN performance. After tuning several hyper-parameters to further improve mlLGPR performance, pathway prediction outcomes were compared to other methods including MinPath and PathoLogic. The results indicated that while mlLGPR-EN performance equaled or exceeded other methods including PathoLogic on organismal genomes, its performance was more marginal on complex datasets. This is likely due to multiple factors including the limited validation information for community-level metabolism as well as the need for more subtle features engineering and algorithmic improvements.

Several issues were identified during testing and implementation that need to be resolved for improved pathway prediction outcomes using machine learning methods. While rich feature information is integral to mlLGPR performance, the current definition of feature sets relies on manual curation based on prior knowledge. We observed that in some instances the features engineering process is susceptible to noise, resulting in low performance scores. Moreover, individual enzymes may participate in multiple pathways, i.e. the multiple mapping problem, resulting in increased false discovery without additional feature sets that relate the presence and abundance of EC numbers to other factors. This problem has been partially addressed by designing features based on side knowledge of a pathway, such as information about “key-reactions” in pathways that increase the likelihood that a given pathway is present. Additional factors including taxonomy, gene expression, or environmental context should also be considered in features engineering for specific information structures. For example, taxonomic constraints on metabolic potential are difficult to use when predicting pathways at the community level given the limited number of closed genomes present in the data. In contrast, environmental context information such as physical and chemical parameter data could be used to constrain specific metabolic potential, e.g. aerobic versus anaerobic or light- versus dark-dependent processes. Missing EC numbers also present a challenge, especially when trying to define “key-reactions” in pathways with less biochemical validation. An alternative method might be to apply representational learning [43], e.g. learning features from data automatically, that can be supplemented with side knowledge to improve pathway prediction outcomes. Finally, alternative algorithms used to analyze high dimensional datasets, such as graph-based learning [44], have the potential to provide even more accurate models needed to inform future experimental design and pathway engineering efforts.

Supporting information

S1 Appendix. Mathematical derivations of mlLGPR.

This file describes the process of deriving the objective cost function in Eq 9.

https://doi.org/10.1371/journal.pcbi.1008174.s001

(PDF)

S2 Appendix. Features used for mlLGPR.

This file describes features engineering aspects of the work. Given a set of enzymatic reactions with abundance information, we extract sets of features to capture salient aspects of metabolism for pathway inference.

https://doi.org/10.1371/journal.pcbi.1008174.s002

(PDF)

S3 Appendix. Additional experiments.

This file contains additional test results that are not presented in the main article including more in-depth information related to datasets and the ELA robustness metric.

https://doi.org/10.1371/journal.pcbi.1008174.s003

(PDF)

S1 Table. Pathway abundance information from symbiont data.

MetaCyc Pathway ID: The unique identifier for the pathway as provided by MetaCyc; MetaCyc Pathway Name: The name of the pathway as outlined by MetaCyc; Moranella: the Moranella endosymbiont (GenBank NC-015735); Tremblaya: the Tremblaya endosymbiont (GenBank NC-015736); Composite: a composite genome consisting of both endosymbiont genomes. Each numeric value encodes the coverage information of a pathway associated with each endosymbiont or composite genome. The coverage is computed based on mapping enzymes onto true representations of each pathway and is within the range of [0, 1], where 1 indicates that all enzymes catalyzing reactions in a given pathway were identified while 0 means no enzymes were observed for a given pathway.

https://doi.org/10.1371/journal.pcbi.1008174.s004

(TSV)

S2 Table. Pathway abundance information from HOTS data.

MetaCyc Pathway ID: The unique identifier for the pathway as provided by MetaCyc; MetaCyc Pathway Name: The name of the pathway as outlined by MetaCyc; 25m: the 25 m depth interval in the HOTS water column; 75m: the 75m depth interval in the HOTS water column; 110m: the 110 m depth interval in the HOTS water column; and 500m: the 500 m depth interval in the HOTS water column. Each numeric value encodes abundance information for a given pathway associated with each depth interval. The abundance is expected pathway copies normalized based on mapping identified enzymes onto true representations of each selected pathway.

https://doi.org/10.1371/journal.pcbi.1008174.s005

(TSV)

Acknowledgments

We would like to thank Connor Morgan-Lang, Julia Glinos, Kishori Konwar and Aria Hahn for lucid discussions on the function of the mlLGPR model and all members of the Hallam Lab for helpful comments along the way.

References

  1. Oltvai ZN, Barabási AL. Life’s complexity pyramid. Science. 2002;298(5594):763–764.
  2. Hahn AS, Konwar KM, Louca S, Hanson NW, Hallam SJ. The information science of microbial ecology. Current opinion in microbiology. 2016;31:209–216. pmid:27183115
  3. Toubiana D, Puzis R, Wen L, Sikron N, Kurmanbayeva A, Soltabayeva A, et al. Combined network analysis and machine learning allows the prediction of metabolic pathways from tomato metabolomics data. Communications Biology. 2019;2(1):214. pmid:31240252
  4. Ansorge WJ. Next-generation DNA sequencing techniques. New biotechnology. 2009;25(4):195–203. pmid:19429539
  5. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research. 2017;45(D1):D353–D361. pmid:27899662
  6. Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, et al. The reactome pathway knowledgebase. Nucleic acids research. 2017;46(D1):D649–D655.
  7. Caspi R, Billington R, Keseler IM, Kothari A, Krummenacker M, Midford PE, et al. The MetaCyc database of metabolic pathways and enzymes-a 2019 update. Nucleic acids research. 2019.
  8. Karp PD, Billington R, Caspi R, Fulcher CA, Latendresse M, Kothari A, et al. The BioCyc collection of microbial genomes and metabolic pathways. Briefings in Bioinformatics. 2017;20(4):1085–1093.
  9. Karp PD, Paley S, Romero P. The pathway tools software. Bioinformatics. 2002;18(suppl_1):S225–S232. pmid:12169551
  10. Karp PD, Latendresse M, Paley SM, Krummenacker M, Ong QD, Billington R, et al. Pathway Tools version 19.0 update: software for pathway/genome informatics and systems biology. Briefings in bioinformatics. 2016;17(5):877–890. pmid:26454094
  11. Karp PD, Ong WK, Paley S, Billington R, Caspi R, Fulcher C, et al. The EcoCyc Database. EcoSal Plus. 2018;8(1). pmid:30406744
  12. Caspi R, Billington R, Foerster H, Fulcher CA, Keseler I, Kothari A, et al. BioCyc: Online Resource for Genome and Metabolic Pathway Analysis. The FASEB Journal. 2016;30(1 Supplement):lb192–lb192.
  13. Konwar KM, Hanson NW, Pagé AP, Hallam SJ. MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC bioinformatics. 2013;14(1):202. pmid:23800136
  14. Hanson NW, Konwar KM, Hawley AK, Altman T, Karp PD, Hallam SJ. Metabolic pathways for the whole community. BMC genomics. 2014;15(1):1.
  15. Konwar KM, Hanson NW, Bhatia MP, Kim D, Wu SJ, Hahn AS, et al. MetaPathways v2.5: quantitative functional, taxonomic and usability improvements. Bioinformatics. 2015;31(20):3345–3347. pmid:26076725
  16. Hahn AS, Altman T, Konwar KM, Hanson NW, Kim D, Relman DA, et al. A geographically-diverse collection of 418 human gut microbiome pathway genome databases. Scientific Data. 2017;4.
  17. Ye Y, Doak TG. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol. 2009;5(8):e1000465. pmid:19680427
  18. Dale JM, Popescu L, Karp PD. Machine learning methods for metabolic pathway prediction. BMC bioinformatics. 2010;11(1):1.
  19. Carbonell P, Wong J, Swainston N, Takano E, Turner NJ, Scrutton NS, et al. Selenzyme: Enzyme selection tool for pathway design. Bioinformatics. 2018;34(12):2153–2154. pmid:29425325
  20. Delépine B, Duigou T, Carbonell P, Faulon JL. RetroPath2.0: A retrosynthesis workflow for metabolic engineers. Metabolic engineering. 2018;45:158–170. pmid:29233745
  21. Tabei Y, Yamanishi Y, Kotera M. Simultaneous prediction of enzyme orthologs from chemical transformation patterns for de novo metabolic pathway reconstruction. Bioinformatics. 2016;32(12):i278–i287. pmid:27307627
  22. Shafiei M, Dunn KA, Chipman H, Gu H, Bielawski JP. BiomeNet: A Bayesian model for inference of metabolic divergence among microbial communities. PLoS Comput Biol. 2014;10(11):e1003918. pmid:25412107
  23. Jiao D, Ye Y, Tang H. Probabilistic inference of biochemical reactions in microbial communities from metagenomic sequences. PLoS Comput Biol. 2013;9(3):e1002981. pmid:23555216
  24. McCutcheon JP, Von Dohlen CD. An interdependent metabolic patchwork in the nested symbiosis of mealybugs. Current Biology. 2011;21(16):1366–1372. pmid:21835622
  25. Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Dröge J, et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nature methods. 2017;14(11):1063.
  26. Stewart FJ, Sharma AK, Bryant JA, Eppley JM, DeLong EF. Community transcriptomics reveals universal patterns of protein sequence conservation in natural microbial communities. Genome biology. 2011;12(3):R26. pmid:21426537
  27. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
  28. Walt Svd, Colbert SC, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering. 2011;13(2):22–30.
  29. Hagberg A, Swart P, S Chult D. Exploring network structure, dynamics, and function using NetworkX. Los Alamos National Lab.(LANL), Los Alamos, NM (United States); 2008.
  30. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0–Fundamental Algorithms for Scientific Computing in Python. arXiv e-prints. 2019; p. arXiv:1907.10121.
  31. Bairoch A. The ENZYME database in 2000. Nucleic acids research. 2000;28(1):304–305. pmid:10592255
  32. Madjarov G, Kocev D, Gjorgjevikj D, Džeroski S. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition. 2012;45(9):3084–3104.
  33. Zhang ML, Zhou ZH. A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering. 2014;26(8):1819–1837.
  34. Wan S, Mak MW, Kung SY. mPLR-Loc: An adaptive decision multi-label classifier based on penalized logistic regression for protein subcellular localization prediction. Analytical biochemistry. 2015;473:14–27. pmid:25449328
  35. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320.
  36. Perkins S, Lacker K, Theiler J. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of machine learning research. 2003;3(Mar):1333–1356.
  37. Lee JD, Panageas I, Piliouras G, Simchowitz M, Jordan MI, Recht B. First-order Methods Almost Always Avoid Saddle Points. arXiv preprint arXiv:1710.07406. 2017.
  38. Bertsimas D, Tsitsiklis JN. Introduction to linear optimization. vol. 6. Athena Scientific Belmont, MA; 1997.
  39. Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. Machine Learning and Knowledge Discovery in Databases. 2011; p. 145–158.
  40. Wu XZ, Zhou ZH. A Unified View of Multi-Label Performance Measures. arXiv preprint arXiv:1609.00288. 2016.
  41. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference and prediction. 2nd ed. Springer; 2009.
  42. Sáez JA, Luengo J, Herrera F. Evaluating the Classifier Behavior with Noisy Data Considering Performance and Robustness. Neurocomputing. 2016;176(C):26–35.
  43. Bengio Y, Courville A, Vincent P. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013;35(8):1798–1828. pmid:23787338
  44. Shi C, Li Y, Zhang J, Sun Y, Philip SY. A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering. 2017;29(1):17–37.