Prediction of HIV drug resistance based on the 3D protein structure: Proposal of molecular field mapping

Ryosaku Ota; Kanako So; Masahiro Tsuda; Yuriko Higuchi; Fumiyoshi Yamashita

doi:10.1371/journal.pone.0255693

Abstract

A method for predicting HIV drug resistance by using genotypes would greatly assist in selecting appropriate combinations of antiviral drugs. Models reported previously have had two major problems: lack of information on the 3D protein structure and processing of incomplete sequencing data in the modeling procedure. We propose obtaining the 3D structural information of viral proteins by using homology modeling and molecular field mapping, instead of just their primary amino acid sequences. The molecular field potential parameters reflect the physicochemical characteristics associated with the 3D structure of the proteins. We also introduce the Bayesian conditional mutual information theory to estimate the probabilities of occurrence of all possible protein candidates from an incomplete sequencing sample. This approach allows for the effective use of uncertain information for the modeling process. We applied these data analysis techniques to the HIV-1 protease inhibitor dataset and developed drug resistance prediction models with reasonable performance.

Citation: Ota R, So K, Tsuda M, Higuchi Y, Yamashita F (2021) Prediction of HIV drug resistance based on the 3D protein structure: Proposal of molecular field mapping. PLoS ONE 16(8): e0255693. https://doi.org/10.1371/journal.pone.0255693

Editor: Jed N. Lampe, University of Colorado Denver Skaggs School of Pharmacy and Pharmaceutical Sciences, UNITED STATES

Received: March 5, 2021; Accepted: July 21, 2021; Published: August 4, 2021

Copyright: © 2021 Ota et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The original and preprocessed data used to build the models will be available at: https://osf.io/gqwt3/.

Funding: This work was financially supported in part by the Japan Society for the Promotion of Science in the form of grants awarded to RO (KAKENHI Grant Number JP20J15557) and MT (KAKENHI Grant Number JP18K06747). Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Drug-resistant viruses have a significant impact on the prognosis of HIV infections [1, 2]. Predicting drug resistance from their genotypes would allow the selection of appropriate drugs for efficient treatment. The development of such prediction models has been actively promoted [3–7], along with growing databases, such as the Stanford HIV Drug Resistance Database, which collects protein information and evaluates the resistance of drug-resistant viruses [4, 8].

These prediction models include classification and regression models built with various machine learning methods (support vector machine, deep learning, etc.). Without exception, the models are based on the primary sequence of a protein, which is converted into numerical descriptors by one-hot encoding or similar techniques and used as predictive variables. Steiner et al. [3] developed a classification model with a high accuracy of approximately 0.9 by using a deep learning technique and detected mutations responsible for drug resistance. Geno2pheno, developed by Beerenwinkel et al. [4], addressed the regression problem with a support vector regression (SVR) method and achieved a determination coefficient of approximately 0.7. These models are available on websites to easily predict the drug resistance profiles of viral variants.

However, the proposed models have two problems. First, they are based on the primary protein structure and therefore lack information on the 3D structure. Protein function is derived from the tertiary structure, as is commonly understood in structural bioinformatics [9], and it is conceivable that the 3D structure is closely related to drug resistance. Second, there has been a problem in processing incomplete sequencing samples during the modeling procedure. Incomplete sequencing provides multiple candidates at an amino acid position. Previous authors have differed on whether to disregard these samples [3] or to list all possible combinations for modeling use [6, 7]. The former might limit learning because of the smaller sample size, while the latter might overwhelm a model with uncertain data.

With these issues in mind, we propose a novel technique to model drug resistance of HIV. The structural information of HIV-1 variants was encoded as 3D molecular field parameters by conducting homology modeling, structural alignment, and molecular field mapping. The molecular field mapping approach is based on comparative molecular field analysis (CoMFA) [10]. CoMFA is a 3D-QSAR method developed by Cramer et al. that embeds the molecule of interest in a grid lattice and calculates the interaction (i.e., steric and electrostatic potentials) between the molecule and a probe atom placed at each grid point. The calculated molecular field parameters are used as features associated with the physical structure of each molecule. CoMFA has generally been used for QSAR analysis of low-molecular-weight drugs but, in this study, it was applied in the reverse direction to the homologous drug target (i.e., viral protein variants). Each possible candidate from incomplete gene sequencing was weighted in the modeling process by its probability of occurrence according to the Bayesian conditional mutual information theory. Partial least squares (PLS) [11], random forest (RF) [12], LightGBM (LGBM) [13], and support vector regression (SVR) [14] were employed to construct the prediction models. We visualized the structural regions most likely to be involved in drug resistance. The flow of the entire analysis is illustrated using several HIV-1 protease inhibitors.

Methods

Data collection

HIV-1 drug resistance data, along with the primary structure of HIV-1 protease variants, were obtained from Stanford University’s HIV Drug Resistance Database [8]. Resistance to HIV-1 protease inhibitors was represented by the fold change (FC) increase of the IC₅₀ compared to wild-type HIV. In addition, following other papers [3, 15], drug resistance was binarized with a threshold FC of 3.5, where viruses with a higher score were classified as resistant. The number of screening samples was 1059, 665, 1560, 1608, 1373, 1655, 1604, and 766 for atazanavir, darunavir, fosamprenavir, indinavir, lopinavir, nelfinavir, saquinavir, and tipranavir, respectively, of which the number of “complete” sequencing samples was 463, 264, 726, 759, 600, 781, 758, and 302, respectively.

Treatment of incomplete sequencing samples

The data collection contained incomplete sequencing samples, such that multiple candidates existed at certain amino acid positions. We considered the uncertainty of all possible combinations based on their probabilities of occurrence (Fig 1).

Fig 1. Scheme of treatment of incomplete sequencing samples.

All possible combinations were listed from incomplete sequencing samples, and their information was weighted by the conditional probability of occurrence. The conditional probability of occurrence of each amino acid at each position was determined based on amino acid sequences of complete sequencing samples.

https://doi.org/10.1371/journal.pone.0255693.g001

The conditional probability of occurrence was calculated for each of the multiple candidates at position i, given the amino acid types at m other positions. This idea is based on the concept of population genetics, in which viruses prone to drug resistance have a similar genetic sequence [16].

We consider the type of amino acid as a random variable and let X_i ∈ {A, R, N, ⋯, Y} be the i-th random variable in the amino acid sequence. Using the marginal probability distribution p(X_i) and the joint probability distribution p(X_i,X_j), the information entropy H(X_i), mutual information I(X_i,X_j), and normalized mutual information NMI(X_i,X_j) [17] are expressed as follows: (1) (2) (3)
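The three quantities can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names and the toy sequences are hypothetical, and the normalization used for NMI (division by the geometric mean of the two entropies) is an assumption, since several normalizations are in common use.

```python
import math
from collections import Counter


def entropy(column):
    """Shannon entropy H(X_i) of one alignment column (natural logarithm)."""
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())


def mutual_information(col_i, col_j):
    """Mutual information I(X_i, X_j) from empirical joint frequencies."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    return sum(
        (c / n) * math.log((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in count_ij.items()
    )


def nmi(col_i, col_j):
    """NMI, ASSUMING normalization by sqrt(H_i * H_j).

    Returns 0.0 when either column is constant (zero entropy).
    """
    h_i, h_j = entropy(col_i), entropy(col_j)
    if h_i == 0.0 or h_j == 0.0:
        return 0.0
    return mutual_information(col_i, col_j) / math.sqrt(h_i * h_j)


# Hypothetical "complete" sequences; each column is one sequence position.
sequences = ["LVI", "LVI", "MIV", "MIV", "LVV", "MII"]
columns = list(zip(*sequences))
print(nmi(columns[0], columns[1]))  # positions 0 and 1 co-vary perfectly
print(nmi(columns[0], columns[2]))  # positions 0 and 2 co-vary only partly
```

Applying `nmi` to every pair of columns in the set of complete sequences would give the pairwise matrix on which the significance test operates.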

The NMIs for all pairs were determined from a collection of complete sequencing samples. A permutation test [18] was performed to determine the statistical significance (P<0.05) for the NMIs. For each X_i, a set of statistically relevant positions in the sequence (X_m,i) was determined. X_m,i was limited to the top 10 if the length was greater than 10. Finally, the conditional probability p(X_i = x_i|X_m,i = x_m) was calculated.
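The expansion of an incomplete sample into weighted candidates can be sketched as follows. This is an illustrative reading, not the published implementation: `expand_candidates` and `cond_prob` are hypothetical names, the lookup table stands in for p(X_i = x_i|X_m,i = x_m), and combining ambiguous positions by multiplying their conditional probabilities and renormalizing is an assumption.

```python
from itertools import product


def expand_candidates(sample, cond_prob):
    """List every full sequence implied by an incomplete sample, with a weight.

    sample    : list of tuples; each tuple holds the amino acids observed at
                that position (length 1 when the call is unambiguous).
    cond_prob : callable (position, amino_acid) -> conditional probability,
                a hypothetical stand-in for p(X_i = x_i | X_m,i = x_m).
    """
    candidates = []
    for combo in product(*sample):
        weight = 1.0
        for pos, aa in enumerate(combo):
            if len(sample[pos]) > 1:  # only ambiguous positions add uncertainty
                weight *= cond_prob(pos, aa)
        candidates.append(("".join(combo), weight))
    total = sum(w for _, w in candidates)
    # ASSUMPTION: weights are renormalized to sum to 1 within one sample.
    return [(seq, w / total) for seq, w in candidates]


# Toy conditional probabilities for two ambiguous positions (made-up values).
table = {(1, "V"): 0.8, (1, "I"): 0.2, (3, "M"): 0.5, (3, "L"): 0.5}
sample = [("P",), ("V", "I"), ("Q",), ("M", "L")]

for seq, weight in expand_candidates(sample, lambda pos, aa: table[(pos, aa)]):
    print(seq, round(weight, 3))
```

Each candidate then enters model training with its weight, so that unlikely combinations contribute little to the loss.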

Calculation of molecular field potentials

Modeller 9.20 [19] was used for homology modeling [20]. The template candidates for each HIV-1 protease inhibitor are summarized in S1 Table. The crystal structures of HIV-1 protease proven to bind to the drugs were obtained from the PDB database and used as templates. In addition, MAFFT v7.427 [21] was used to align the primary sequence with the templates. For each protease variant, the template with the highest percent sequence identity (PID) was used for automated homology modeling using Modeller [22]. The atomic charge of each atom of the proteases was calculated using PDB2PQR version 2.1.1 [23].

Before the molecular field analysis, all homology-modeled protease variants were superimposed on one another [24–35]. The molecular field parameters (i.e., steric and electrostatic potentials) of each variant were calculated using a method similar to that described by Cramer et al. [10]. The protease structure was embedded in a lattice with 2 Å spacing. A probe atom with the van der Waals properties of sp³ carbon and a charge of +1 was placed at each grid point. The steric and electrostatic interaction energies between the probe atom and the protease variant, summed over all atoms of the variant, were calculated as the van der Waals [36] and Coulomb potentials [37], respectively.
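The grid calculation can be sketched as below. It is a simplified illustration under stated assumptions: the steric term uses a 12-6 Lennard-Jones form, the electrostatic term uses Coulomb's law with a dielectric constant of 1, and the probe radius and well depth are placeholder values rather than the parameters used in the study. The function names and the two-atom "structure" are hypothetical.

```python
import math

COULOMB_K = 332.06  # kcal*Å/(mol*e^2), conventional unit-conversion factor


def make_grid(lower, upper, spacing=2.0):
    """Regular lattice of (x, y, z) points between two corners, inclusive."""
    axes = []
    for lo, hi in zip(lower, upper):
        n = int(math.floor((hi - lo) / spacing)) + 1
        axes.append([lo + k * spacing for k in range(n)])
    return [(x, y, z) for x in axes[0] for y in axes[1] for z in axes[2]]


def probe_energies(point, atoms, probe_radius=1.7, probe_eps=0.1,
                   probe_charge=1.0, min_dist=1e-3):
    """Steric and electrostatic energy of a probe atom at one grid point.

    atoms : iterable of (x, y, z, vdw_radius, well_depth, partial_charge).
    Probe parameters are ILLUSTRATIVE placeholders for an sp3 carbon with +1
    charge. min_dist only prevents division by zero; very large energies near
    atoms are left for the later percentile cutoffs to handle.
    """
    steric = 0.0
    electrostatic = 0.0
    for x, y, z, radius, eps, charge in atoms:
        r = math.sqrt((point[0] - x) ** 2 + (point[1] - y) ** 2
                      + (point[2] - z) ** 2)
        r = max(r, min_dist)
        ratio = (radius + probe_radius) / r
        well = math.sqrt(eps * probe_eps)
        steric += well * (ratio ** 12 - 2.0 * ratio ** 6)  # Lennard-Jones 12-6
        electrostatic += COULOMB_K * probe_charge * charge / r  # Coulomb
    return steric, electrostatic


# Hypothetical two-atom "protein" embedded in a small lattice.
atoms = [(1.0, 1.0, 1.0, 1.7, 0.1, -0.3), (3.0, 2.0, 1.0, 1.55, 0.17, 0.2)]
grid = make_grid((0.0, 0.0, 0.0), (4.0, 4.0, 4.0), spacing=2.0)
field = [probe_energies(p, atoms) for p in grid]
print(len(grid), "grid points; each carries one steric and one electrostatic value")
```

Flattening `field` over all grid points yields one feature vector per protease variant, which is the representation passed on to feature selection.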

Construction of prediction models

Data were randomly divided into two parts: 80% for training and 20% for external evaluation. The feature extraction and hyperparameter selection described later were applied to the training data. A prediction model was developed using PLS, RF, LGBM, and SVR with molecular field parameters as predictors and drug resistance as the output. PLS was implemented in Python 3.7 (S1 File). LGBM models were constructed using LightGBM 2.3.0, while RF and SVR models were constructed using scikit-learn 0.23.1.

Feature selection

Feature selection was conducted using only complete sequencing samples. The first concern was that the calculation of the molecular field parameters diverged when the distance between the probe and target atoms was too small. In the first feature selection stage, the potential energies were therefore bounded by lower and upper limits set to the 5th percentile of the dataset (cutoff_5%) and the 95th percentile of the dataset (cutoff_95%), respectively. In addition, molecular field parameters with a heavily biased distribution of potential energies across the samples, defined as an absolute skewness greater than 2.5, were removed from modeling. Molecular field parameters with a standard deviation of 2 kcal/mol or less were also removed [38].
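The first-stage filter can be sketched as below. This is one possible reading, not the authors' code: the percentiles are taken over the pooled energies, clipping is applied before the skewness and standard deviation checks, and skewness is the population form m3 / m2^1.5. All three choices are assumptions, and the names and toy data are hypothetical.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    s = sorted(values)
    k = (len(s) - 1) * q / 100.0
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)


def skewness(values):
    """Population skewness m3 / m2**1.5 (one of several common definitions)."""
    mean = sum(values) / len(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return 0.0 if m2 == 0.0 else m3 / m2 ** 1.5


def first_stage_filter(columns, low, high, max_abs_skew=2.5, min_std=2.0):
    """Clip energies to [low, high], then drop skewed or near-constant features.

    columns : dict {grid_point_id: list of energies (kcal/mol) over samples}.
    """
    kept = {}
    for key, values in columns.items():
        clipped = [min(max(v, low), high) for v in values]
        if abs(skewness(clipped)) > max_abs_skew:
            continue  # heavily biased distribution
        if statistics.pstdev(clipped) <= min_std:
            continue  # too little variation to be informative
        kept[key] = clipped
    return kept


# Toy energies for three grid points over ten samples (made-up numbers).
columns = {
    "steady": [-8, -6, -4, -2, 0, 0, 2, 4, 6, 8],
    "flat": [0.1] * 5 + [0.2] * 5,
    "spiky": [0] * 9 + [1000],
}
pooled = [v for values in columns.values() for v in values]
low, high = percentile(pooled, 5), percentile(pooled, 95)
print(sorted(first_stage_filter(columns, low, high)))
```

In practice the cutoffs would be computed on the training data only and then reused unchanged for validation and external test data.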

In the second feature selection stage, the molecular field parameter dataset was subjected to machine learning-based recursive feature elimination [39]. The importance of each molecular field parameter as a modeling feature was estimated using a linear SVR model (scikit-learn). Feature selection for the steric and electrostatic molecular field parameters was performed independently. For the steric potential, the least important features were removed recursively two at a time until the number remaining reached one-half of that after the first stage. For the electrostatic potential, the features were removed recursively ten at a time until the number remaining reached one-eighth.

Machine learning models

The loss function and evaluation function (weighted determination coefficient) of the regression model are defined as follows: where y_obs,i and ȳ_obs are the observed log₁₀ FC of IC₅₀ for sample i and the average over all samples, respectively, and y_pred,ij and p_ij are the predicted log₁₀ FC and conditional probability for the j-th candidate of sample i, respectively.
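One form consistent with these symbol definitions is a probability-weighted squared error and the corresponding R². The sketch below is an assumption rather than a transcription of the published equations: it takes the loss as Σ_i Σ_j p_ij (y_obs,i − y_pred,ij)², divides by the unweighted total sum of squares Σ_i (y_obs,i − ȳ_obs)², and presumes that the candidate weights of each sample sum to 1. The function names are hypothetical.

```python
def weighted_sse(y_obs, y_pred, weights):
    """Sum over samples i and candidates j of p_ij * (y_obs_i - y_pred_ij)**2."""
    return sum(
        p * (y - yp) ** 2
        for y, preds, ps in zip(y_obs, y_pred, weights)
        for yp, p in zip(preds, ps)
    )


def weighted_r2(y_obs, y_pred, weights):
    """Weighted determination coefficient: 1 - weighted SSE / total sum of squares.

    ASSUMES the weights of each sample sum to 1, so that every sample
    contributes equally regardless of how many candidates it has.
    """
    mean = sum(y_obs) / len(y_obs)
    sst = sum((y - mean) ** 2 for y in y_obs)
    return 1.0 - weighted_sse(y_obs, y_pred, weights) / sst


# Three samples; the second is incomplete and has two weighted candidates.
y_obs = [0.0, 1.0, 2.0]                 # observed log10 FC (made-up values)
y_pred = [[0.2], [0.9, 1.4], [1.7]]     # one prediction per candidate
weights = [[1.0], [0.75, 0.25], [1.0]]  # conditional probabilities p_ij

print(weighted_sse(y_obs, y_pred, weights))
print(weighted_r2(y_obs, y_pred, weights))
```

With complete samples only, every weight is 1 and the expression reduces to the ordinary determination coefficient.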

Hyperparameter optimization

For each machine learning model, the corresponding hyperparameters were optimized using Optuna version 2.0.0 [40], which implements the tree-structured Parzen estimator (TPE) algorithm. The samples were divided into three groups for 3-fold cross-validation [41]. One group was used as validation data, while the remaining two were used as training data. The training data were preprocessed as described in the previous section and then normalized. The validation data were processed using the preprocessing parameters derived from the training data. Next, each machine learning model was trained with the training data to minimize the loss function and, finally, evaluated with the validation data in terms of predictive R². The cross-validation was repeated three times, and the average of the predictive R² was used for the evaluation of hyperparameters. The hyperparameters were updated 30 times in Optuna.

Final evaluation

As mentioned above, 80% of the total data was used for training, and the remaining 20% was used for external evaluation. After preprocessing and normalizing the training data, each regression model was built using the hyperparameters determined during the cross-validation step. Regression models were tested using external data and evaluated by weighted determination coefficients. Setting the threshold of FC at 3.5, the goodness of the classification was evaluated with accuracy, precision, true-positive rate (TPR), true-negative rate (TNR), false-positive rate (FPR), false-negative rate (FNR), area under the ROC curve (AUC), and F1 score [42, 43].
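Turning regression outputs into classification metrics at the FC threshold of 3.5 can be sketched as follows. The function name is hypothetical, AUC is omitted for brevity, and treating values strictly above the threshold as resistant follows the binarization described in the data collection section.

```python
import math

THRESHOLD_LOG10_FC = math.log10(3.5)


def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(log_fc_obs, log_fc_pred, threshold=THRESHOLD_LOG10_FC):
    """Binarize observed and predicted log10 FC and compute confusion metrics."""
    obs = [v > threshold for v in log_fc_obs]
    pred = [v > threshold for v in log_fc_pred]
    tp = sum(o and p for o, p in zip(obs, pred))
    tn = sum((not o) and (not p) for o, p in zip(obs, pred))
    fp = sum((not o) and p for o, p in zip(obs, pred))
    fn = sum(o and (not p) for o, p in zip(obs, pred))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "TPR": _ratio(tp, tp + fn),
        "TNR": _ratio(tn, tn + fp),
        "FPR": _ratio(fp, fp + tn),
        "FNR": _ratio(fn, fn + tp),
        "F1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up fold changes for four viral variants.
observed_fc = [1.0, 2.0, 10.0, 50.0]
predicted_fc = [1.0, 5.0, 20.0, 2.0]
metrics = classification_metrics(
    [math.log10(v) for v in observed_fc],
    [math.log10(v) for v in predicted_fc],
)
print(metrics)
```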

Visualization of structural importance in drug resistance acquisition

Molecular field-based analysis allows visualization of the sites involved in the acquisition of drug resistance. It should be noted that spatially neighboring grid points show relatively similar potential energies because the molecular field is continuous. The ability of PLS to construct models that account for the collinearity of predictor variables is therefore useful for detecting regions of high susceptibility to drug resistance. The standardized partial regression coefficients obtained with PLS analysis show the degree of importance for drug resistance acquisition at each spatial coordinate. A contour map was created in the 3D molecular field at set thresholds (the 1st and 99th percentiles of the coefficients). First, the grid points remaining after feature extraction were used for PLS modeling of drug resistance, and the standardized partial regression coefficients were obtained. Next, for all grid points connected by the Delaunay algorithm, the line segments of the mesh were searched for interior points at which the coefficient equaled the threshold. Finally, a contour map was created by connecting points less than 5 Å apart into clusters. The figure was generated using Chimera version 1.15 [44].

The degree of agreement of the contour map with known major drug resistance-related amino acid positions (30, 32, 47, 48, 50, 54, 76, 82, 84, 88) [8] was assessed. First, the distance to the nearest contour plane was calculated for all α-carbons of the protease, and the cumulative probability distribution of these distances was obtained. Then, the joint probability for the major drug resistance positions was estimated and compared with the probability for a random selection of the same number of amino acid positions. One hundred thousand random permutations were generated to estimate the ranking of the drug resistance positions.
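The ranking procedure can be sketched as below. This is an illustrative interpretation: the joint probability of a set of positions is taken as the product of the empirical cumulative probabilities of their distances, which is an assumption about the published calculation. The function names and the distance table are hypothetical.

```python
import random


def empirical_cdf(all_distances):
    """Return a function d -> fraction of residues with distance <= d."""
    s = sorted(all_distances)
    n = len(s)
    return lambda d: sum(1 for v in s if v <= d) / n


def joint_score(positions, distance, cdf):
    """Product of cumulative probabilities; smaller means closer to contours."""
    score = 1.0
    for pos in sorted(positions):  # fixed order keeps the product reproducible
        score *= cdf(distance[pos])
    return score


def rank_fraction(known, distance, n_perm=100_000, seed=0):
    """Fraction of random position sets that score strictly better than `known`."""
    rng = random.Random(seed)
    cdf = empirical_cdf(list(distance.values()))
    observed = joint_score(known, distance, cdf)
    residues = sorted(distance)
    better = sum(
        joint_score(rng.sample(residues, len(known)), distance, cdf) < observed
        for _ in range(n_perm)
    )
    return better / n_perm


# Hypothetical distances (Å) from each residue's α-carbon to the nearest contour.
distance = {residue: float(residue) for residue in range(1, 21)}
print(rank_fraction([1, 2, 3], distance, n_perm=2000))
print(rank_fraction([18, 19, 20], distance, n_perm=2000))
```

A small returned fraction indicates that few random selections lie as close to the contours as the known resistance positions.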

Results and discussion

Sequence data collection and processing

Table 1 summarizes the sample sizes of the eight HIV-1 protease inhibitors listed in the database. The sample size varied among the drugs since they were not tested for all viral variants. A considerable number of incomplete sequencing samples were included in the database. Previous researchers have expanded them into all possible combinations of primary sequences [6, 7]. The incompleteness of gene sequencing might be associated with a mixture of viral variants or the detection limit of the Sanger method [45]. It should be noted that a simple listing of combinations would result in unreliable and erroneous information. Indeed, the number of all possible combinations in incomplete sequencing samples was much larger than that of complete sequencing samples. Being aware of the necessity of data weighting, we introduced the conditional probability for each sequence based on the concept of population genetics [16]. Using complete sequencing samples, we calculated the NMI for each pair of different positions in the sequence (Fig 2), performed a permutation test to determine the statistical significance of the NMIs, determined a set of up to 10 statistically relevant positions for each position, and finally obtained conditional probabilities for each occurrence. We weighted each possible combination with its conditional probabilities and estimated a substantial number of samples for each drug (Table 1). As a result, we roughly doubled the information compared to the complete sequencing samples.

Fig 2. Normalized mutual information on co-occurrence of amino acids between any two positions.

(A) Gray-scale image matrix of normalized mutual information (NMI). (B) Statistical significance of NMI determined by permutation test at P<0.05 (black).

https://doi.org/10.1371/journal.pone.0255693.g002

Table 1. Numbers of sequencing samples listed in the database for each HIV-1 protease inhibitor.

https://doi.org/10.1371/journal.pone.0255693.t001

Calculation of molecular fields and feature selection

The 3D structure of each HIV-1 protease variant was predicted using homology modeling. Since several crystal structures of drug-protease complexes are available for each drug, the homology modeling template of each protease variant was chosen according to the primary sequence similarity. After homology modeling, all variants were superimposed without reforming their 3D structure onto the most common structural template (the bold symbols in S1 Table). The structures were subjected to calculations of the steric and electrostatic molecular fields in the grid lattice. The molecular field parameters represent the structural similarity/dissimilarity of protease variants, which allow comparison from a physicochemical perspective.

The grid lattice covering the protease variants measured 80 Å × 70 Å × 56 Å. With the grid interval set to 2.0 Å, the steric and electrostatic molecular field parameters together totaled 60,000. Feature selection was performed to increase the efficiency of machine learning, reducing the number of parameters to approximately 4,000 (S2 Table). Fig 3 shows the coordinates of the extracted features. The coordinates of the remaining molecular field parameters largely reflected the shape of the proteases, suggesting that the features were reasonably extracted.

Fig 3. Selected grid points of steric (left) and electrostatic (right) molecular field parameters in the analysis of drug resistance for each HIV-1 protease inhibitor.

The grid points were selected by preprocessing and SVR feature selection.

https://doi.org/10.1371/journal.pone.0255693.g003

Model performance

The dataset was divided into training and external test datasets for each drug (S3 Table). PLS, LGBM, RF, and SVR, which are widely used and computationally inexpensive algorithms, were selected to build the prediction models. The ranges of hyperparameters in model optimization are given in S4 Table, and the results are summarized in S5 Table. In PLS, the number of principal components was determined by 3-fold cross-validation (S5 Table). The weighted determination coefficients and 3-fold cross-validated predictive determination coefficients for the training data are summarized in S6 and S7 Tables, respectively. In addition, the predictive determination coefficients for the external test dataset, which was never used for training, are summarized in Table 2. The scatter plots of observed and predicted log₁₀ FC are shown in S1–S8 Figs. The four models did not significantly differ in their prediction accuracy, but LGBM appeared to provide better prediction accuracy in all cases. A comparison of performance between the present and previous models is informative, even though different or differently preprocessed datasets were analyzed. The present regression model performed better than Geno2pheno (predictive R² = 0.698) [4]. Geno2pheno was developed based on a different, smaller dataset than that of the Stanford HIV Drug Resistance Database. Unfortunately, the present model was slightly inferior to the model proposed by Shen et al. (predictive R² = 0.883) [7]. Both models used the Stanford HIV Drug Resistance Database, listed all possible combinations from each incomplete sequence sample, and gave the same answer label for each. The difference was that we weighted the data according to their probability of occurrence, resulting in different predictive performances.

Table 2. Weighted determination coefficients for prediction in external test dataset (R²).

https://doi.org/10.1371/journal.pone.0255693.t002

The effectiveness of the weighting of the data based on their probability of occurrence was checked using the same training and external test datasets. Two additional types of prediction models were developed by treating the present training data set in a manner equivalent to the models of Geno2pheno and Shen et al.: that is, the former used only the complete sequencing data, and the latter used all combinations without considering their probability of occurrence. The prediction performance was evaluated on the complete sequencing data of the external test dataset. In any case, the best approach was to employ all combinations, considering the probability of occurrence (S8 Table).

Classification models have also been proposed to identify whether viral variants are drug-resistant [3]. For comparison, we used our regression model for classification purposes by defining an FC of 3.5 as a threshold. Table 3 summarizes the results of the external validation assessments. Multiple metrics were used to evaluate the predictive classification performance, which included metrics suitable for both balanced (e.g., accuracy, precision, true positive ratio) and imbalanced (e.g., F1) data. As with the regression purpose, the LGBM model was slightly better than or comparable to the other three machine learning models (Table 3, S9 Table). The accuracy of approximately 0.9 achieved by the models was as high as that of the previously reported classification model [3].

Table 3. Goodness of classification by the LGBM model^a).

https://doi.org/10.1371/journal.pone.0255693.t003

The prediction of drug resistance to tipranavir appeared to be less accurate than that for other drugs. The poor prediction accuracy of tipranavir might be associated partly with an imbalance in number between drug-susceptible and drug-resistant variants (1.0 vs 0.19). As machine learning is a data-dependent analysis, this could simply be a result of chance. However, it might be interesting to note that tipranavir has a slightly different mode of HIV-1 protease inhibition. HIV-1 protease forms a dimer with a catalytic site between the two units [46]. Although most HIV-1 protease inhibitors bind to the catalytic site, tipranavir can also inhibit the formation of the dimer itself [47]. Unfortunately, the present model could not predict the latter activity. Darunavir also possesses the same mechanism as tipranavir [47, 48], but its inhibitory effect was well predicted by the current model. Considering that the primary mechanism of the two drugs is yet to be determined, the limitations of the current model need to be considered.

Structural factor analysis of drug resistance

The drug resistance of the virus can be attributed to structural changes in viral proteins associated with their mutations. Characterization of the structural effects of protein mutations would be of great use in drug discovery [47]. The proposed models are based on the CoMFA approach, which allows 3D mapping of the degree of importance of the molecular field parameters in the target property/activity [10, 49]. CoMFA has generally been used to analyze and predict the target binding or bioactivity of a series of small-molecule drugs. The implemented PLS algorithm detects the link between the substituents around the core scaffold and the target property. Our model reversed regular CoMFA models, where structurally homologous protein variants that recognize the same drug were superimposed and analyzed. We expected to quantitatively evaluate how much drug resistance was affected by changes in molecular fields associated with the protein mutation.

A contour map generated by PLS analysis of molecular fields allows the identification of structural impacts on drug resistance acquisition (Fig 4). Fig 4 also shows the structure of the wild-type protease complexed with lopinavir. Green contours indicate regions in which the steric interaction of the protease increases drug resistance. In contrast, yellow contours indicate regions where the steric interaction lowers drug resistance (i.e., drug resistance increases as the steric interaction decreases). The yellow contours were located near the drug-protein binding interface. The molecular dynamics simulations of Wang et al. [50] revealed that the enlargement of the binding pocket by amino acid mutations weakens the binding of inhibitors. On the other hand, the green contours are located at the peripheries of the protein, where drug-protein interactions are unlikely to occur. We assume that these regions could be a counterpart to the yellow regions. This means that enlargement of the binding pocket might enhance steric interactions at the periphery. The standardized partial regression coefficients of the electrostatic molecular field parameters were much smaller than those of the steric molecular field parameters (data not shown). This suggests that changes in electrostatic interactions due to mutations might be less involved in drug resistance acquisition.

Fig 4. Contour map of steric effects in drug resistance acquisition.

Contours were generated based on PLS standardized partial regression coefficients. Yellow and green contours indicate the 1st and 99th percentiles of standardized partial regression coefficients, respectively. Steric interaction of the protease with yellow regions negatively affects drug resistance acquisition, whereas green regions show a positive effect. Dimerized wild-type HIV-1 protease (gray ribbon) and lopinavir (wireframe) are shown in the same figure. Pink indicates amino acids involved in drug resistance.

https://doi.org/10.1371/journal.pone.0255693.g004

The pink regions of the main chain indicate the positions of amino acids in which mutation confers drug resistance. To investigate how accurately the contour map explains the known drug resistance positions, the distance between the contours and the drug resistance positions was examined. After estimating the probability distribution of the distance between the α-carbon and the nearest contour, the joint probability for the set of major drug resistance positions was calculated and compared to that for a random selection of the same number of amino acid positions. In terms of proximity to the contour planes, the set of major drug resistance positions ranked in the top 0.316% of 100,000 random permutations. This indicates that known drug resistance-associated amino acids are positioned in the vicinity of the contours. Thus, the present 3D-based analysis reasonably represents the structure-activity relationship in the acquisition of drug resistance by viral mutation.

Conclusion

In this study, we successfully developed a predictive model for HIV drug resistance with reasonable prediction accuracy based on the 3D structure of HIV protease variants. The proposed method can also be applied to predict the resistance to other anti-HIV agents, such as reverse transcriptase inhibitors and integrase inhibitors. It should be noted that the steps of homology modeling and machine learning processes are computationally intensive. Considering that reverse transcriptase and integrase are more than twice as large as protease, it would be difficult to apply the current approach in a limited computational environment.

Supporting information

S1 Table. Template proteins for homology modeling of HIV protease variants.

There are several available complex structures with HIV protease variants for each drug. For each protease variant to be subjected to homology modeling, one of the protein structures listed was selected as a template according to the similarity of primary amino acid sequences.

https://doi.org/10.1371/journal.pone.0255693.s001

(DOCX)

S2 Table. Number of molecular field energies selected as a feature of machine learning.

a) Grid sizes were set to embed an overlaid aggregate of all HIV protease variants. b) Numbers of steric and electrostatic potential energies.

https://doi.org/10.1371/journal.pone.0255693.s002

(DOCX)

S3 Table. Sample size of training and external test datasets for each drug.

https://doi.org/10.1371/journal.pone.0255693.s003

(DOCX)

S4 Table. Ranges of hyperparameters in model optimization.

https://doi.org/10.1371/journal.pone.0255693.s004

(DOCX)

S5 Table. Optimized hyperparameters of LightGBM (LGBM), Random Forest Regression (RF), Support Vector Regression (SVR), and Partial Least Squares (PLS) models for each drug.

a) Abbreviations: ATV, atazanavir; DRV, darunavir; FPV, fosamprenavir; IDV, indinavir; LPV, lopinavir; NFV, nelfinavir; SQV, saquinavir; TPV, tipranavir.

https://doi.org/10.1371/journal.pone.0255693.s005

(DOCX)

S6 Table. Weighted determination coefficients for prediction in training dataset (R²).

https://doi.org/10.1371/journal.pone.0255693.s006

(DOCX)

S7 Table. Weighted determination coefficients for prediction of 3-fold cross-validation^a).

a) Data represents mean ± SEM.

https://doi.org/10.1371/journal.pone.0255693.s007

(DOCX)

S8 Table. Effect of weighting of training data on external prediction in Lopinavir.

a) A: All combinations, considering their probability of occurrence, B: All combinations without considering their probability of occurrence, C: Only the complete sequencing data. b) Mean of three runs.

https://doi.org/10.1371/journal.pone.0255693.s008

(DOCX)

S9 Table. Goodness of classification of Random Forest Regression, Support Vector Regression, and Partial Least Squares models^a).

a) Goodness of classification was evaluated upon defining the threshold as a fold change of 3.5. b) Abbreviations: TPR, true positive ratio; TNR, true negative ratio; FPR, false positive ratio; FNR, false negative ratio; AUC, area under the ROC curve.

https://doi.org/10.1371/journal.pone.0255693.s009

(DOCX)

S1 Fig. Scatter plot of observed and predicted drug resistance in atazanavir.

https://doi.org/10.1371/journal.pone.0255693.s010

(TIF)

S2 Fig. Scatter plot of observed and predicted drug resistance in darunavir.

https://doi.org/10.1371/journal.pone.0255693.s011

(TIF)

S3 Fig. Scatter plot of observed and predicted drug resistance in fosamprenavir.

https://doi.org/10.1371/journal.pone.0255693.s012

(TIF)

S4 Fig. Scatter plot of observed and predicted drug resistance in indinavir.

https://doi.org/10.1371/journal.pone.0255693.s013

(TIF)

S5 Fig. Scatter plot of observed and predicted drug resistance in lopinavir.

https://doi.org/10.1371/journal.pone.0255693.s014

(TIF)

S6 Fig. Scatter plot of observed and predicted drug resistance in nelfinavir.

https://doi.org/10.1371/journal.pone.0255693.s015

(TIF)

S7 Fig. Scatter plot of observed and predicted drug resistance in saquinavir.

https://doi.org/10.1371/journal.pone.0255693.s016

(TIF)

S8 Fig. Scatter plot of observed and predicted drug resistance in tipranavir.

https://doi.org/10.1371/journal.pone.0255693.s017

(TIF)

S1 File. Source code of Partial Least Squares.

https://doi.org/10.1371/journal.pone.0255693.s018

(TXT)

References

1. Kuritzkes DR. Drug resistance in HIV-1. Curr Opin Virol. 2011;1(6):582–9. pmid:22162985
- View Article
- PubMed/NCBI
- Google Scholar
2. Günthard HF, Calvez V, Paredes R, Pillay D, Shafer RW, Wensing AM, et al. Human Immunodeficiency Virus Drug Resistance: 2018 Recommendations of the International Antiviral Society–USA Panel. Clin Infect Dis. 2019;68(2):177–87. pmid:30052811
- View Article
- PubMed/NCBI
- Google Scholar
3. Steiner MC, Gibson KM, Crandall KA. Drug Resistance Prediction Using Deep Learning Techniques on HIV-1 Sequence Data. Viruses. 2020;12(5):560. pmid:32438586
- View Article
- PubMed/NCBI
- Google Scholar
4. Beerenwinkel N, Däumer M, Oette M, Korn K, Hoffmann D, Kaiser R, et al. Geno2pheno: estimating phenotypic drug resistance from HIV-1 genotypes. Nucleic Acids Res. 2003;31(13):3850–5. pmid:12824435
- View Article
- PubMed/NCBI
- Google Scholar
5. Tarasova O, Biziukova N, Filimonov D, Poroikov V. A Computational Approach for the Prediction of HIV Resistance Based on Amino Acid and Nucleotide Descriptors. Molecules. 2018;23(11):2751. pmid:30355996
- View Article
- PubMed/NCBI
- Google Scholar
6. Yu X, Weber IT, Harrison RW. Prediction of HIV drug resistance from genotype with encoded three-dimensional protein structure. Bmc Genomics. 2014;15(Suppl 5):S1. pmid:25081370
- View Article
- PubMed/NCBI
- Google Scholar
7. Shen C, Yu X, Harrison RW, Weber IT. Automated prediction of HIV drug resistance from genotype data. Bmc Bioinformatics. 2016;17(Suppl 8):278. pmid:27586700
- View Article
- PubMed/NCBI
- Google Scholar
8. Rhee S-Y, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res. 2003;31(1):298–303. pmid:12520007
- View Article
- PubMed/NCBI
- Google Scholar
9. Dorn M, Silva MB e, Buriol LS, Lamb LC. Three-dimensional protein structure prediction: Methods and computational strategies. Comput Biol Chem. 2014;53:251–76.
- View Article
- Google Scholar
10. Cramer RD, Patterson DE, Bunce JD. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc. 1988;110(18):5959–67. pmid:22148765
- View Article
- PubMed/NCBI
- Google Scholar
11. Lindgren F, Geladi P, Wold S. The kernel algorithm for PLS. J Chemometr. 1993;7(1):45–59.
- View Article
- Google Scholar
12. Breiman L. Random Forests. Mach Learn. 2001;45(1):5–32.
- View Article
- Google Scholar
13. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems. 2017;3147–55.
- View Article
- Google Scholar
14. Smola AJ, Schölkopf B. A tutorial on support vector regression. Stat Comput. 2004;14(3):199–222.
- View Article
- Google Scholar
15. Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D, et al. Diversity and complexity of HIV-1 drug resistance: A bioinformatics approach to predicting phenotype from genotype. Proc National Acad Sci. 2002;99(12):8271–6. pmid:12060770
- View Article
- PubMed/NCBI
- Google Scholar
16. Rogers AR, Huff C. Linkage Disequilibrium Between Loci With Unknown Phase. Genetics. 2009;182(3):839–44. pmid:19433632
- View Article
- PubMed/NCBI
- Google Scholar
17. Strehl A, Ghosh J. Cluster Ensembles–A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research. 2002; pmid:21479123
- View Article
- PubMed/NCBI
- Google Scholar
18. Pethel SD, Hahs DW. Exact Test of Independence Using Mutual Information. Entropy. 2014;16(5):2839–49.
- View Article
- Google Scholar
19. Fiser A, Šali A. Modeller: Generation and Refinement of Homology-Based Protein Structure Models. Methods Enzymol. 2003;374:461–91. pmid:14696385
- View Article
- PubMed/NCBI
- Google Scholar
20. Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Šali A. COMPARATIVE PROTEIN STRUCTURE MODELING OF GENES AND GENOMES. Annu Rev Bioph Biom. 2000;29(1):291–325. pmid:10940251
- View Article
- PubMed/NCBI
- Google Scholar
21. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013;30(4):772–80. pmid:23329690
- View Article
- PubMed/NCBI
- Google Scholar
22. May ACW. Percent Sequence Identity The Need to Be Explicit. Structure. 2004;12(5):737–8. pmid:15130466
- View Article
- PubMed/NCBI
- Google Scholar
23. Dolinsky TJ, Nielsen JE, McCammon JA, Baker NA. PDB2PQR: an automated pipeline for the setup of Poisson–Boltzmann electrostatics calculations. Nucleic Acids Res. 2004;32(suppl_2):W665–7. pmid:15215472
- View Article
- PubMed/NCBI
- Google Scholar
24. Sun J, Yu EY, Yang Y, Confer LA, Sun SH, Wan K, et al. Stn1–Ten1 is an Rpa2–Rpa3-like complex at telomeres. Gene Dev. 2009;23(24):2900–14. pmid:20008938
- View Article
- PubMed/NCBI
- Google Scholar
25. King NM, Prabu-Jeyabalan M, Bandaranayake RM, Nalam MNL, Nalivaika EA, Özen A, et al. Extreme Entropy–Enthalpy Compensation in a Drug-Resistant Variant of HIV-1 Protease. Acs Chem Biol. 2012;7(9):1536–46. pmid:22712830
- View Article
- PubMed/NCBI
- Google Scholar
26. Ishima R, Gong Q, Tie Y, Weber IT, Louis JM. Highly conserved glycine 86 and arginine 87 residues contribute differently to the structure and activity of the mature HIV‐1 protease. Proteins Struct Funct Bioinform. 2010;78(4):1015–25. pmid:19899162
- View Article
- PubMed/NCBI
- Google Scholar
27. Shen C, Wang Y, Kovalevsky AY, Harrison RW, Weber IT. Amprenavir complexes with HIV‐1 protease and its drug‐resistant mutants altering hydrophobic clusters. Febs J. 2010;277(18):3699–714. pmid:20695887
- View Article
- PubMed/NCBI
- Google Scholar
28. Mahalingam B, Wang Y, Boross PI, Tozser J, Louis JM, Harrison RW, et al. Crystal structures of HIV protease V82A and L90M mutants reveal changes in the indinavir‐binding site. Eur J Biochem. 2004;271(8):1516–24. pmid:15066177
- View Article
- PubMed/NCBI
- Google Scholar
29. Chen Z, Li Y, Chen E, Hall DL, Darke PL, Culberson C, et al. Crystal structure at 1.9-A resolution of human immunodeficiency virus (HIV) II protease complexed with L-735,524, an orally bioavailable inhibitor of the HIV proteases. J Biological Chem. 1994;269(42):26344–8. pmid:7929352
- View Article
- PubMed/NCBI
- Google Scholar
30. Clemente JC, Moose RE, Hemrajani R, Whitford LRS, Govindasamy L, Reutzel R, et al. Comparing the Accumulation of Active- and Nonactive-Site Mutations in the HIV-1 Protease †. Biochemistry-us. 2004;43(38):12141–51.
- View Article
- Google Scholar
31. Wong-Sam A, Wang Y-F, Zhang Y, Ghosh AK, Harrison RW, Weber IT. Drug Resistance Mutation L76V Alters Nonpolar Interactions at the Flap–Core Interface of HIV-1 Protease. Acs Omega. 2018;3(9):12132–40. pmid:30288468
- View Article
- PubMed/NCBI
- Google Scholar
32. Stoll V, Qin W, Stewart KD, Jakob C, Park C, Walter K, et al. X-ray crystallographic structure of ABT-378 (Lopinavir) bound to HIV-1 protease. Bioorgan Med Chem. 2002;10(8):2803–6. pmid:12057670
- View Article
- PubMed/NCBI
- Google Scholar
33. Kožíšek M, Bray J, Řezáčová P, Šašková K, Brynda J, Pokorná J, et al. Molecular Analysis of the HIV-1 Resistance Development: Enzymatic Activities, Crystal Structures, and Thermodynamics of Nelfinavir-resistant HIV Protease Mutants. J Mol Biol. 2007;374(4):1005–16. pmid:17977555
- View Article
- PubMed/NCBI
- Google Scholar
34. Serganov A, Huang L, Patel DJ. Structural insights into amino acid binding and gene control by a lysine riboswitch. Nature. 2008;455(7217):1263–7. pmid:18784651
- View Article
- PubMed/NCBI
- Google Scholar
35. Muzammil S, Armstrong AA, Kang LW, Jakalian A, Bonneau PR, Schmelmer V, et al. Unique Thermodynamic Response of Tipranavir to Human Immunodeficiency Virus Type 1 Protease Drug Resistance Mutations▿. J Virol. 2007;81(10):5144–54. pmid:17360759
- View Article
- PubMed/NCBI
- Google Scholar
36. Vinter JG, Davis A, Saunders MR. Strategic approaches to drug design. I. An integrated software framework for molecular modelling. J Comput Aid Mol Des. 1987;1(1):31–51. pmid:3505586
- View Article
- PubMed/NCBI
- Google Scholar
37. Wang X-S. Derivation of Coulomb’s Law of Forces Between Static Electric Charges Based on Spherical Source and Sink Model of Particles. Arxiv. 2006;
- View Article
- Google Scholar
38. Awasthi M, Singh S, Pandey VP, Dwivedi UN. CoMFA and CoMSIA-based designing of resveratrol derivatives as amyloid-beta aggregation inhibitors against Alzheimer’s disease. Med Chem Res. 2018;27(4):1167–85.
- View Article
- Google Scholar
39. Guyon I, Weston J, Barnhill S, Vapnik V. Gene Selection for Cancer Classification using Support Vector Machines. Mach Learn. 2002;46(1–3):389–422.
- View Article
- Google Scholar
40. Teredesai A, Kumar V, Li Y, Rosales R, Terzi E, Karypis G, et al. Optuna: A Next-generation Hyperparameter Optimization Framework. Applied Data Science Track Paper. 2019;2623–31.
- View Article
- Google Scholar
41. Kohavi R. A Study of Cross Validation and Bootstrap for Accuracy Estimation and Model Selection. Proceedings of the 14th international joint conference on Artificial intelligence. 1995;
42. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74.
- View Article
- Google Scholar
43. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. Bmc Genomics. 2020;21(1):6. pmid:31898477
- View Article
- PubMed/NCBI
- Google Scholar
44. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, et al. UCSF Chimera—A visualization system for exploratory research and analysis. J Comput Chem. 2004;25(13):1605–12. pmid:15264254
- View Article
- PubMed/NCBI
- Google Scholar
45. França LTC, Carrilho E, Kist TBL. A review of DNA sequencing techniques. Q Rev Biophys. 2002;35(2):169–200. pmid:12197303
- View Article
- PubMed/NCBI
- Google Scholar
46. Todd MJ, Semo N, Freire E. The structural stability of the HIV-1 protease11Edited by P. E. Wright. J Mol Biol. 1998;283(2):475–88. pmid:9769219
- View Article
- PubMed/NCBI
- Google Scholar
47. Koh Y, Aoki M, Danish ML, Aoki-Ogata H, Amano M, Das D, et al. Loss of Protease Dimerization Inhibition Activity of Darunavir Is Associated with the Acquisition of Resistance to Darunavir by HIV-1. J Virol. 2011;85(19):10079–89. pmid:21813613
- View Article
- PubMed/NCBI
- Google Scholar
48. Hayashi H, Takamune N, Nirasawa T, Aoki M, Morishita Y, Das D, et al. Dimerization of HIV-1 protease occurs through two steps relating to the mechanism of protease dimerization inhibition by darunavir. Proc National Acad Sci. 2014;111(33):12234–9. pmid:25092296
- View Article
- PubMed/NCBI
- Google Scholar
49. Kim KH, Greco G, Novellino E. A critical review of recent CoMFA applications. Perspect Drug Discov. 1998;12–14(0):257–315.
- View Article
- Google Scholar
50. Wang R-G, Zhang H-X, Zheng Q-C. Revealing the binding and drug resistance mechanism of amprenavir, indinavir, ritonavir, and nelfinavir complexed with HIV-1 protease due to double mutations G48T/L89M by molecular dynamics simulations and free energy analyses. Phys Chem Chem Phys. 2020;22(8):4464–80. pmid:32057044
- View Article
- PubMed/NCBI
- Google Scholar
