Average Information Content Maximization—A New Approach for Fingerprint Hybridization and Reduction

Marek Śmieja; Dawid Warszycki

doi:10.1371/journal.pone.0146666

Abstract

Fingerprints, bit representations of compound chemical structure, have been widely used in cheminformatics for many years. Although fingerprints with the highest resolution display satisfactory performance in virtual screening campaigns, the presence of a relatively high number of irrelevant bits introduces noise into data and makes their application more time-consuming. In this study, we present a new method of hybrid reduced fingerprint construction, the Average Information Content Maximization algorithm (AIC-Max algorithm), which selects the most informative bits from a collection of fingerprints. This methodology, applied to the ligands of five cognate serotonin receptors (5-HT_2A, 5-HT_2B, 5-HT_2C, 5-HT_5A, 5-HT₆), proved that 100 bits selected from four non-hashed fingerprints reflect almost all structural information required for a successful in silico discrimination test. A classification experiment indicated that a reduced representation is able to achieve even slightly better performance than the state-of-the-art 10-times-longer fingerprints and in a significantly shorter time.

Citation: Śmieja M, Warszycki D (2016) Average Information Content Maximization—A New Approach for Fingerprint Hybridization and Reduction. PLoS ONE 11(1): e0146666. https://doi.org/10.1371/journal.pone.0146666

Editor: Paul Taylor, University of Edinburgh, UNITED KINGDOM

Received: September 11, 2015; Accepted: December 21, 2015; Published: January 19, 2016

Copyright: © 2016 Śmieja, Warszycki. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This study was fully supported by the National Centre of Science (Poland) grant no. 2014/13/N/ST6/01832. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Fingerprints are one of the most popular methods of converting chemical structures into a form that can be used in, e.g., machine learning experiments. They encode a compound’s structural features into a bitstring, where “1” and “0” mean the presence or absence, respectively, of a particular pattern. Fingerprints are divided into two subgroups: non-hashed fingerprints (e.g., Substructure fingerprint, Klekota-Roth fingerprint), which encode precisely defined structural patterns, and hashed fingerprints (e.g., Extended fingerprint, Graph-only fingerprint), whose bits have no assigned meaning (Fig 1). Fingerprints are widely used in classification problems and similarity searching; therefore, they have found application in computer-aided drug design campaigns [1–8].

Fig 1. Exemplary hashed (A) and non-hashed (B) fingerprints.

The presence of “1” and “0” corresponds to the presence or absence of a particular pattern, respectively. In the case of the hashed fingerprint (A), the bit collision phenomenon is shown: one bit encodes more than one motif.

https://doi.org/10.1371/journal.pone.0146666.g001
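The difference between the two fingerprint families, and the origin of bit collisions, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern names are hypothetical placeholders rather than real fingerprint keys, and `zlib.crc32` merely stands in for whatever hash function a real hashed fingerprint uses.

```python
import zlib

# Hypothetical substructure labels (assumption); real fingerprints use
# SMARTS patterns or atom paths instead of plain names.
PATTERNS = ["aromatic_ring", "tertiary_amine", "amide", "sulfonamide", "hydroxyl"]


def non_hashed_fingerprint(found, patterns=PATTERNS):
    """One dedicated bit per predefined pattern, so every bit has a fixed meaning."""
    return [1 if p in found else 0 for p in patterns]


def hashed_bit(pattern, n_bits):
    """Map a pattern to a bit position with a deterministic hash (CRC32 as a stand-in)."""
    return zlib.crc32(pattern.encode("utf-8")) % n_bits


def hashed_fingerprint(found, n_bits):
    """Set the hashed position of every pattern found; distinct patterns may collide."""
    bits = [0] * n_bits
    for p in found:
        bits[hashed_bit(p, n_bits)] = 1
    return bits


compound = {"aromatic_ring", "tertiary_amine"}
fp_keys = non_hashed_fingerprint(compound)
fp_hash = hashed_fingerprint(compound, n_bits=4)

# With 5 patterns and only 4 bits, at least two patterns must share a bit.
positions = [hashed_bit(p, 4) for p in PATTERNS]
has_collision = len(set(positions)) < len(PATTERNS)
```

In the non-hashed case a set bit can be traced back to exactly one pattern, whereas a set bit of the hashed vector may originate from any of the patterns that share its position.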

The multitude of structural features present in chemical compounds results in long fingerprints, the longest of which contains 4860 bits [9]. The physical impossibility of the occurrence of hundreds of chemical substructures in low-molecular-weight chemical compounds and the biological insignificance of many bits increase the noise level in classification experiments. Moreover, the high resolution of the data increases the computational time, which is crucial in large virtual screening cascades.

Therefore, the reduction of fingerprint length without the loss of any meaningful information has become an important cheminformatics challenge in recent years. Several methodologies, e.g., consensus fingerprints [10], bit scaling [11], reverse fingerprints [12] and bit silencing [13], were introduced to reduce fingerprints via the weighting of particular bits. Another approach, proposed by Nisius et al., selects fingerprint bits according to their discrimination power, which is measured by the Kullback-Leibler divergence [14]. The method was applied to single fingerprints as well as to collections of fingerprints, leading to a successful attempt at fingerprint hybridization [15].

In this study, we introduce a new method for fingerprint hybridization and reduction: Average Information Content Maximization (the AIC-Max algorithm). The algorithm uses an extended version of mutual information, hereafter referred to as the Average Information Content (AIC), to select the most informative bits of different fingerprints needed for separating active from inactive compounds. In contrast to the aforementioned techniques, the AIC-Max algorithm may construct an optimal fingerprint for several biological targets at once, which substantially extends its application area. The strength of the AIC-Max algorithm stems from the fact that the selection process evaluates the discrimination power of entire groups of bits instead of single ones. Consequently, the algorithm will not select two features that carry similar information.

The proposed methodology was applied to create a reduced representation dedicated to the analysis of five closely related serotonin receptors: 5-HT_2A, 5-HT_2B, 5-HT_2C, 5-HT_5A and 5-HT₆ (members of the G-protein coupled receptor superfamily) that play an important role in, e.g., the central nervous system (CNS) [16]. The algorithm was additionally tested on four other target families: carbonic anhydrases, cathepsins, histamine receptors and kinases (see S1 File). Although the advantages of hashed fingerprints cannot be denied, only non-hashed fingerprints were considered in the current study. Hashed fingerprints were deliberately excluded because they lack predefined substructural features and commonly suffer from the bit collision phenomenon (the same bit is set by multiple patterns) [17], which makes the structural interpretation of particular fingerprint coordinates nearly impossible. A hybrid fingerprint, reduced to 100 bits, reflects 99.77% of the information needed to distinguish active compounds from inactive ones (Fig 2) and contains structural patterns typical for serotonin receptor ligands, such as positively polarizable nitrogen atoms and aromatic systems.

Fig 2. The relationship between the number of bits selected by the AIC-Max algorithm and the information related to activity.

The information, measured by AIC Eq (1), was averaged over all datasets used in the underlying study.

https://doi.org/10.1371/journal.pone.0146666.g002

The reduced representation significantly outperformed four standard non-hashed fingerprints in a classification experiment and achieved slightly better results than the hashed fingerprints generated by PaDEL software [18] when a random forest classifier [19] was used. Moreover, the average training time of the random forest predictor compared to the Extended fingerprint was reduced almost 20 times. The constructed fingerprint generalized well to related biological targets such as the 5-HT_1A receptor, as shown by additional tests. The results indicate that the AIC-Max algorithm is an efficient method for fingerprint reduction and hybridization, opening new perspectives for both virtual screening campaigns and structural analysis of the chemical space covered by ligands acting on similar targets.

Materials and Methods

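The Average Information Content Maximization algorithm (AIC-Max algorithm) uses the notion of Average Information Content (AIC) to rank the features by their significance. The AIC quantifies the percentage of information that a set of features $\mathcal{X} = \{X_1, \ldots, X_N\}$ carries about the activity with respect to a set of $k$ biological receptors (the corresponding set of activity variables will be denoted by $\mathcal{Y} = \{Y_1, \ldots, Y_k\}$). The AIC is defined as the mutual information $MI(\mathcal{X}; Y_i)$ normalized by the entropy SE(Y_i) [20–22], averaged over $Y_i \in \mathcal{Y}$:

$$AIC_{\mathcal{Y}}(\mathcal{X}) = \frac{1}{k} \sum_{i=1}^{k} \frac{MI(\mathcal{X}; Y_i)}{SE(Y_i)} = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{SE(Y_i)} \sum_{x \in S_N} \sum_{y \in \{0,1\}} P_i(x;y) \log_2 \frac{P_i(x;y)}{P(x)\,P_i(y)}, \qquad (1)$$

where S_N = {0,1}^N is a set of all binary sequences of length N and P_i(y), P(x), P_i(x;y) denote the probabilities that {Y_i = y}, {X₁ = x₁, …, X_N = x_N}, {X₁ = x₁, …, X_N = x_N, Y_i = y}, respectively.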

If $\mathcal{X}$ fully determines the activity of all receptors, then $AIC_{\mathcal{Y}}(\mathcal{X}) = 1$; for $\mathcal{X}$ independent of all elements of $\mathcal{Y}$, it returns the value 0. A set of features that reflects all the information of the activity against l receptors and none of the information for the remaining (k − l) receptors gives $AIC_{\mathcal{Y}}(\mathcal{X}) = l/k$, as demonstrated in Table 1. For closely related biological targets, however, the most informative features usually overlap to a large extent.

Table 1. Minimal and maximal values of AIC.

The 3-bit fingerprint representation X₁ X₂ X₃ of eight compounds and their activity labels Y₁, Y₂, Y₃ given three biological targets, as listed in the table. Since the activity of the i-th receptor is fully determined by a single feature X_i, AIC_{Y_i}(X_i) = 1 for i = 1,2,3. In contrast, AIC_{Y_i}(X_j) = 0 for i ≠ j because Y_i is independent of X_j. Finally, $AIC_{\{Y_1, Y_2, Y_3\}}(X_i, X_j) = 2/3$ for i ≠ j, since the activity of two out of three receptors is fully reflected by two bits.

https://doi.org/10.1371/journal.pone.0146666.t001
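The limiting values of AIC described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: probabilities are estimated as plain frequencies, and the toy data are an assumption made in the spirit of Table 1 (all eight 3-bit fingerprints, with the i-th receptor active exactly when bit X_i is set). The function names `entropy`, `mutual_information` and `aic` are introduced here for illustration only.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def mutual_information(rows, labels):
    """MI between feature tuples (rows) and labels, estimated from frequencies."""
    n = len(labels)
    p_x, p_y, p_xy = Counter(rows), Counter(labels), Counter(zip(rows, labels))
    return sum(
        c / n * log2(c * n / (p_x[x] * p_y[y])) for (x, y), c in p_xy.items()
    )


def aic(datasets, feature_idx):
    """Average of MI(X; Y_i) / SE(Y_i) over receptors.

    datasets: one (fingerprints, labels) pair per receptor, so each receptor
    may come with its own set of compounds.
    feature_idx: positions of the fingerprint bits forming the evaluated set.
    Assumes every receptor has both active and inactive compounds (SE > 0).
    """
    total = 0.0
    for fingerprints, labels in datasets:
        rows = [tuple(fp[j] for j in feature_idx) for fp in fingerprints]
        total += mutual_information(rows, labels) / entropy(labels)
    return total / len(datasets)


# Assumed toy data: every 3-bit fingerprint once; receptor i follows bit i.
compounds = list(product([0, 1], repeat=3))
datasets = [(compounds, [fp[i] for fp in compounds]) for i in range(3)]

aic_all = aic(datasets, [0, 1, 2])   # every receptor is fully explained
aic_two = aic(datasets, [0, 1])      # two of three receptors are explained
aic_y1_x1 = aic(datasets[:1], [0])   # Y_1 is determined by X_1
aic_y1_x2 = aic(datasets[:1], [1])   # Y_1 is independent of X_2
```

Because `aic` takes a separate (fingerprints, labels) pair for every receptor, the compounds do not have to be shared between receptors; the toy example reuses one list only for brevity.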

The important point is that the value of AIC depends on the joint information contained in all features included in $\mathcal{X}$. In particular, if X₁ = X₂ then $AIC_{\mathcal{Y}}(X_1, X_2) = AIC_{\mathcal{Y}}(X_1) = AIC_{\mathcal{Y}}(X_2)$. The above equality always holds if the correlation between X₁ and X₂ equals 1. In other words, the repeated addition of the same feature does not increase the value of AIC. In contrast, the extension of the set of features by an additional element cannot decrease AIC, as illustrated in Table 2.

Table 2. Influence of dependent and independent bits on AIC.

The activity of a given receptor depends only on two out of four features: X₁ and X₂. The addition of feature X₃ to X₁ does not change AIC because it is independent of Y, which results in AIC_Y(X₁) = AIC_Y(X₁, X₃) = 0.38. The same holds for X₄, which is completely correlated with X₁, and AIC_Y(X₁) = AIC_Y(X₁, X₄) = 0.38.

https://doi.org/10.1371/journal.pone.0146666.t002
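The effect of redundant and independent bits can be reproduced on a small synthetic example. The data below are an assumption, not the content of Table 2: X₁, X₂ and X₃ take every combination, X₄ copies X₁, and a compound is taken to be active only when both X₁ and X₂ are set. Under this hypothetical rule the single-bit score rounds to 0.38, and neither the independent nor the duplicated bit changes it.

```python
from collections import Counter
from itertools import product
from math import log2


def normalized_mi(rows, labels):
    """MI(rows; labels) / SE(labels), estimated from empirical frequencies."""
    n = len(labels)
    p_x, p_y, p_xy = Counter(rows), Counter(labels), Counter(zip(rows, labels))
    mi = sum(c / n * log2(c * n / (p_x[x] * p_y[y])) for (x, y), c in p_xy.items())
    se = -sum(c / n * log2(c / n) for c in p_y.values())
    return mi / se


def select(fingerprints, idx):
    """Project every fingerprint onto the chosen bit positions."""
    return [tuple(fp[j] for j in idx) for fp in fingerprints]


# Assumed data: bits (X1, X2, X3, X4) with X4 = X1 and activity = X1 AND X2.
fingerprints = [(x1, x2, x3, x1) for x1, x2, x3 in product([0, 1], repeat=3)]
activity = [x1 & x2 for x1, x2, _, _ in fingerprints]

aic_x1 = normalized_mi(select(fingerprints, [0]), activity)
aic_x1_x3 = normalized_mi(select(fingerprints, [0, 2]), activity)  # independent bit
aic_x1_x4 = normalized_mi(select(fingerprints, [0, 3]), activity)  # duplicated bit
aic_x1_x2 = normalized_mi(select(fingerprints, [0, 1]), activity)  # informative bit
```

Only the addition of X₂, which carries new information about the activity, raises the score (here to 1, because X₁ and X₂ together determine the label).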

To calculate AIC for a given set of receptors $\mathcal{Y}$, the datasets of compounds for each $Y_i \in \mathcal{Y}$ can be created separately. This implies that a single instance (compound) does not need to have a known activity label for all considered receptors. It is an important property because most of the compounds have proven activity (or inactivity) only for one receptor. It is worth mentioning that this reasoning cannot be applied to classical mutual information, where the activity of every compound has to be provided to perform an analogous evaluation.

Given a set of all features $\mathcal{X}$ (fingerprint coordinates), the goal is to find an N-element subset $\tilde{\mathcal{X}} \subset \mathcal{X}$ such that $AIC_{\mathcal{Y}}(\tilde{\mathcal{X}})$ is maximal. In practice, it might be impossible to calculate AIC for all subsets of features to determine the most informative one (e.g., the number of m-element subsets of n features equals $\binom{n}{m}$, which even for n = 1000 and m = 10 gives about 2.6 ⋅ 10²³). The proposed AIC-Max algorithm uses a heuristic search in the space of all features to reduce the computational time of the entire selection process. It iteratively picks the coordinate $X$ that maximizes $AIC_{\mathcal{Y}}(\tilde{\mathcal{X}} \cup \{X\})$, where $\tilde{\mathcal{X}}$ denotes the set of already chosen features. The selection of N features is described as follows:

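AIC-Max algorithm:

Input: $\mathcal{X}$ – set of given features

Output: $\tilde{\mathcal{X}}$ – set of selected features

1. initialize $\tilde{\mathcal{X}} = \emptyset$,

2. iterate N times:

(a) find $X \in \mathcal{X} \setminus \tilde{\mathcal{X}}$ which maximizes $AIC_{\mathcal{Y}}(\tilde{\mathcal{X}} \cup \{X\})$,

(b) update $\tilde{\mathcal{X}} = \tilde{\mathcal{X}} \cup \{X\}$.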

To provide more efficient computations, the calculation of AIC in step 2a can be performed for a randomly selected n ≤ N element subset of $\tilde{\mathcal{X}}$; in the experiments we used n = 10.
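The greedy selection can be sketched as follows. This is a hedged illustration rather than the authors' code: probabilities are estimated as plain frequencies, the random subset used for scoring is assumed to be drawn from the already selected bits, and the names `aic_max`, `n_context` and `seed` are introduced here. The toy data are likewise an assumption. The last two lines compare the size of an exhaustive search with the number of candidate evaluations of the greedy search for the n = 1000, m = 10 case quoted in the text.

```python
import random
from collections import Counter
from math import comb, log2


def normalized_mi(rows, labels):
    """MI(rows; labels) / SE(labels), estimated from empirical frequencies."""
    n = len(labels)
    p_x, p_y, p_xy = Counter(rows), Counter(labels), Counter(zip(rows, labels))
    mi = sum(c / n * log2(c * n / (p_x[x] * p_y[y])) for (x, y), c in p_xy.items())
    se = -sum(c / n * log2(c / n) for c in p_y.values())
    return mi / se  # assumes both classes occur, so that se > 0


def aic(datasets, idx):
    """Average normalized MI over receptors for the bit positions in idx."""
    scores = [
        normalized_mi([tuple(fp[j] for j in idx) for fp in fps], labels)
        for fps, labels in datasets
    ]
    return sum(scores) / len(scores)


def aic_max(datasets, n_features, n_select, n_context=10, seed=0):
    """Greedy forward selection of n_select bit positions.

    Each candidate is scored jointly with at most n_context randomly drawn,
    already selected bits, which keeps the joint distribution small.
    """
    rng = random.Random(seed)
    selected = []
    for _ in range(n_select):
        context = rng.sample(selected, min(n_context, len(selected)))
        candidates = [j for j in range(n_features) if j not in selected]
        best = max(candidates, key=lambda j: aic(datasets, context + [j]))
        selected.append(best)
    return selected


# Assumed toy data: bits 0-2 take every combination and bit 3 duplicates bit 0.
fingerprints = [(a, b, c, a) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
datasets = [
    (fingerprints, [fp[0] for fp in fingerprints]),  # receptor A follows bit 0
    (fingerprints, [fp[1] for fp in fingerprints]),  # receptor B follows bit 1
]
selected = aic_max(datasets, n_features=4, n_select=2)

exhaustive = comb(1000, 10)                 # all 10-element subsets of 1000 bits
greedy = sum(1000 - t for t in range(10))   # candidate evaluations of the greedy search
```

On the toy data the second pick is bit 1 rather than the duplicate bit 3, because the duplicate adds nothing to the score once bit 0 is selected. The greedy search is a heuristic, so it is not guaranteed to return the globally optimal subset.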

The concept of the AIC is based on information theory and is partially related to the Asymmetric Clustering Index [23]. The most fundamental concept in information theory is Shannon entropy (SE), which quantifies the information contained in a given feature X [20]. Formally, if X takes values in {1, …, k}, then:

$$SE(X) = -\sum_{i=1}^{k} P(i) \log_2 P(i),$$

where P(i) is the probability of observing {X = i}. Note that SE(X) = 0 if X is constant. In contrast, if all values of X are equally probable, then SE attains a maximal value of log₂ k.
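The two limiting cases of Shannon entropy mentioned above are easy to verify. The snippet below is a small sketch that estimates probabilities as observed frequencies; the function name `shannon_entropy` is chosen here for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Entropy in bits: minus the sum of P(i) * log2 P(i) over observed values."""
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())


se_constant = shannon_entropy([1, 1, 1, 1])      # a constant feature carries no information
se_uniform = shannon_entropy([1, 2, 3, 4] * 5)   # four equally probable values
```

For the uniform case with k = 4 values the result equals log₂ 4 = 2 bits, the maximum attainable for four outcomes.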

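To measure the joint information shared by two features, the notion of mutual information (MI) has to be used [20]. For X and Y taking values in {1, …, k}, the MI is formulated as follows:

$$MI(X; Y) = \sum_{i=1}^{k} \sum_{j=1}^{k} P(i;j) \log_2 \frac{P(i;j)}{P(i)\,P(j)}, \qquad (2)$$

where P(i;j) is the probability that {X = i, Y = j}, and P(i), P(j) are the marginal probabilities of {X = i} and {Y = j}. It can also be naturally extended to sets of variables $\mathcal{X} = \{X_1, \ldots, X_n\}$ and $\mathcal{Y} = \{Y_1, \ldots, Y_k\}$: the indexes i and j in the above expression must be replaced by sequences of indexes (i₁, …, i_n), (j₁, …, j_k), respectively [20].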

The evaluation of MI for a set of features $\mathcal{X}$ and a set of receptors $\mathcal{Y}$ requires a single data set of chemical compounds and corresponding activity labels for all receptors. This makes it technically impossible to apply MI to determine the most informative subset of features with respect to various receptors, because there usually does not exist a representative data set in which each compound has proven activity or inactivity for every $Y_i \in \mathcal{Y}$.

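To overcome this problem, the calculation of $MI(\mathcal{X}; \mathcal{Y})$ was replaced by the computation of individual factors $MI(\mathcal{X}; Y_i)/SE(Y_i)$, each of which requires only the dataset of the i-th receptor. These partial results are gathered into final form by averaging:

$$AIC_{\mathcal{Y}}(\mathcal{X}) = \frac{1}{k} \sum_{i=1}^{k} \frac{MI(\mathcal{X}; Y_i)}{SE(Y_i)}.$$

The normalization by the entropy of Y_i ensures that every factor describes the percentage of joint information instead of the absolute amount of information. In particular, since $0 \le MI(\mathcal{X}; Y_i) \le SE(Y_i)$, every factor lies between 0 (when $\mathcal{X}$ is independent of Y_i) and 1 (when $\mathcal{X}$ fully determines Y_i).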

Results and Discussion

The experiments concerned the application of the AIC-Max algorithm to the selection of the most significant bits for ligands acting on five closely related biological receptors: 5-HT_2A, 5-HT_2B, 5-HT_2C, 5-HT_5A, 5-HT₆. Among all fingerprints generated in the PaDEL software, only non-hashed fingerprints were considered: EState, MACCS, PubChem and Substructure (possessing 1434 bits in total), to enable the structural analysis of selected bits (Table 3). Although hashed representations can be more efficient for classification purposes, their coordinates do not have a straightforward meaning. Therefore, they were not incorporated into the selection process. Moreover, the longest fingerprint (KRFP), although non-hashed, was skipped because a high number of bits results in a rapid increase of the computational time required by the feature selection process. Clearly, some of the chemical patterns can be duplicated when the above four fingerprints are concatenated. Nevertheless, since the repeated addition of the same feature does not increase the value of AIC, there is no risk that the algorithm will pick two identical (or even very similar) bits for the final representation.

Table 3. Fingerprints generated in PaDEL software [18].

https://doi.org/10.1371/journal.pone.0146666.t003

All ligands were extracted from ChEMBL database version 20 (February 2015) [27]. Ligands with an inhibition constant (K_i) less than or equal to 100 nM were considered active; ligands with K_i higher than 1000 nM were used as inactives. Putative inactive compounds were randomly selected from the ZINC database [28] in a ratio of 9 inactives per 1 active (Table 4) [29].

Table 4. The summary of datasets used in the selection process.

https://doi.org/10.1371/journal.pone.0146666.t004
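The labeling thresholds and the 9:1 sampling ratio described above can be expressed compactly. The sketch below makes several assumptions: compounds with K_i between the two thresholds are left unlabeled, the measurements and identifiers are hypothetical placeholders (not ChEMBL or ZINC records), and the helper names are introduced here for illustration.

```python
import random

ACTIVE_MAX_KI_NM = 100.0      # K_i <= 100 nM: active
INACTIVE_MIN_KI_NM = 1000.0   # K_i > 1000 nM: inactive


def label_by_ki(ki_nm):
    """Return 'active', 'inactive', or None for the intermediate range (assumed unlabeled)."""
    if ki_nm <= ACTIVE_MAX_KI_NM:
        return "active"
    if ki_nm > INACTIVE_MIN_KI_NM:
        return "inactive"
    return None


def sample_putative_inactives(pool, n_actives, ratio=9, seed=0):
    """Draw ratio * n_actives distinct compounds at random from a background pool."""
    return random.Random(seed).sample(pool, ratio * n_actives)


# Hypothetical measurements: (compound id, K_i in nM).
measurements = [("c1", 12.0), ("c2", 100.0), ("c3", 450.0), ("c4", 1000.0), ("c5", 5200.0)]
labels = {cid: label_by_ki(ki) for cid, ki in measurements}
actives = [cid for cid, lab in labels.items() if lab == "active"]

background = [f"background_{i}" for i in range(100)]   # placeholder identifiers
decoys = sample_putative_inactives(background, len(actives))
```

Note that the boundary values behave asymmetrically: a compound with K_i of exactly 100 nM is active, while one with exactly 1000 nM is not yet inactive.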

To evaluate the significance of the selected features, a 10-fold cross-validation was performed [30]. In this approach, a dataset is randomly partitioned into 10 equally sized subsets. Then, a single subset is retained as test data while the remaining 9 subsets are used in training. This process is repeated 10 times—each of 10 subsamples is used exactly once as the test data, and the results are averaged. The AIC-Max algorithm was run on a training data set (including actives, inactives and putative inactives), and the evaluation of selected features was reported for a test set. The score was measured by the normalized mutual information Eq (2) between the constructed representation and the true activity labels for each of the receptors.
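The cross-validation protocol can be sketched without any external library. This is an illustrative outline only: the `evaluate` callback is a placeholder for the step that would run the feature selection on the training part and score the selected bits on the test part, and the function names are introduced here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]   # fold sizes differ by at most one
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Placeholder scorer that only reports the relative size of the test part.
mean_test_fraction = cross_validate(95, lambda train, test: len(test) / 95)
```

Running the selection inside each fold, on training data only, keeps the test compounds unseen by both the feature selection and the scoring.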

Information stored in a reduced fingerprint grows gradually with the increase in the number of features selected by the AIC-Max algorithm (Fig 3). The level of 90% was rapidly attained by a representation containing approximately 20 bits, both for datasets containing true inactives and for those containing compounds selected from ZINC. Nevertheless, to distinguish almost all considered active compounds from inactives, a set of 100 bits is required (more than 99% of information), while for putative inactives, only 30 bits suffice (close to 100% of information). This outcome is due to two particular reasons: the close structural similarity between actives and true inactives and the small number of compounds with confirmed inactivity (Table 4).

Fig 3. The relationship between the number of bits selected by the AIC-Max algorithm and associated information of activity.

The information score was measured by the normalized mutual information calculated for constructed representations for every receptor averaged over all folds reported on a test set.

https://doi.org/10.1371/journal.pone.0146666.g003

Because the AIC-Max algorithm returned slightly different subsets of bits in each fold, the algorithm was additionally applied to the entire dataset to obtain a single set of features. The reduced fingerprint (see S1 File for details) contained features that are crucial in ligand-protein interaction for serotonin receptors: a positively polarizable nitrogen atom and an aromatic system [31]. Moreover, the bit encoding the tertiary nitrogen atom is the most desirable in the reduction and hybridization process. Polarizable nitrogen atoms are encoded by several bits listed among the top-scored instances. The same situation can also be observed for the aromatic system, which appears three times among the 10 most desirable bits. Amide and sulfonamide moieties (and their subelements) are other popular patterns present in the universal fingerprint, which reflects current trends in medicinal chemistry [32–36].

The quality of the bits chosen by the AIC-Max algorithm was verified in a classification experiment conducted for the ligands of the 5 underlying serotonin receptors. As a classification method, the random forests technique [19] implemented in the randomForest R package was used because it is known to be one of the state-of-the-art approaches in activity prediction [6]. The accuracy of classification was evaluated via the Matthews Correlation Coefficient (MCC), a well-known validation measure, especially for imbalanced datasets. This measure is defined as [37]:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP stands for the number of true positives (actives labeled as actives), TN for true negatives, FP for false positives (inactives labeled as actives) and FN for false negatives. MCC takes values from −1 to +1: the value +1 represents a perfect prediction, 0 represents a random prediction and −1 represents an inverse prediction.
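The coefficient can be computed directly from the four confusion-matrix counts using the standard definition of MCC. The sketch below is illustrative; returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the study.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0   # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=10, tn=90, fp=0, fn=0)         # every compound classified correctly
inverse = mcc(tp=0, tn=0, fp=90, fn=10)         # every compound classified wrongly
all_inactive = mcc(tp=0, tn=90, fp=0, fn=10)    # trivial "always inactive" predictor
```

The last case illustrates why MCC suits imbalanced data: with 9 inactives per active, a predictor that labels everything as inactive reaches 90% accuracy but an MCC of 0.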

The experiment also assumed a 10-fold cross-validation procedure; a training set was used for a selection of bits and training of a classifier which was then evaluated on a test set. In each fold the AIC-Max algorithm was run for a merged set of actives, inactives and putative inactives to enforce generality of representation. On the other hand, the classifier was trained and tested separately on compounds of proven activity and on datasets containing active and putative inactive compounds.

The addition of new features leads to a statistical improvement of the classification results (Fig 4). The highest increase was reported for representations including fewer than 20 bits. For a higher number of features, the classification accuracy changes only slightly. Because the gain in MCC value for representations containing more than 100 bits is negligible, longer representations were not taken into further consideration.

Fig 4. Classification performance.

The relationship between the number of bits selected by AIC-Max algorithm and associated MCC score for every receptor averaged over all folds reported on a test set.

https://doi.org/10.1371/journal.pone.0146666.g004

The classification performance of the representations created for 25, 50 and 100 bits was then compared with the original (raw) fingerprints (Tables 5 and 6). The reduced representations including 100 as well as 50 bits outperformed existing fingerprints on all receptors when putative inactive compounds were used. This case is considered the most important one because it reflects virtual screening campaigns [29]. In the case of true inactives, the average MCC score of the representation including 100 coordinates was comparable to the best performing hashed fingerprints. Moreover, the time required for training a classifier was approximately 17 times lower when the reduced 100-bit representation was used instead of any of the hashed fingerprints (Fig 5).

Table 5. Classification performance on a dataset containing actives and inactives.

https://doi.org/10.1371/journal.pone.0146666.t005

Table 6. Classification performance on a dataset containing actives and putative inactives.

https://doi.org/10.1371/journal.pone.0146666.t006

Fig 5. Classification times.

Mean training times of a random forest classifier for various fingerprint representations averaged over all data sets of active and inactive compounds.

https://doi.org/10.1371/journal.pone.0146666.g005

Finally, the generalization ability of the created representation to another serotonin receptor was examined. A classification experiment was conducted on 5-HT_1A receptor ligands using the reduced representation selected for the five base receptors. Surprisingly, the Extended fingerprint achieved perfect precision for the first dataset including compounds with proven activity or inactivity (Table 7). Although the reduced representation gave a significantly lower result, MCC = 0.663, it performed better than any of the non-hashed fingerprints. In the case of putative inactives, the performance of the constructed representation was slightly better than that of the MACCS and Extended fingerprints.

Table 7. Classification performance on a dataset containing active and inactive compounds of 5-HT_1A receptor (middle column) as well as actives and putative inactives (last column).

The reduced representation was constructed from four non-hashed fingerprints based on five biological targets (first 3 rows). The reduced representation from all fingerprints (except KRFP) was also evaluated (last row).

https://doi.org/10.1371/journal.pone.0146666.t007

To complement the study and investigate more deeply the discriminative power of the Extended fingerprint, we also considered a representation created from all fingerprints (Table 3) except KRFP, including the hashed ones. The results (Table 7) showed that the addition of bits from the hashed fingerprints significantly improved the statistics and gave almost ideal separation of actives from inactives.

Analogous experiments were also conducted for four other families of biological targets: carbonic anhydrases, cathepsins, histamine receptors and kinases (see S1 File).

Conclusion

The paper introduced the AIC-Max algorithm as a method for fingerprint reduction and hybridization. The algorithm iteratively picks features that are uncorrelated among themselves to maximize AIC, a modified version of mutual information. In the present study, the algorithm was applied to construct an essential representation of ligands of five families of closely related targets. Such a representation can compete with raw fingerprints in classification experiments with a significant reduction in CPU time. The obtained results confirm that existing fingerprints contain much irrelevant information that may negatively influence screening performance. The conducted experiments indicate that the generation and application of a reduced and hybridized fingerprint allow rapid and effective calculations. The power of the methodology is underlined by the presence, in the universal representation, of bits that encode the most important structural features for serotonin receptor ligands: a polarizable nitrogen atom and an aromatic system.

Supporting Information

S1 File. The additional file, which can be retrieved from: http://www.ii.uj.edu.pl/~smieja/aic, contains the full list of the 100 most informative bits selected from four non-hashed fingerprints for five GPCRs (Table A in S1 File) and the results of experiments conducted for the families of carbonic anhydrases (Tables B, F, J and K in S1 File), cathepsins (Tables C, G, L and M in S1 File), histamine receptors (Tables D, H, N and O in S1 File) and kinases (Tables E, I, Q and P in S1 File).

https://doi.org/10.1371/journal.pone.0146666.s001

(PDF)

Acknowledgments

This study was supported by the National Centre of Science (Poland) grant no. 2014/13/N/ST6/01832.

The authors are very grateful to the reviewers for many useful remarks and for suggesting the extensions of the experiments on different biological targets. We would also like to thank professor Jacek Tabor and professor Andrzej Bojarski for their invaluable contribution to our work, discussions and criticism.

Author Contributions

Conceived and designed the experiments: MŚ DW. Performed the experiments: MŚ DW. Analyzed the data: MŚ DW. Contributed reagents/materials/analysis tools: MŚ DW. Wrote the paper: MŚ DW.

References

1. Kurczab R, Nowak M, Chilmonczyk Z, Sylte I, Bojarski AJ. The development and validation of a novel virtual screening cascade protocol to identify potential serotonin 5-HT 7 R antagonists. Bioorganic & medicinal chemistry letters. 2010;20(8):2465–2468.
2. Zajdel P, Kurczab R, Grychowska K, Satała G, Pawłowski M, Bojarski AJ. The multiobjective based design, synthesis and evaluation of the arylsulfonamide/amide derivatives of aryloxyethyl-and arylthioethyl-piperidines and pyrrolidines as a novel class of potent 5-HT 7 receptor antagonists. European journal of medicinal chemistry. 2012;56:348–360. pmid:22926225
3. Gabrielsen M, Kurczab R, Siwek A, Wolak M, Ravna AW, Kristiansen K, et al. Identification of novel serotonin transporter compounds by virtual screening. Journal of chemical information and modeling. 2014;54(3):933–943. pmid:24521202
4. Witek J, Smusz S, Rataj K, Mordalski S, Bojarski AJ. An application of machine learning methods to structural interaction fingerprints—a case study of kinase inhibitors. Bioorganic & medicinal chemistry letters. 2014;24(2):580–585.
5. Smusz S, Kurczab R, Satała G, Bojarski AJ. Fingerprint-based consensus virtual screening towards structurally new 5-HT 6 R ligands. Bioorganic & medicinal chemistry letters. 2015;25(9):1827–1830.
6. Smusz S, Mordalski S, Witek J, Rataj K, Kafel R, Bojarski AJ. Multi-Step Protocol for Automatic Evaluation of Docking Results Based on Machine Learning Methods? A Case Study of Serotonin Receptors 5-HT6 and 5-HT7. Journal of chemical information and modeling. 2015;55(4):823–832. pmid:25806997
7. Staroń J, Warszycki D, Kalinowska-Tłuścik J, Satała G, Bojarski AJ. Rational design of 5-HT 6 R ligands using a bioisosteric strategy: synthesis, biological evaluation and molecular modelling. RSC Advances. 2015;5(33):25806–25815.
8. Czarnecki WM, Tabor J. Multithreshold entropy linear classifier: Theory and applications. Expert Systems with Applications. 2015;42(13):5591–5606.
9. Klekota J, Roth FP. Chemical substructures that enrich for biological activity. Bioinformatics. 2008;24(21):2518–2525. pmid:18784118
10. Shemetulskis NE, Weininger D, Blankley CJ, Yang J, Humblet C. Stigmata: an algorithm to determine structural commonalities in diverse datasets. Journal of chemical information and computer sciences. 1996;36(4):862–871. pmid:8768771
11. Xue L, Stahura FL, Bajorath J. Similarity search profiling reveals effects of fingerprint scaling in virtual screening. Journal of chemical information and computer sciences. 2004;44(6):2032–2039. pmid:15554672
12. Williams C. Reverse fingerprinting, similarity searching by group fusion and fingerprint bit importance. Molecular diversity. 2006;10(3):311–332. pmid:17031535
13. Wang Y, Bajorath J. Bit silencing in fingerprints enables the derivation of compound class-directed similarity metrics. Journal of chemical information and modeling. 2008;48(9):1754–1759. pmid:18698839
14. Nisius B, Vogt M, Bajorath J. Development of a Fingerprint Reduction Approach for Bayesian Similarity Searching Based on Kullback-Leibler Divergence Analysis. Journal of chemical information and modeling. 2009;49(6):1347–1358. pmid:19480403
15. Nisius B, Bajorath J. Reduction and recombination of fingerprints of different design increase compound recall and the structural diversity of hits. Chemical biology & drug design. 2010;75(2):152–160.
16. McCorvy JD, Roth BL. Structure and function of serotonin G protein-coupled receptors. Pharmacology & therapeutics. 2015;150:129–142.
17. Raevsky OA. Molecular structure descriptors in the computer-aided design of biologically active compounds. Russian chemical reviews. 1999;68(6):505–524.
18. Yap CW. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry. 2011;32(7):1466–1474. pmid:21425294
19. Breiman L. Random forests. Machine learning. 2001;45(1):5–32.
20. Cover TM, Thomas JA. Elements of information theory. John Wiley & Sons; 2012.
21. MacKay DJ. Information theory, inference and learning algorithms. Cambridge university press; 2003.
22. Spurek P, Tabor J. The memory center. Information Sciences. 2013;252:132–143.
23. Śmieja M, Warszycki D, Tabor J, Bojarski AJ. Asymmetric Clustering Index in a Case Study of 5-HT1A Receptor Ligands. PLoS ONE. 2014;9(7): e102069. pmid:25019251
24. Hall LH, Kier LB. Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. Journal of Chemical Information and Computer Sciences. 1995;35(6):1039–1045.
25. Ewing T, Baber JC, Feher M. Novel 2D fingerprints for ligand-based virtual screening. Journal of Chemical Information and Modeling. 2006;46(6):2423–2431. pmid:17125184
26. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E. The Chemistry Development Kit (CDK): An open-source Java library for chemo-and bioinformatics. Journal of Chemical Information and Computer Sciences. 2003;43(2):493–500. pmid:12653513
27. Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, et al. The ChEMBL bioactivity database: an update. Nucleic acids research. 2014;42(D1):D1083–D1090. pmid:24214965
28. Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG. ZINC: a free tool to discover chemistry for biology. Journal of chemical information and modeling. 2012;52(7):1757–1768. pmid:22587354
29. Kurczab R, Smusz S, Bojarski AJ. The influence of negative training set size on machine learning-based virtual screening. Journal of cheminformatics. 2014;6(1):32. pmid:24976867
30. Alpaydin E. Introduction to Machine Learning. The MIT Press; 2009.
31. Bojarski AJ. Pharmacophore models for metabotropic 5-HT receptor ligands. Current topics in medicinal chemistry. 2006;6(18):2005–2026. pmid:17017971
32. Zajdel P, Pawlowski M, Martinez J, Subra G. Combinatorial chemistry on solid support in the search for central nervous system agents. Combinatorial chemistry & high throughput screening. 2009;12(7):723–739.
33. Zajdel P, Marciniec K, Maślankiewicz A, Satała G, Duszyńska B, Bojarski AJ, et al. Quinoline- and isoquinoline-sulfonamide derivatives of LCAP as potent CNS multi-receptor –5-HT1A/5-HT2A/5-HT7 and D2/D3/D4 agents: The synthesis and pharmacological evaluation. Bioorganic & medicinal chemistry. 2012;20(4):1545–1556.
34. Partyka A, Chłoń-Rzepa G, Wasik A, Jastrzebska-Wiesek M, Bucki A, Kołaczkowski M, et al. Antidepressant- and anxiolytic-like activity of 7-phenylpiperazinylalkyl-1,3-dimethyl-purine-2,6-dione derivatives with diversified 5-HT1A receptor functional profile. Bioorganic & medicinal chemistry. 2015;23(1):212–221.
35. Canale V, Kurczab R, Partyka A, Satała G, Witek J, Jastrzebska-Wiesek M, et al. Towards novel 5-HT7 versus 5-HT1A receptor ligands among LCAPs with cyclic amino acid amide fragments: Design, synthesis, and antidepressant properties. Part II. European journal of medicinal chemistry. 2015;92:202–211. pmid:25555143
36. Chłoń-Rzepa G, Zagórska A, Bucki A, Kołaczkowski M, Pawłowski M, Satała G, et al. New Arylpiperazinylalkyl Derivatives of 8-Alkoxy-purine-2,6-dione and Dihydro[1,3]oxazolo[2,3-f]purinedione Targeting the Serotonin 5-HT1A/5-HT2A/5-HT7 and Dopamine D2 Receptors. Archiv der Pharmazie. 2015;348(4):242–253. pmid:25773907
37. Fawcett T. An introduction to ROC analysis. Pattern recognition letters. 2006;27(8):861–874.


