
Discovering Associations in Biomedical Datasets by Link-based Associative Classifier (LAC)

  • Pulan Yu,

    Affiliation School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States of America

  • David J. Wild

    djwild@indiana.edu

    Affiliation School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States of America

Abstract

Associative classification mining (ACM) can provide predictive models with high accuracy as well as interpretability. However, traditional ACM ignores differences in significance among the features used for mining. Although weighted associative classification mining (WACM) addresses this issue by assigning different weights to features, most implementations can only be used when pre-assigned weights are available. In this paper, we propose a link-based approach to derive weight information automatically from a dataset, using link-based models which treat the dataset as a bipartite graph. By combining this link-based feature weighting method with a traditional ACM method, classification based on associations (CBA), a Link-based Associative Classifier (LAC) is developed. We then demonstrate the application of LAC to biomedical datasets for discovering associations between chemical compounds and bioactivities or diseases. The results indicate that the novel link-based weighting method is comparable to the support vector machine (SVM) and RELIEF methods and is capable of capturing significant features. Additionally, LAC is shown to produce models with high accuracies and to discover interesting associations which may otherwise remain unrevealed by traditional ACM.

Introduction

Chemical and biological data contain information about various characteristics of compounds, genes, proteins, pathways and diseases. A wide spectrum of data mining methods is therefore used to identify relationships in these large, multidimensional datasets and to generate predictive models with high accuracy and interpretability. Recently, associative classification mining (ACM) has been widely used for this purpose [1]–[4]. ACM is a data mining framework that uses association rule mining (ARM) techniques to construct classification systems, also known as associative classifiers. An associative classifier consists of a set of classification association rules (CARs) [5] of the form X→Y, whose right-hand side Y is restricted to the class attribute; X→Y can be read simply as "if X then Y". ARM was introduced by Agrawal et al. [6] to discover rules satisfying user-specified constraints, denoted respectively by the minimum support (minsup) and minimum confidence (minconf) thresholds. Consider a dataset in which each row represents a compound, each column (called an item, feature or attribute) is the test result of that compound on a tumor cell line, and every compound is labeled with the class active or inactive. A possible classification association rule is {MCF7 inactive, HL60 (TB) inactive → inactive} with support = 0.6 and confidence = 0.8. This rule states that when a compound is inactive against both the MCF7 and HL60 (TB) cell lines, it tends to be inactive overall. The support, the probability of a compound being inactive against both MCF7 and HL60 (TB) and classified as inactive, is 0.6; the confidence, the probability of a compound being inactive given that it is inactive against both MCF7 and HL60 (TB), is 0.8. In ACM, the relationship between attributes and class is based on the analysis of their co-occurrences within the database, so it can reveal interesting correlations or associations among them. For this reason, it has been applied in the biomedical domain, especially to gene expression relations [7]–[11], protein-protein interactions [12], protein-DNA interactions [13], and genotype-phenotype mapping [14], inter alia.
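As a concrete illustration, support and confidence can be computed directly from co-occurrence counts. The sketch below uses a tiny made-up table of five compounds (the counts, and hence the resulting values 0.6 and 0.75, are invented for illustration and do not come from the paper's data):

```python
# Toy illustration of support and confidence for a classification
# association rule (CAR). The cell-line names mirror the example in
# the text, but the data rows are invented.
rows = [
    # (MCF7, HL60_TB, class)
    ("inactive", "inactive", "inactive"),
    ("inactive", "inactive", "inactive"),
    ("inactive", "inactive", "inactive"),
    ("inactive", "inactive", "active"),
    ("active",   "inactive", "inactive"),
]

# Rule: {MCF7 inactive, HL60(TB) inactive} -> inactive
lhs_rows = [r for r in rows if r[0] == "inactive" and r[1] == "inactive"]
both_rows = [r for r in lhs_rows if r[2] == "inactive"]

support = len(both_rows) / len(rows)         # P(LHS and class together)
confidence = len(both_rows) / len(lhs_rows)  # P(class | LHS)
print(support, confidence)                   # 0.6 0.75
```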

Traditional ACM does not consider feature weights, so all features are treated identically, namely with equal weight. In reality, however, the importance of features/items differs. For instance, {beef → beer} with support = 0.01 and confidence = 0.8 may be more important than {chips → beer} with support = 0.03 and confidence = 0.85, even though the former has lower support and confidence: the items in the first rule bring more profit per unit sale, so they are more valuable. Wang et al. [15]–[17] proposed a framework called weighted association rule mining (WARM) to address the importance of individual attributes. The main idea is that a numerical weight can be assigned to every attribute to represent its significance. For example, {Hypertension = yes, age>50 → Heart_Disease} with {Hypertension = yes, 0.8} and {age>50, 0.3} is a rule mined by WARM: the importance of hypertension and of age>50 to heart disease differs and is denoted by the values 0.8 and 0.3 respectively. The major difference between ARM and WARM is how the support is computed, and several frameworks have been developed to incorporate weight information into the support calculation [15]–[22]. Studies of WARM have been carried out using pre-assigned weights; nonetheless, most datasets do not contain such pre-assigned weight information.

Figure 1. The bipartite model of a dataset.

(The bipartite model is also a heterogeneous system. Blue represents active compounds and red inactive compounds, with both contributing to the green feature/attribute node.)

https://doi.org/10.1371/journal.pone.0051018.g001

In machine learning, feature selection and feature weighting are broadly used to deal with the significance of features and to derive weight information automatically from the dataset itself. Feature selection is a technique of selecting a subset of relevant features by removing features of low significance; feature weighting is a technique of approximating the optimal degree of influence of individual features. Feature weighting preserves all features by assigning smaller weights to relatively insignificant ones, and has the advantage of taking all features into account as well as not requiring a search for an appropriate cutoff threshold [23]. In some circumstances it may be the only option, when eliminating features with a low contribution to classification is inappropriate. In particular, for understanding the overall relationship between genes and a disease, a small subset of genes, although having good predictive ability, may not have sufficient discriminating power [24]. Like feature selection, feature weighting approaches fall into two categories: 1) filter methods, which are performed in a pre-processing step before modeling; 2) wrapper methods, which are iterative and generally use the same learning algorithm as the modeling itself, with the relevancy evaluation result used for feature weighting. Usually, wrapper methods perform better than filter methods, while filter methods are faster and cheaper.

Sun et al. [25] proposed a link-based filter feature weighting approach in which the weights are derived from the dataset itself by extending Kleinberg's HITS (Hyperlink-Induced Topic Search) model [26] and algorithm to bipartite graphs. HITS and PageRank are the two major link-based ranking algorithms. PageRank was developed by Brin and Page [27] and has been used with great commercial success in the Google search engine. HITS ranks webpages by analyzing their in-links and out-links: webpages pointed to by many other pages are called "authorities", while webpages linking to many other pages are called "hubs". HITS emphasizes the notion of "mutual reinforcement" between authorities and hubs; the intuitive interpretation is that a good authority is pointed to by many good hubs and a good hub points to many good authorities. PageRank uses the very similar idea that a "good" webpage should be linked to by, or link to, other "good" webpages. Unlike the mutual-reinforcement approach, it focuses on hyperlink weight normalization and on web surfing based on random walk models. Both approaches have pros and cons. The computation of PageRank is stable and its behavior is well defined thanks to its probabilistic interpretation. Furthermore, PageRank can be used on large page collections because, even though larger communities affect the final ranking, they do not overwhelm the small ones. In contrast, HITS is not stable and cannot be applied to large page collections, since only the largest web community influences the final ranking; however, it can capture the relationships among webpages in more detail [28]. Hence, an algorithm capable of integrating both HITS and PageRank may improve Sun's weighting method.

The general PageRank cannot be applied to bipartite graphs, as it produces different rankings for webpages with the same in-links [29]; as a result, a better scheme is needed for ranking in bipartite graphs while integrating PageRank and HITS [30]. SALSA (stochastic approach for link-structure analysis) [31]–[33] combines the random surfer model of PageRank with the hub/authority principle of HITS. It generates a bipartite undirected graph H based on the web graph G: one subset of H contains all the nodes with positive in-degree (the potential "authorities") and the other subset contains all the nodes with positive out-degree (the potential "hubs"). A traversal is completed by a two-step random walk, for example from a "hub" to an "authority" and from the "authority" back to a "hub". As in PageRank, each individual walk is a Markov process with a well-defined transition probability matrix [31]. Nevertheless, SALSA does not really implement the "mutual reinforcement" of HITS, because the authority and hub scores are not related by hub-to-authority and authority-to-hub reinforcement operations; its score propagation differs from that of HITS (it is a similarity-mediated score propagation). Moreover, its random walk model does not directly simulate the behavior of the surfer in PageRank either: in SALSA a surfer can jump from webpage pi to pj even though there is no hyperlink between them, and there are no link-interrupt jumps. Based on a similar approach to SALSA, Ding et al. proposed a unified framework integrating HITS and PageRank [34].

Figure 1 shows that a dataset can equally be represented by a bipartite graph [25]: the table layout on the left can be represented by the bipartite graph on the right. Compounds and features linked to each other can be viewed as webpages; consequently, the link-based algorithms used to rank webpages, such as HITS or PageRank, can be used to rank compounds or features. These algorithms hold that if a webpage has many important links to it, the links from it to other webpages become important too. In our case, this means a highly weighted compound should contain many highly weighted features, and a highly weighted feature should occur in many highly weighted compounds. Accordingly, the ranking score can be used for feature weighting. Although Ding's unified framework can derive the ranking score automatically, it cannot distinguish the contributions of different types of connections. In chemical dataset mining, each chemical feature may connect to both active and inactive compounds; in biological dataset mining, each gene may connect to a disease either as a suppressor or as an activator. Chemical features occurring frequently in active compounds, or genes mainly associated with suppressors, are of greater interest. In Figure 1, when we consider the contribution of compounds to the weight of the node/attribute 78, we want to distinguish the contribution of compound 5469540 from those of compounds 840827 and 5911714. Ding's unified framework treats the contributions of the nodes equally, as a homogeneous system [34]; Chen et al. developed a framework that calculates weights for either homogeneous or heterogeneous systems [35]. In Chen's model, connections can have different impacts on a node.

In this paper, we describe a link-based unified weighting framework which combines the mutual reinforcement of HITS with hyperlink weighting normalization of PageRank based on Ding and Chen’s frameworks, resulting in highly efficient link-based weighted associative classifier mining from biomedical datasets without pre-assigned weight information.

Our main contributions are: 1) development of a novel link-based weighting scheme for mining biomedical datasets; 2) implementation of a novel link-based associative classifier by combining the feature weighting method, weighted association rule mining (WARM) and the CBA algorithm [5]; 3) application of this method to two important biomedical datasets.

In the following sections, the dataset, link-based feature weighting, WARM and algorithm of LAC will be discussed, followed by the application of LAC to two datasets. In the end, we present our conclusions and future work.

Table 3. Supports and types of itemsets (frequent or not).

https://doi.org/10.1371/journal.pone.0051018.t003

Materials and Methods

1. Data Set

LAC is applied to two datasets: a) the Ames mutagenicity dataset [36]; b) the NCI-60 tumor cell line dataset [37]. The Ames dataset contains 6,512 compounds provided in SMILES format and has been benchmarked with SVM, Random Forests, k-Nearest Neighbors, and Gaussian Processes; the original authors used 5-fold cross validation to evaluate the generated models, with the area under the ROC curve (AUC), ranging from 0.79 to 0.86, used to assess performance. The GI50 data of NCI-60, the concentration of an anti-cancer drug that inhibits the growth of cancer cells by 50%, is used and processed as follows. First, among the 60 tumor cell lines, IGR-OV1, MDA-MB-468 and MDA-N are removed because of too many missing values. Then, compounds having missing values are also discarded. The final dataset includes 5,937 compounds with 57 bioassay results in total. For the Ames dataset, a positive compound is carcinogenic; for NCI-60, a compound is "active" only if its GI50 value is greater than 5.

2. MDL Public Keys

The MDL public key set, also called the MACCS key set, is a 166-bit string in which each bit encodes a predefined chemical structure feature. MDL public keys are extensively used in biomedical research because of their relatively high performance and the one-to-one mapping between structural features and fingerprint bits [37], [38]. The fingerprint is computed using the CDK software package [39] and reformatted for LAC.

3. Bio Fingerprint

Bioassay readouts have been used as features ("biospectra" or "bio fingerprints") for data mining in several studies and have produced high-quality models [40], [41]. These bioactivity profiles link potential targets with chemical compounds and provide insights into the relationships among diseases, compounds and bioactivities. In this study, the results of related bioassay analyses are used as features for the classification of chemical compounds. Each GI50 value is transformed into "active" (GI50 greater than or equal to 5) or "inactive" (GI50 less than 5). T-47D is used as the class label and the results from the other cell lines are used as features.
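The discretization described above is straightforward to sketch. The threshold of 5 follows the text; the GI50 values below are invented and `binarize` is a hypothetical helper, not part of the paper's code:

```python
# Sketch of the bio-fingerprint discretization: each GI50 readout
# becomes "active" (>= 5) or "inactive" (< 5), with T-47D held out
# as the class label. Values here are illustrative only.
THRESHOLD = 5.0

def binarize(gi50_by_cell_line, label_line="T-47D"):
    """Split one compound's GI50 profile into binary features plus a label."""
    features = {line: ("active" if v >= THRESHOLD else "inactive")
                for line, v in gi50_by_cell_line.items() if line != label_line}
    label = "active" if gi50_by_cell_line[label_line] >= THRESHOLD else "inactive"
    return features, label

feats, label = binarize({"MCF7": 6.2, "HL60(TB)": 4.1, "T-47D": 5.8})
print(feats, label)
```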

Table 5. The rankings of chemical features from frequency and LAC.

https://doi.org/10.1371/journal.pone.0051018.t005

For each of the 6,512 compounds in the Ames data, we attempt to predict whether it is carcinogenic based on the MDL public keys. For the 5,937 compounds in NCI-60, we first use the bio fingerprint to predict whether they are agonists or antagonists of the T-47D cell line. Then, for the 3,199 compounds in the NCI-60 dataset that have 2D structures available in the downloaded structure file, a hybrid fingerprint is generated by combining the MDL public keys and the bio fingerprint to build models.

Let L = (Lij) be the adjacency matrix of the web graph G = (V, E), where V is the set of webpages and E is the set of links between them. Lij = 1 if page i links to page j and Lij = 0 otherwise; L^T denotes the transpose of L. If the graph is directed, the in-degree matrix Din and the out-degree matrix Dout are also defined. Given the vectors din = (b1, b2, …, bn)^T, where bj = Σi Lij is the in-degree of page j, and dout = (o1, o2, …, on)^T, where oj = Σi Lji is the out-degree of page j, Din is the diagonal matrix Din = diag(din) and Dout = diag(dout).
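For a small concrete case, L, din, dout, Din and Dout can be built directly from an edge list (the three-page graph below is made up for illustration):

```python
# Constructing L, Din and Dout for a tiny directed graph, following
# the definitions above. Pages are indexed 0..n-1 here.
edges = [(0, 1), (0, 2), (1, 2)]   # page 0 links to 1 and 2, page 1 to 2
n = 3

L = [[0] * n for _ in range(n)]
for i, j in edges:
    L[i][j] = 1

d_in  = [sum(L[i][j] for i in range(n)) for j in range(n)]  # column sums
d_out = [sum(L[i][j] for j in range(n)) for i in range(n)]  # row sums

# Din = diag(d_in), Dout = diag(d_out)
Din  = [[d_in[i]  if i == j else 0 for j in range(n)] for i in range(n)]
Dout = [[d_out[i] if i == j else 0 for j in range(n)] for i in range(n)]
print(d_in, d_out)  # [0, 1, 2] [2, 1, 0]
```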

4. HITS

In HITS, the vectors x = (x1,x2,…,xn)^T and y = (y1,y2,…,ym)^T represent the authority and hub scores respectively. HITS defines the recursive equations:

x(k) = L^T y(k−1)  (1)

y(k) = L x(k)  (2)

where k ≥ 1, y(0) = e, e is a vector of all 1s, and x(k) denotes the k-th iteration. Equation 1 says that authoritative pages are those linked to by good hub pages, and equation 2 says that good hubs are pages that link to authoritative pages. The equations can be rewritten as:

x(k) = L^T L x(k−1)  (3)

y(k) = L L^T y(k−1)  (4)
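Equations 1 and 2 amount to a power iteration, which can be sketched as follows (the `hits` function and the toy graph are illustrative, not the paper's implementation; scores are L2-normalized each round to keep the iteration bounded):

```python
# A minimal HITS power iteration on an adjacency matrix L:
# x(k) = L^T y(k-1), y(k) = L x(k), normalized each round.
def hits(L, iters=50):
    m, n = len(L), len(L[0])
    x = [1.0] * n                      # authority scores
    y = [1.0] * m                      # hub scores
    for _ in range(iters):
        x = [sum(L[i][j] * y[i] for i in range(m)) for j in range(n)]
        y = [sum(L[i][j] * x[j] for j in range(n)) for i in range(m)]
        nx = sum(v * v for v in x) ** 0.5 or 1.0
        ny = sum(v * v for v in y) ** 0.5 or 1.0
        x = [v / nx for v in x]
        y = [v / ny for v in y]
    return x, y

# Page 2 is pointed to by both other pages, so it gets top authority;
# page 0 links to the most pages, so it gets top hub score.
L = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
auth, hub = hits(L)
print(auth, hub)
```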

Table 8. Selected Top 5 active rules using bio fingerprint.

https://doi.org/10.1371/journal.pone.0051018.t008

5. PageRank

In PageRank, given x = (x1,x2,…,xn)^T, xi is the PageRank of page i; the recursive PageRank equation is defined in matrix notation as:

x = P^T x  (5)

where P = (Pij) is a stochastic matrix with Pij = Lij/oi, so that every row of P sums to 1. P^T can be expressed as:

P^T = L^T Dout^−1  (6)

If both the link-tracking jump and the link-interrupt jump are considered, the full transition probability can be written as:

x = [α L^T Dout^−1 + (1 − α)(1/n) e e^T] x  (7)

where α is the damping factor, ranging from 0 to 1.
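Equation 7 can be run as a power iteration as well. The sketch below is a standard damped PageRank; spreading the mass of dangling pages (zero out-degree) uniformly is a common convention and an assumption here, since the paper does not discuss dangling nodes:

```python
# Power iteration for x = [a * L^T Dout^{-1} + (1-a)/n * E] x,
# where a is the damping factor and E the all-ones matrix.
def pagerank(L, a=0.85, iters=100):
    n = len(L)
    out = [sum(row) for row in L]
    x = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for j in range(n):
            link = sum(x[i] * L[i][j] / out[i] for i in range(n) if out[i])
            dangle = sum(x[i] for i in range(n) if not out[i]) / n
            new.append(a * (link + dangle) + (1 - a) / n)
        x = new
    return x

pr = pagerank([[0, 1, 1],
               [0, 0, 1],
               [1, 0, 0]])
print(pr)  # page 2, linked to by both other pages, ranks highest
```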

If, as in SALSA, the web graph is transformed into a bipartite graph, the above x becomes the authority score and the hub score y can be defined as:

y = Dout^−1 L x  (8)

x = Din^−1 L^T y  (9)

Figure 5. The connections between chemical features and cell lines.

(A red dot means a connection to active; a green solid dot a connection to inactive; light gray means features associated with each other. Purple: non-small cell lung; Red: renal; Pink: breast cancer; Green: ovarian; Light blue: melanoma.)

https://doi.org/10.1371/journal.pone.0051018.g005

Comparing the equations between HITS and PageRank (equation 1 & 2 versus 5 & 8), it is possible that a unified framework can be derived to combine advantages from both HITS and PageRank.

6. Unified Framework

We define L^T L in equation 3 and P^T in equation 5 as the authority operation Aop, and L L^T in equation 4 and P in equation 8 as the hub operation Hop. The critical component of the framework is the definition of the new Aop and Hop. Ding's implementations of Aop and Hop [34] are used here, since they generalize the features of HITS and PageRank and combine them together.

Chen's model [35] divides the web pages into homogeneous and heterogeneous systems, so that the authority and hub scores contain the reinforcement of links from both systems. Different weights can be assigned to the homogeneous or heterogeneous systems to adjust the importance of their links in the final ranking. Similarly, in our case the nodes, such as compounds, are classified as active/inactive or positive/negative, so the dataset is converted to a heterogeneous system. Relatively higher weight values can be assigned to the active/positive compounds to promote their importance in the final feature weighting.

Our link-based framework can be written as follows, where a represents the "active" system and b the "inactive" system.(10)(11)(12)

β is a class factor ranging from 0 to 1 (where Aop or Hop involves β or (1 − β), it is replaced by its square root). It affects the accuracy and size of the classifiers, as well as the rules they contain. Generally, in order to assign higher weight values to active/positive compounds, β can be any value greater than 0.5; in our study, β is set to 0.9.

Based on the comparison of implementations in [34], the following definitions of Aop and Hop are used.

(13)(14)
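Since equations 10 to 14 depend on Ding's specific Aop and Hop operators, the sketch below only illustrates the general idea of the class factor β: feature scores accumulate contributions propagated from the active and inactive systems, weighted by β and (1 − β). The function and data are hypothetical, not the paper's exact algorithm:

```python
# Illustrative beta-weighted score propagation on a heterogeneous
# bipartite system (a simplified stand-in for equations 10-14).
def feature_scores(active_links, inactive_links, n_features, beta=0.9, iters=30):
    """active_links/inactive_links: one feature-index list per compound."""
    w = [1.0] * n_features
    for _ in range(iters):
        # compound score = mean weight of its features (hub-like step)
        def compound_scores(links):
            return [sum(w[f] for f in fs) / len(fs) for fs in links]
        ca = compound_scores(active_links)
        ci = compound_scores(inactive_links)
        # feature score = beta * active contribution + (1-beta) * inactive
        new = [0.0] * n_features
        for score, fs in zip(ca, active_links):
            for f in fs:
                new[f] += beta * score
        for score, fs in zip(ci, inactive_links):
            for f in fs:
                new[f] += (1 - beta) * score
        norm = max(new) or 1.0
        w = [v / norm for v in new]
    return w

# Feature 0 occurs only in active compounds, feature 2 only in inactive,
# so with beta = 0.9 feature 0 ends up weighted highest.
w = feature_scores([[0, 1], [0]], [[2, 1]], 3)
print(w)
```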

7. Associative Classification Mining

Let F = {f1, f2, …, fn} be a set of n distinct features and C a list of classes {c1, c2, …, cm}. D is a transaction dataset over F and C. Each transaction/compound ti contains a set of items f1, f2, …, fk ∈ F (such a set of items is also called an itemset) and a class cj ∈ C. A classification association rule (CAR) is an implication of the form X → Y, where X ⊆ F and Y ∈ C. The support of the rule is the probability of transactions having both X and Y among all cases. An itemset is frequent only if its support satisfies a minimum support θ. The confidence of the rule is defined as the support of X and Y divided by the support of X, i.e. the conditional probability that Y holds given X. The process of discovering, pruning, ranking and selecting CARs and applying them to classification is called associative classification.

8. Weighted Associative Classification Mining

For weighted associative classification (WAC) [15]–[17], each feature fi is associated with a weight wi ∈ W = {w1, w2, …, wn}. A pair (fi, wi) is called a weighted item. Each transaction/compound is a set of weighted items plus the class type. The straightforward definition of the itemset weight is:

W(is) = (Σfi∈is wi) / |is|  (15)

where W(is) is the weight of the itemset is. The weighted support of an itemset, WS(is), is:

WS(is) = W(is) × |S| / |T|  (16)

where T is the set of all transactions and S is the set of transactions containing the itemset. In classical associative classification, differences in the significance of items are not taken into account, and it is assumed that if an itemset is frequent then all of its subsets are frequent as well; this principle is called the downward closure property (DCP). Given the compounds C1–C6, their features and the feature weights (Tables 1 & 2), if the itemset {81, 83, 84} is frequent, then all its subsets {81}, {83}, {84}, {81, 83}, {81, 84} and {83, 84} must all be frequent. In WAC, however, under the definitions above (equations 15 & 16), the DCP does not hold: an itemset may be frequent even though some of its subsets are not, as the following example illustrates (minimum weighted support = 0.3). As shown in Table 3, the weighted supports of {83, 84} and {81, 83} are both 0.27, so they are not frequent.
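The DCP violation can be demonstrated concretely. The weights and transactions below are made up (they are not the paper's Tables 1–3), but they show the naive weighted support admitting a frequent itemset whose subsets are infrequent at a 0.3 threshold:

```python
# Demonstration of how the naive weighted support (itemset weight
# times support) breaks the downward closure property. One heavy
# item (81) lifts supersets containing it above the threshold even
# when its light co-items fall below it on their own.
weights = {81: 1.0, 83: 0.2, 84: 0.2}
transactions = [{81, 83, 84}] * 5 + [{83, 84}]

def itemset_weight(itemset):          # mean item weight
    return sum(weights[i] for i in itemset) / len(itemset)

def weighted_support(itemset):        # W(is) * support
    s = sum(1 for t in transactions if itemset <= t)
    return itemset_weight(itemset) * s / len(transactions)

print(weighted_support({81, 83, 84}))  # above 0.3: "frequent"
print(weighted_support({83, 84}))      # below 0.3: "infrequent" subset
```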

Several frameworks have been proposed to maintain the DCP [15]–[22], [25]. Before introducing the framework, we define the transaction weight as:

W(t) = (Σfi∈t wi) / |t|  (17)

where t is a transaction. We then define the adjusted weighted support (AWS) as:

AWS(X → Y) = Σt∈S W(t) / Σt∈T W(t)  (18)

where S and T are the same as above. This definition ensures that if X ⊆ Y then AWS(X) ≥ AWS(Y), since any transaction containing Y will also contain X. By using the AWS, the DCP is not violated. The discovered association rules are ranked, evaluated and pruned using the CBA approach [5]. The algorithm of the PageRank-based associative classification is given in Figures 2 & 3.
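Weighting whole transactions instead of itemsets restores monotonicity: the transactions covering an itemset are a superset of those covering any of its supersets, so AWS can only shrink as items are added. A sketch on the same kind of made-up data as before:

```python
# Adjusted weighted support: transaction-level weights keep the
# downward closure property by construction.
weights = {81: 1.0, 83: 0.2, 84: 0.2}
transactions = [{81, 83, 84}] * 5 + [{83, 84}]

def transaction_weight(t):            # mean item weight of a transaction
    return sum(weights[i] for i in t) / len(t)

def aws(itemset):
    covered = sum(transaction_weight(t) for t in transactions if itemset <= t)
    total = sum(transaction_weight(t) for t in transactions)
    return covered / total

# Downward closure: a subset is at least as frequent as its superset.
assert aws({83, 84}) >= aws({81, 83, 84})
print(aws({81, 83, 84}), aws({83, 84}))
```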

All computations are carried out on a PC with a 2.4 GHz Q6600 CPU and 6 GB of memory running the Windows 7 64-bit operating system. The classifier is implemented in C#. To explore all possible rules, the mining is performed with the following settings: MinSup = 20% and MinConf = 70% for the Ames dataset; MinSup = 1% and MinConf = 0% for the NCI-60 dataset. In all experiments, the maximum length of the rules is set to 4 and the maximum number of candidate frequent itemsets is 200,000. On the Ames dataset, the SVM and RELIEF weighting methods are applied for comparison; both are computed using RapidMiner 5.1 [42].

9. Model Assessment and Evaluation

The classification performance is assessed using 10-fold cross validation (CV), because this approach not only provides a reliable assessment of classifiers but also generalizes well to new data. The accuracy of classification can be determined by evaluation methods such as error rate, recall-precision, any-label and label-weight, etc. The measure used here is computed as the ratio of the number of successfully classified cases to the total number of cases in the test dataset, a method widely adopted in the assessment of CBA [5], CPAR [42] and CMAR [4].

Results and Discussion

1. Comparison of Feature Weight and Rank

The comparison is performed on the Ames dataset. For Ames dataset mining, identifying features that are good indicators of "positive" compounds is preferable, so "positive" is treated here as "active". The weights generated by LAC are compared to those generated by the frequency of the bits, SVM and RELIEF. Figure 4 shows that the results of RELIEF and SVM are very similar. To confirm this, a correlation analysis is performed in SPSS 19 [43]. Table 4 shows that, at the 0.01 level (2-tailed), SVM is highly correlated with RELIEF and LAC with frequency, with coefficients of 0.949 and 0.958 respectively. The coefficients of SVM, RELIEF and LAC with frequency are all greater than 0.75, indicating that all are correlated with frequency; among them, LAC has the strongest correlation (0.947) with frequency. This is mainly caused by bits 3, 8, 11, 36 and 166. For bits 3, 8 and 11, since their frequencies are not 0, both LAC and frequency assign small weight values, while SVM and RELIEF set their weights to 0. Conversely, the weights of bits 36 and 166 are set to 0 by LAC and frequency but not by SVM and RELIEF. The correlation of LAC with frequency can be explained by the principle of link-based weighting, mutual reinforcement. As expected, the ranks and weights of features differ between LAC and frequency. In Table 5, all features are ordered by ascending weight: 69 features (bold) are promoted and 61 features (*) are demoted, while the rest remain unchanged in LAC. Generally, higher frequency leads to higher "authority" and thus bigger weight (Figure 4). For example, bit 135 has high weight in both frequency and LAC, while bits 127 and 141 are much bigger in LAC (red data label) than in frequency (black data label), since most of their connections are to "active" compounds (58.6% and 56.6% respectively). Table 5 also gives the rank of the features in each scheme; the bigger the number, the higher the rank and the more important the feature. Some features (bold) with a relatively low rank in frequency may get higher ranks through promotion from connecting to compounds with higher "rank" values; likewise, features (*) connected to many "bad" compounds may be demoted. The promotion or demotion depends on the number and type of a feature's connections.

2. Comparison of Accuracy of Classification

The average accuracies of frequency, LAC, RELIEF, SVM and CBA are 90.11%, 91.57%, 89.05%, 89.26% and 90.63% respectively (Table 6). The major purpose of WACM is to find more rules containing interesting items, in other words items with higher significance, while still achieving high accuracy. Most current comparisons of performance between WARM and traditional ARM focus on time and space scalability, such as the number of frequent items, the number of interesting rules, execution time and memory usage [18]–[20], [43]–[45]; their results show that the differences between WARM and ARM are minor. Comparisons of WACM and traditional ACM are scant, owing to the lack of easily accessible weighted associative classifiers. Soni et al. [46] compared their WACM results with those generated by the traditional ACM methods CBA [5], CMAR [4] and CPAR [47] on three biomedical datasets, and their results showed that WACM offered the highest average accuracy. In our study, among all four weighting schemes and CBA, LAC has the highest accuracy.

3. Comparison of Classifiers

Ten models are generated for each weighting scheme, and we are interested in the comparison between the classifiers of CBA and LAC. Taking model 1 as an example, there are 30 rules in the frequency classifier and 132 in that of LAC; 14 rules are exclusive to the frequency classifier, 116 appear only in the LAC classifier, and 16 rules are shared by both. Table 7 shows that among the top 20 rules, 11 are shared by both classifiers, 9 rules (*) appear only in the frequency classifier, and the remaining rules (bold) are not included in the frequency classifier. All rules are ordered according to the CBA definition. During classification, matching of a new compound starts from the first rule and stops immediately once there is a hit. As a result, although 11 rules are in both classifiers, they may have different impacts on the final classification result.

4. Rule Interpretation

Our recently submitted paper [48] showed that rules generated by associative classification based on chemical fingerprints and properties can be interpreted with chemical knowledge and shed light on molecule design. In this study, we focus on the analysis of association rules generated by LAC using the bio fingerprint (NCI-60 dataset); the analysis of those generated by frequency can be done in the same manner. The accuracies of both frequency and LAC are 99.93% (Table 6) and the average size of the classifier is around 350 rules.

For all ten models, the top 5 rules are the same but with different order, support and confidence. The intuitive explanation of rule 1 in Table 8 is that if a compound is inactive against MCF7 and HL60 (TB), it will be inactive against T47D as well. The adjusted weighted support of this rule is 29.1% and the weighted confidence is 95.9%; among the 5,937 compounds, 1,730 are covered by this rule. The cell lines in the top 5 rules fall into two categories: a) breast cancer and b) leukemia. On the one hand, this means there are many compounds that are inactive against both breast cancer and leukemia cell lines; on the other hand, it suggests that there might be associations between these two types of cancer. References [49], [50] clustered the cell lines based on their gene expression data, and their results also indicated that the cell lines in these two categories were clustered together or that their clusters were very close to each other. The association of MCF7 and T47D is not surprising, as they belong to the same category, breast cancer. The rules here may also point toward the drug resistance of breast cancer and leukemia. References [50]–[52] discovered a novel ABC transporter, termed breast cancer resistance protein (BCRP) because of its identification in MCF-7 human breast carcinoma cells; drug-sensitive cells become drug-resistant after transfection or overexpression of BCRP. They also found relatively high expression of BCRP mRNA in around 30% of acute myeloid leukemia (AML) cases, suggesting a novel mechanism of drug resistance in leukemia.

A hybrid feature set integrating the chemical fingerprint and the bio fingerprint is generated by combining the MDL public keys and the bio fingerprint. Since we are only interested in compounds that are active against tumor cell lines, an "inactive" bioassay value is treated as the corresponding feature being "not present" in the compound. This also helps to treat the chemical fingerprint and the bio fingerprint equally.

The average accuracy of the classification is 99.7% (Table 6). Each rule in the final classifier, for example (A, B → Active), is converted to (A associated with Active) and (B associated with Active). All rules are transformed in this way and plotted with Cytoscape 2.8.2 [53]. For clarity, nodes with degree less than 10 are removed. Figure 5 shows that, in general, compounds active against MDA-MB-231/ATCC, TK-10, OVCAR-4, UACC-257, HOP-92, EKVX and NCI-H226 will also be active against T-47D. The chemical features bit 46 (Br), 51 (CSO), 58 (QSQ), 65 (CN), 127 and 111 (NACH2A) are related to activity or inactivity depending on which other features they co-occur with; the other features are mainly related to inactivity.
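The rule-to-edge conversion described here is a simple flattening of each rule's antecedent. The sketch below uses placeholder feature names rather than the paper's actual rules:

```python
# Convert multi-item rules like (A, B -> Active) into pairwise
# (item, class) edges suitable for network plotting.
def rules_to_edges(rules):
    """rules: iterable of (antecedent_items, consequent) pairs."""
    edges = []
    for antecedent, consequent in rules:
        for item in antecedent:
            edges.append((item, consequent))
    return edges

edges = rules_to_edges([(("bit46", "MCF7_active"), "Active"),
                        (("bit65",), "Inactive")])
print(edges)
```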

The top 2 rules in the classifier indicate that compounds containing phosphorus and active against MCF7 or SK-MEL-2 will be active against T-47D too (Table 9). 22 out of 23 compounds match both rules 1 and 2. Among them, the once-abandoned drug NSC 280594 (triciribine) has attracted much attention and entered a phase I trial because of its potential activity against a common cancer-causing protein [53]–[55]. These rules suggest that phosphorus might be an important chemical feature for anti-cancer drugs.

Conclusions

In this paper, we describe a novel link-based feature weighting framework for datasets without pre-assigned weight information. The algorithm employs a unified framework that integrates the advantages of HITS and PageRank (mutual reinforcement and normalized weights) to derive useful weights, exploiting both connectivity and connection-type information. Combined with a weighted support scheme, it offers an effective way to find useful associations by taking into account both the significance of occurrence and the quality of features, the latter being captured through the features' connections to the transactions.
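The mutual reinforcement with normalization can be illustrated on a transaction-feature bipartite graph. The following is a minimal PageRank-style sketch, not the paper's exact unified HITS/PageRank formulation: transaction and feature scores reinforce each other through the incidence matrix, and the damping term keeps the scores normalized.

```python
import numpy as np

def link_based_weights(M, d=0.85, iters=50):
    """Bipartite PageRank-style feature weighting (illustrative sketch).

    M: (n_transactions, n_features) binary incidence matrix,
       M[t, f] = 1 if transaction t contains feature f.
    """
    n_t, n_f = M.shape
    t = np.full(n_t, 1.0 / n_t)   # transaction scores
    f = np.full(n_f, 1.0 / n_f)   # feature scores
    # each transaction spreads its score evenly over its features,
    # each feature over its transactions (guard against empty rows/cols)
    row = M / np.maximum(M.sum(axis=1, keepdims=True), 1)
    col = M / np.maximum(M.sum(axis=0, keepdims=True), 1)
    for _ in range(iters):
        f = (1 - d) / n_f + d * (row.T @ t)   # features gain from transactions
        t = (1 - d) / n_t + d * (col @ f)     # transactions gain from features
    return f / f.sum()                        # normalized feature weights
```

With this scheme a feature appearing in many transactions, or in transactions that are themselves well-connected, receives a higher weight, which is the mutual-reinforcement behavior the framework relies on.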

Based on this new weighting scheme, a CBA-based classifier, LAC, is developed. The classifier is applied to two cases: a chemical-fingerprint-featured dataset and a bio-fingerprint-featured dataset. Our experimental results show that although the weighting differs from that of the traditional RELIEF and SVM methods, it is able to capture the important features and yields good results. In particular, for sparse datasets this link-based analysis can discover significant features that would be ignored by other methods.

The link-based classifier discovers interesting associations between bioactivities and chemical features, as well as potential relationships among diseases, for instance the relationship between phosphorus and bioactivity against T47D, and a potential relationship between breast cancer and leukemia. Our next step is to apply this method to large semantic datasets to mine information from RDF resources such as ChEMBL [56] and KEGG [57].

Acknowledgments

We thank Prof. Bauckhage from Fraunhofer IAIS for the discussion of PageRank application on bipartite graphs. We thank all anonymous reviewers for their positive and constructive comments.

Author Contributions

Conceived and designed the experiments: PLY DW. Performed the experiments: PLY. Analyzed the data: PLY DW. Wrote the paper: PLY DW.

References

  1. Thabtah F, Cowling P, Peng Y (2005) MCAR: multi-class classification based on association rule. Proceedings of the ACS/IEEE 2005 International Conference on Computer Systems and Applications: IEEE Computer Society. pp. 127–133.
  2. Bouzouita I, Elloumi S, Yahia S (2006) GARC: A New Associative Classification Approach. Data Warehousing and Knowledge Discovery. pp. 554–565.
  3. Thabtah F (2007) A review of associative classification mining. Knowledge Engineering Review 22: 37–65.
  4. Li W, Han J, Pei J (2001) CMAR: accurate and efficient classification based on multiple class-association rules. Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on. pp. 369–376.
  5. Liu B, Hsu W, Ma Y (1998) Integrating Classification and Association Rule Mining. KDD ’98: 80–86.
  6. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. SIGMOD Rec 22: 207–216.
  7. Becquet C, Blachon S, Jeudy B, Boulicaut J-F, Gandrillon O (2002) Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biology 3: 1–16.
  8. Zuo J, Tang C, Zhang T (2002) Mining Predicate Association Rule by Gene Expression Programming. Advances in Web-Age Information Management. In: Meng X, Su J, Wang Y, editors: Springer Berlin/Heidelberg. pp. 281–294.
  9. Creighton C, Hanash S (2003) Mining gene expression databases for association rules. Bioinformatics 19: 79–86.
  10. Carmona-Saez P, Chagoyen M, Rodriguez A, Trelles O, Carazo J, et al. (2006) Integrated analysis of gene expression by association rules discovery. BMC Bioinformatics 7: 54.
  11. Martinez R, Pasquier N, Pasquier C (2008) GenMiner: mining non-redundant association rules from integrated gene expression data and annotations. Bioinformatics 24: 2643–2644.
  12. Park S, Reyes J, Gilbert D, Kim J, Kim S (2009) Prediction of protein-protein interaction types using association rule based classification. BMC Bioinformatics 10: 36.
  13. Leung K-S, Wong K-C, Chan T-M, Wong M-H, Lee K-H, et al. (2010) Discovering protein–DNA binding sequence patterns using association rule mining. Nucleic Acids Research 38: 6324–6337.
  14. MacDonald NJ, Beiko RG (2010) Efficient learning of microbial genotype–phenotype association rules. Bioinformatics 26: 1834–1840.
  15. Cai CH, Fu AWC, Cheng CH, Kwong WW (1998) Mining association rules with weighted items. Database Engineering and Applications Symposium, 1998. Proceedings IDEAS ’98 International. pp. 68–77.
  16. Tao F, Murtagh F, Farid M (2003) Weighted Association Rule Mining using weighted support and significance framework. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. Washington, D.C.: ACM. pp. 661–666.
  17. Wang W, Yang J, Yu P (2004) WAR: Weighted Association Rules for Item Intensities. Knowledge and Information Systems 6: 203–229.
  18. Khan MS, Muyeba M, Coenen F (2008) Weighted Association Rule Mining from Binary and Fuzzy Data. Proceedings of the 8th industrial conference on Advances in Data Mining: Medical Applications, E-Commerce, Marketing, and Theoretical Aspects. Leipzig, Germany: Springer-Verlag. pp. 200–212.
  19. Kumar P, Ananthanarayana VS (2010) Discovery of weighted association rules mining. Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on. pp. 718–722.
  20. Muyeba M, Khan MS, Coenen F (2009) Fuzzy Weighted Association Rule Mining with Weighted Support and Confidence Framework. In: Sanjay C, Takashi W, Shin-Ichi M, Shusaku T, Takashi O, et al., editors. New Frontiers in Applied Data Mining: Springer-Verlag. pp. 49–61.
  21. Ramkumar GD, Sanjay R, Tsur S (1998) Weighted Association Rules: Model and Algorithm. Proc Fourth ACM Int’l Conf Knowledge Discovery and Data Mining.
  22. Soni S, Pillai J, Vyas OP (2009) An associative classifier using weighted association rule. Nature & Biologically Inspired Computing, 2009. NaBIC 2009 World Congress on. pp. 1492–1496.
  23. Jankowski N, Usowicz K (2011) Analysis of Feature Weighting Methods Based on Feature Ranking Methods for Classification. Neural Information Processing. In: Lu B-L, Zhang L, Kwok J, editors: Springer Berlin/Heidelberg. pp. 238–247.
  24. Qian-Cheng W, Ng WWY, Chan PPK, Yeung DS (2010) Feature weighting based on L-GEM. Machine Learning and Cybernetics (ICMLC), 2010 International Conference on. pp. 220–224.
  25. Sun K, Bai F (2008) Mining Weighted Association Rules without Preassigned Weights. IEEE Trans on Knowl and Data Eng 20: 489–495.
  26. Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46: 604–632.
  27. Page L, Brin S, Motwani R, Winograd T (1999) The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab.
  28. Kazius J, McGuire R, Bursi R (2004) Derivation and Validation of Toxicophores for Mutagenicity Prediction. Journal of Medicinal Chemistry 48: 312–320.
  29. Meghabghab G, Kandel A (2008) PageRank Algorithm Applied to Web Graphs. Search Engines, Link Analysis, and User’s Web Behavior. Springer Berlin/Heidelberg. pp. 69–81.
  30. Bauckhage C (2008) Image Tagging Using PageRank over Bipartite Graphs. Proceedings of the 30th DAGM symposium on Pattern Recognition. Munich, Germany: Springer-Verlag. pp. 426–435.
  31. Farahat A, LoFaro T, Miller JC, Rae G, Ward LA (2006) Authority Rankings from HITS, PageRank, and SALSA: Existence, Uniqueness, and Effect of Initialization. SIAM Journal on Scientific Computing 27: 1181–1201.
  32. Lempel R, Moran S (2001) SALSA: the stochastic approach for link-structure analysis. ACM Trans Inf Syst 19: 131–160.
  33. Lempel R, Moran S (2000) The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks 33: 387–401.
  34. Ding C, He X, Husbands P, Zha H, Simon HD (2002) PageRank, HITS and a unified framework for link analysis. Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. Tampere, Finland: ACM. pp. 353–354.
  35. Chen Z, Tao L, Wang J, Wenyin L, Ma W-Y (2002) A Unified Framework for Web Link Analysis. Proceedings of the 3rd International Conference on Web Information Systems Engineering: IEEE Computer Society. pp. 63–72.
  36. Hansen K, Mika S, Schroeter T, Sutter A, ter Laak A, et al. (2009) Benchmark Data Set for in Silico Prediction of Ames Mutagenicity. Journal of Chemical Information and Modeling 49: 2077–2081.
  37. Cheng T, Li Q, Wang Y, Bryant SH (2011) Binary Classification of Aqueous Solubility Using Support Vector Machines with Reduction and Recombination Feature Selection. Journal of Chemical Information and Modeling 51: 229–236.
  38. Weill N, Rognan D (2009) Development and Validation of a Novel Protein−Ligand Fingerprint To Mine Chemogenomic Space: Application to G Protein-Coupled Receptors and Their Ligands. Journal of Chemical Information and Modeling 49: 1049–1062.
  39. Fliri AF, Loging WT, Thadeio PF, Volkmann RA (2005) Biological spectra analysis: Linking biological activity profiles to molecular structure. Proceedings of the National Academy of Sciences of the United States of America 102: 261–266.
  40. Fliri AF, Loging WT, Thadeio PF, Volkmann RA (2005) Biospectra Analysis: Model Proteome Characterizations for Linking Molecular Structure and Biological Response. Journal of Medicinal Chemistry 48: 6918–6925.
  41. Cheng T, Li Q, Wang Y, Bryant SH (2011) Identifying Compound-Target Associations by Combining Bioactivity Profile Similarity Search and Public Databases Mining. Journal of Chemical Information and Modeling 51: 2440–2448.
  42. Yin X, Han J (2003) CPAR: Classification based on Predictive Association Rules. Proceedings of SDM 2003: SIAM. pp. 331–335.
  43. Bingzheng W, Yuanpan Z, Feng G (2011) Mining weighted closed itemsets directly for association rules generation under weighted support framework. Communication Software and Networks (ICCSN), 2011 IEEE 3rd International Conference on. pp. 145–149.
  44. Tseng VS, Wu C-W, Shie B-E, Yu PS (2010) UP-Growth: an efficient algorithm for high utility itemset mining. KDD ’10: 253–262.
  45. Li G-y, Hu Q-b (2011) A Framework for Weighted Association Rule Mining from Boolean and Fuzzy Data. Internet Technology and Applications (iTAP), 2011 International Conference on. pp. 1–4.
  46. Soni S, Vyas OP (2011) Performance Evaluation of Weighted Associative Classifier in Health Care Data Mining and Building Fuzzy Weighted Associative Classifier. Advances in Parallel Distributed Computing. In: Nagamalai D, Renault E, Dhanuskodi M, editors: Springer Berlin Heidelberg. pp. 224–237.
  47. Yin X, Han J (2003) CPAR: Classification based on Predictive Association Rules. SDM 2003: SIAM. pp. 331–335.
  48. Yu P, Wild DJ (2013) Fast Rule-Based Bioactivity Prediction Using Associative Classification Mining. Journal of Cheminformatics. In press.
  49. Marx KA, O’Neil P, Hoffman P, Ujwal ML (2003) Data Mining the NCI Cancer Cell Line Compound GI50 Values: Identifying Quinone Subtypes Effective Against Melanoma and Leukemia Cell Classes. Journal of Chemical Information and Computer Sciences 43: 1652–1667.
  50. Ross DD, Karp JE, Chen TT, Doyle LA (2000) Expression of breast cancer resistance protein in blast cells from patients with acute leukemia. Blood 96: 365–368.
  51. Gottesman MM, Fojo T, Bates SE (2002) Multidrug resistance in cancer: role of ATP-dependent transporters. Nat Rev Cancer 2: 48–58.
  52. van der Kolk DM, Vellenga E, Scheffer GL, Müller M, Bates SE, et al. (2002) Expression and activity of breast cancer resistance protein (BCRP) in de novo and relapsed acute myeloid leukemia. Blood 99: 3763–3770.
  53. Garrett C, Coppola D, Wenham R, Cubitt C, Neuger A, et al. (2011) Phase I pharmacokinetic and pharmacodynamic study of triciribine phosphate monohydrate, a small-molecule inhibitor of AKT phosphorylation, in adult subjects with solid tumors containing activated AKT. Investigational New Drugs 29: 1381–1389.
  54. Evangelisti C, Ricci F, Tazzari P, Chiarini F, Battistelli M, et al. (2011) Preclinical testing of the Akt inhibitor triciribine in T-cell acute lymphoblastic leukemia. Journal of Cellular Physiology 226: 822–831.
  55. Yang L, Dan HC, Sun M, Liu Q, Sun X-m, et al. (2004) Akt/Protein Kinase B Signaling Inhibitor-2, a Selective Small Molecule Inhibitor of Akt Signaling with Antitumor Activity in Cancer Cells Overexpressing Akt. Cancer Research 64: 4394–4399.
  56. ChEMBL. https://www.ebi.ac.uk/chembldb/ (accessed January 20, 2009).
  57. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M (2011) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research: 1–6.