Non-Parametric Change-Point Method for Differential Gene Expression Detection

  • Yao Wang,

    Affiliation Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Jilin, China

  • Chunguo Wu,

    Affiliations Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Jilin, China, Business School, University of Shanghai for Science and Technology, Shanghai, China, Department of Modern Physics, University of Science and Technology of China, Hefei, China

  • Zhaohua Ji,

    Affiliations Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Jilin, China, Inner Mongolia Xing'an Vocational & Technical College, Wulanhaote, China

  • Binghong Wang,

    Affiliations Business School, University of Shanghai for Science and Technology, Shanghai, China, Department of Modern Physics, University of Science and Technology of China, Hefei, China

  • Yanchun Liang

    ycliang@jlu.edu.cn

    Affiliation Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Jilin, China

Abstract

Background

We propose a non-parametric method, the Non-Parametric Change Point Statistic (NPCPS), which uses a single equation to detect differential gene expression (DGE) in microarray data. NPCPS builds on change-point theory to provide effective DGE detection.

Methodology

NPCPS uses the data distribution of the normal samples as input and detects DGE in the cancer samples by locating the change point of the gene expression profile. The estimate of the change-point position generated by NPCPS identifies the samples containing DGE. Monte Carlo simulation and ROC analysis were applied to examine the detection accuracy of NPCPS, and an experiment on real breast-cancer microarray data was carried out to compare NPCPS with other methods.

Conclusions

The simulation study indicated that NPCPS was more effective at detecting DGE in a cancer subset than five parametric methods and one non-parametric method. When there were more than 8 cancer samples containing DGE, the type I error of NPCPS was below 0.01. Experimental results showed both good accuracy and reliability of NPCPS. Out of the 30 top genes ranked by NPCPS, 16 were reported as relevant to cancer. Correlations between the detecting result of NPCPS and the compared methods were less than 0.05, while between the other methods the values ranged from 0.20 to 0.84. This indicates that NPCPS works on different features and thus provides DGE identification from a perspective distinct from the other mean- or median-based methods.

Introduction

Exposure to radiation, virus infection, etc. can disturb normal gene expression and cause gene mutation or abnormal gene activation, which may lead to cancer [1]. There are observable differences between cancer and normal tissues in their expression values at the single-gene level, which enables the recognition of cancer-related genes from a statistical perspective.

Based on microarray gene expression profiling [2], many methods have been reported that aim to detect such differences in gene expression, commonly called differential gene expression (DGE) [3], [4]. Among these methods, the t-statistic is a classical and widely used DGE detection method, which works on the hypothesis that all the cancer samples are over-expressed compared with the normal samples [5]. Other work has also presented meaningful results, such as the empirical Bayes approach [6] (Efron 2001), the mixture model approach [7] (Pan, 2003), and SAM [8] (Storey 2003). However, considering the heterogeneity of gene activation, it is reasonable to assume that DGE may take place in only a subset of cancer samples. Many methods have been proposed to solve DGE detection under this assumption, such as PPST (permutation percentile separability test) [9] (Lyons-Weiler, 2004), COPA (cancer outlier profile analysis) [10], [11], OS (outlier sum) [12] (Tibshirani, 2007), ORT (outlier robust t-statistics) [13] (Wu, 2007), and MOST (maximum ordered subset t-statistics) [14] (Lian, 2008).

Most of the aforementioned methods attempt to identify the abnormal data points based on the overall percentile of the gene expression profile. However, it is reasonable to assume that DGE detection could be achieved by searching for the change point of the gene expression profile. If we consider the single-gene expression profile as a data sequence, then for a non-DGE sequence there is no significant change between the data distributions of normal and cancer samples; for a DGE sequence, since the gene expression is over-regulated in the cancer group, the data distributions of cancer and normal samples become distinctly different, which would result in a significant change point in the sequence of gene expression profiles.

The change-point problem [15] has been widely studied in many fields, such as atmospheric and financial analysis. There are also applications of change-point theory to microarray analysis, for example, a change-point detection model for genomic sequences of continuous measurements [16], the ARTIVA formalism for topology inference of regulatory networks [17], and a Bayesian model for DGE patterns of the DosR regulon of Mycobacterium tuberculosis in the timing of gene induction [18]. With respect to DGE analysis, there are BRIDGE (Bayesian robust inference for differential gene expression) for DGE detection in microarrays with small sample sizes [19], and the DGE detecting method LRS (likelihood ratio test) [20] (Hu, 2008).

Since few of the currently available change-point methods deal explicitly with estimation of the number and location of change points, and these methods may moreover be somewhat vulnerable to deviations from the model assumptions usually employed [16], we propose a non-parametric statistical method for DGE detection, named NPCPS (Non-Parametric Change Point Statistics). NPCPS is based on a modified Kolmogorov statistic that detects a single change point in a data sequence [21]. The method compares the data distributions of the normal and cancer groups to detect the existence of a possible change point in the cancer group and to estimate its position. Moreover, as a non-parametric inferential method, NPCPS makes no assumptions about the probability distributions of the variables being assessed; accordingly, it is not necessary to normalize the microarray data before calculating the test statistic, as parametric methods usually require. For comparison, we tested several percentile-based methods and LRS. BRIDGE was not included because it was originally designed for the two-sample problem and its application to larger sample sizes is computationally heavy. NPCPS works comfortably with large-scale datasets, and both simulation and experimental results show that NPCPS is effective for DGE detection.

Methods

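Suppose $X_1, X_2, \ldots, X_n$ are independent random variables, each with a cumulative distribution function, and $r$ is the change point of the sequence. Then, for the distribution function $F_1$ of $X_1, \ldots, X_r$ and $F_2$ of $X_{r+1}, \ldots, X_n$, there exists a value $x$ that satisfies $F_1(x) \neq F_2(x)$. Otherwise, if $r$ is not the change point, we have $F_1(x) = F_2(x)$ for all $x$. The change point is also noted as (1)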

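The gene expression profile of a single gene can also be considered as a sequence of independent variables as below: $X_1, \ldots, X_{n_1}, X_{n_1+1}, \ldots, X_{n_1+n_2}$ (2) Here, $X_1, \ldots, X_{n_1}$ contains the expression values of the normal samples, with known distribution function $F_1$, and $X_{n_1+1}, \ldots, X_{n_1+n_2}$ contains the expression values of the disease samples. Over- or under-expressed values in the disease samples would result in a change point in the sequence.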

To detect the change point, in the hypothesis test we used a modified Kolmogorov statistic (K-statistic), which evaluates the distance between two distribution functions:(3) is the empirical distribution function of , defined as:(4)where is an indicator function. is the inverse function of defined as(5)where y is a variable increasing with a fixed step that is subject to user's selection (we selected 100 in the simulation study and experiment).

Therefore, the testing procedure is defined as(6)where means round toward negative infinity.

The null hypothesis $H_0$ is true when $|D_n| \leq C(\alpha)$, i.e. no change point is detected; the alternative hypothesis $H_1$ is true when $|D_n| > C(\alpha)$, i.e. the sequence has a change point. $C(\alpha)$ is the critical value and α is the significance level. Typical values of $C(\alpha)$ include C(0.05) = 1.358 and C(0.01) = 1.628.

To give an estimate of the change point, we define (7) and (8) Let $\hat{r}$ be the estimate of $r$, which is defined as: (9) Since the test statistic $D_n$ measures the difference between two distribution functions, a larger $|D_n|$ indicates more significant DGE, while a positive $D_n$ corresponds to under-expression and a negative $D_n$ corresponds to over-expression.
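The equations of this section are not legible in the text above, so the following is only a minimal sketch of the general idea and not the NPCPS formula itself. It swaps in the classical two-sample Kolmogorov-Smirnov distance, scaled by sqrt(n1·n2/(n1+n2)), as a stand-in for Dn, and keeps the sign so that a positive value means the cancer distribution lies to the left of the normal one (under-expression). The function names are hypothetical, and the thresholds are the standard Kolmogorov critical values 1.358 (α = 0.05) and 1.628 (α = 0.01).

```python
import math

# Standard asymptotic critical values of the Kolmogorov distribution.
CRITICAL = {0.05: 1.358, 0.01: 1.628}

def edf(sample, x):
    """Empirical distribution function: fraction of sample values <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def signed_ks_statistic(normal, cancer):
    """Scaled two-sample Kolmogorov-Smirnov distance with its sign kept.

    ASSUMPTION: this is a stand-in for the paper's D_n, not its exact form.
    Positive: cancer EDF lies above the normal EDF (cancer values are lower).
    Negative: cancer EDF lies below the normal EDF (cancer values are higher).
    """
    grid = sorted(set(normal) | set(cancer))
    diffs = [edf(cancer, x) - edf(normal, x) for x in grid]
    largest = max(diffs, key=abs)          # signed difference of largest magnitude
    n1, n2 = len(normal), len(cancer)
    return math.sqrt(n1 * n2 / (n1 + n2)) * largest

def has_change(normal, cancer, alpha=0.01):
    """Reject 'no change' when the absolute statistic exceeds the critical value."""
    return abs(signed_ks_statistic(normal, cancer)) > CRITICAL[alpha]
```

With two identical groups the statistic is 0 and nothing is flagged; shifting every cancer value upwards drives it strongly negative, matching the sign convention described for over-expression.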

Results and Discussion

Methods compared with NPCPS

The gene expression profile obtained from microarray data is often considered as a g×n matrix, which contains g rows of genes with their expression levels in n samples, of which the normal group has n1 samples and the disease group has n2 samples. Let xij be the expression intensity of the ith gene in the jth sample of the normal group, where i = 1, 2, …, g and j = 1, 2, …, n1; let yij be the expression intensity of the ith gene in the jth sample of the disease group, where i = 1, 2, …, g and j = 1, 2, …, n2. The median of the ith gene over all samples is defined as

$$\mathrm{med}_i = \mathrm{median}(x_{i1}, \ldots, x_{in_1}, y_{i1}, \ldots, y_{in_2}) \quad (10)$$

Define medix, the normal-group median of the ith gene, and mediy, the cancer-group median, as

$$\mathrm{med}_{ix} = \mathrm{median}(x_{i1}, \ldots, x_{in_1}) \quad (11)$$

$$\mathrm{med}_{iy} = \mathrm{median}(y_{i1}, \ldots, y_{in_2}) \quad (12)$$
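As a small illustration of the three medians just defined, the sketch below computes them for one gene from its normal-group and disease-group rows. The function name `gene_medians` and the toy values are hypothetical.

```python
from statistics import median

def gene_medians(x_row, y_row):
    """Return (overall median, normal-group median, disease-group median)
    for a single gene, given its normal (x) and disease (y) expression values."""
    return median(list(x_row) + list(y_row)), median(x_row), median(y_row)

# Toy gene: three normal samples and three disease samples.
overall, med_x, med_y = gene_medians([1, 2, 3], [10, 20, 30])
```

Here the overall median (6.5) falls between the two group medians (2 and 20), which is why methods differ depending on which median they centre on.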

Parametric methods for DGE in cancer subset.

There have been many parametric methods proposed based on the mean, median and median absolute deviation (MAD) of the gene expression profile, and following are some typical methods.

COPA [10], [11]: COPA first normalizes the expression data using the group mean and MAD to prevent outliers from affecting the data distribution hypothesis, then sorts the expression values and detects cancer genes through the rth percentile of the cancer group. If the MAD of the ith gene is approximated as (13) the COPA statistic is defined as (14) where qr(i) is the rth percentile of the ith gene's expression values, with r subject to the user's selection.

OS [12]: OS introduces a heuristic rule as an additional function and also applies percentile knowledge to detect DGE. OS normalizes every gene to ensure the same data scale, which is convenient for gene comparison. In OS, gene expression values greater than Q3(i)+IQR(i) or smaller than Q1(i)−IQR(i) are statistically considered as DGE. The OS statistic is defined as below. (15) The group with over-expression is defined as (16) Similarly, the group with under-expression is defined as (17)

ORT [13]: Compared with OS, ORT is similar but uses the median of the normal group instead of the median of all data, and estimates the absolute error using the median of several groups instead of the square error as in COPA [11]. The purpose of these changes is to acquire more robust and consistent estimation. Accordingly, the estimate MAD in ORT is(18)The ORT statistic is(19)where Ci is the cancer group of the ith gene. For over expression,(20)Similarly, for under expression,(21)

MOST [14]: A gene with an expression value greater than MOSTik is considered differentially expressed. The test statistic MOSTik is defined as: (22) When k is unknown, the data are first normalized by μk and σk², and MOSTik is defined as (23)

Non-parametric methods for DGE in cancer subset.

PPST [9]: As a non-parametric method, PPST compares the expression levels of thousands of genes in two sample groups, i.e. the control group (A) and the case group (B). The detection focuses on genes in group A of which the expression levels are higher than a certain percentile of group B's expression values (A>B), which is a type I error in statistics, and vice versa (B>A). There are two marks for each gene, s1 and s2. s1 is the number of samples in group A that are higher than the 95% of group B plus the number of samples in group B that are lower than the 95% of group A. s2 is defined as the opposite of s1. Considering a given gene, if the expression value in group A is higher than the 95th percentile of group B, it is considered as over-expression; if the expression value is lower than that in group B, it is considered as under-expression. Define the PPST statistic of each gene with over-expression as: (24) The PPST statistic of a gene with under-expression can be obtained similarly.

LRS [20]: In LRS, cancer outlier samples are viewed as coming from a distribution with higher mean expression intensity than all the normal and other cancer samples. The purpose of LRS is to test for such unequal means. For up-regulation, LRS first organizes all the samples so that the non-cancer samples are arranged before the cancer samples, and the cancer samples are sorted by their expression intensities in ascending order. Sn is the summation of the expression intensities of all the samples and the LRS statistic is as follows, (25)

Traditional methods for DGE in entire cancer group.

T-statistic [5]: This traditional method assumes that the cancer sample group is generally over- or under-expressed compared with the normal samples. The t-statistic is defined as: (26) where $\bar{x}_i$ is the sample mean of the normal-group expression values, $\bar{y}_i$ is the sample mean of the cancer-group expression values, and Si is the estimate of the combined standard deviation. Differentially expressed genes are recognized when the test statistic exceeds a certain threshold.
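For reference, a common form of the two-sample t-statistic with a pooled standard deviation can be sketched as below. Equation (26) is not legible in the text, so the direction of the difference (cancer minus normal) and the pooled-variance form are assumptions, and the function name is hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(x_row, y_row):
    """Two-sample t-statistic with pooled variance for one gene.

    x_row: normal-group values, y_row: cancer-group values.
    ASSUMPTION: sign is (cancer mean - normal mean), so over-expression
    in the cancer group gives a positive value.
    """
    n1, n2 = len(x_row), len(y_row)
    pooled_var = ((n1 - 1) * variance(x_row) + (n2 - 1) * variance(y_row)) / (n1 + n2 - 2)
    return (mean(y_row) - mean(x_row)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_value = pooled_t([1, 2, 3], [4, 5, 6])
```

Since the statistic depends only on group means and variances, it targets a shift of the whole cancer group rather than of a subset of samples.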

Monte Carlo simulation

Monte Carlo simulation can be used to evaluate the performance of a hypothesis test in terms of the rate of Type I error, i.e. the false positive rate (FPR). For each Monte Carlo simulation, NPCPS was applied to an artificial 7000-gene dataset (normal random numbers with mean = 0, standard deviation sd = 1) composed of n1 normal samples and n2 cancer samples, of which k (0<k<n2) cancer samples contained DGE simulated by adding a constant μ to the original normal random numbers. Multiple simulations were carried out for different values of sample size n, DGE sample size k, and significance level α. The FPR (Table 1) and the average estimate of the change point (Table 2) were computed, and the results of the simulation with α = 0.01 are illustrated in Fig. 1. For the dataset with n1 = n2 = 25 (Fig. 1A), the FPR was larger when k was smaller and decreased as k increased; when k was equal to or larger than 9, the detection accuracy of NPCPS was sufficient to satisfy the significance level. For the dataset with n1 = n2 = 50 (Fig. 1B), k should be at least 9 to satisfy the significance level. The estimate of the change point improved greatly as k increased; the estimated position became very close to the actual position at the same time as the FPR dropped below the significance level. This indicates that NPCPS is highly sensitive to the left boundary and less sensitive to the right boundary when the F2 information is not sufficient [21].
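The simulated dataset described above can be generated along these lines. This is a sketch under assumptions: the text does not say which k of the n2 cancer samples carry the shift, so the first k are used here, and the function name and seed handling are hypothetical.

```python
import random

def simulate_dataset(genes, n1, n2, k, mu, seed=0):
    """Generate `genes` rows of (normal, cancer) expression values.

    All values are drawn from N(0, 1); the constant `mu` is added to k of
    the n2 cancer samples to simulate DGE.
    ASSUMPTION: the shifted samples are the first k cancer samples.
    """
    if not 0 < k < n2:
        raise ValueError("k must satisfy 0 < k < n2")
    rng = random.Random(seed)
    data = []
    for _ in range(genes):
        normal = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        cancer = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        for j in range(k):
            cancer[j] += mu        # simulated differential expression
        data.append((normal, cancer))
    return data

# Small example; the study's setting would be e.g. genes=7000, n1=n2=25.
small = simulate_dataset(genes=5, n1=4, n2=6, k=2, mu=100.0, seed=1)
```

Fixing the seed makes a run repeatable, which helps when comparing detection rates across values of k and α.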

Figure 1. FPR and estimate of change-point position.

(A) Monte Carlo simulation results of dataset with size n1 = n2 = 25 and significance level α = 0.01. (B) Monte Carlo simulation results of dataset with size n1 = n2 = 50 and significance level α = 0.01. The x-axis is k, the number of samples in the simulated dataset that contained DGE. The trends of the curves in (A) and (B) were similar. Both the FPR and the estimate of the change point improved with increasing k. When k>9, the difference between the true change-point and the estimated change-point was very small, and the FPR of NPCPS became lower than the significance level α, which indicated that the hypothesis test of NPCPS passed the Monte Carlo simulation.

https://doi.org/10.1371/journal.pone.0020060.g001

Table 2. Actual and average estimate of change point using NPCPS in Monte Carlo simulation.

https://doi.org/10.1371/journal.pone.0020060.t002

ROC analysis on simulated data

First, we tested NPCPS (α = 0.01) and seven other methods, namely COPA, ORT, OS, MOST, T, LRS, and PPST, on normally distributed datasets (mean = 0, sd = 1) with different μ, n and k. As k increased, all methods produced better ROC curves (Fig. 2 and Fig. 3). For μ = 2, when n = 50 (Fig. 2A–2C), NPCPS was slightly weaker than LRS and better than the other methods; when n = 100 (Fig. 2D–2F), NPCPS was very similar to LRS and better than the other methods. For μ = 1, NPCPS gave the best performance for both the n = 50 and n = 100 datasets and for different values of k (Fig. 3A–3F). This indicated that NPCPS had better sensitivity for less significant DGE compared with the other seven methods. Among the non-parametric methods, PPST was not significantly better than the parametric methods, while LRS and NPCPS were consistently better than the other methods. This indicated that methods based on change points were more effective and robust than methods based on percentile and MAD.

Figure 2. Selected ROC curves of normal dataset with μ = 2.

(A) n1 = n2 = 25, k = 3. (B) n1 = n2 = 25, k = 5. (C) n1 = n2 = 25, k = 9. (D) n1 = n2 = 50, k = 1. (E) n1 = n2 = 50, k = 4. (F) n1 = n2 = 50, k = 9. The x-axis is FPR, and the y-axis is TPR. The significance level α = 0.01 for NPCPS. Larger area under ROC curves indicates better sensitivity and specificity. An ROC curve along the diagonal line indicates random-guess.

https://doi.org/10.1371/journal.pone.0020060.g002

Figure 3. Selected ROC curves of normal dataset with μ = 1.

(A) n1 = n2 = 25, k = 6. (B) n1 = n2 = 25, k = 9. (C) n1 = n2 = 25, k = 14. (D) n1 = n2 = 50, k = 6. (E) n1 = n2 = 50, k = 9. (F) n1 = n2 = 50, k = 15. The x-axis is FPR, and the y-axis is TPR. The significance level α = 0.01 for NPCPS. Larger area under ROC curves indicates better sensitivity and specificity. An ROC curve along the diagonal line indicates random-guess.

https://doi.org/10.1371/journal.pone.0020060.g003

Second, we tested NPCPS (α = 0.01) and the other seven methods on datasets generated from a skew-normal (SN) distribution (Fig. 4). For different n and k, NPCPS had a significantly larger area under the ROC curve than the other methods. Comparing Fig. 2 and Fig. 4 shows that NPCPS was effective for both normal and skew-normal data distributions, and gave similarly good ROC curves when k = 9 (Fig. 2C compared with Fig. 4C, Fig. 2F compared with Fig. 4E). The other seven methods, including the non-parametric methods LRS and PPST, had inferior results when working with skew-normal data.

Figure 4. Selected ROC curves of skew-normal dataset.

(A) n1 = n2 = 25, μ = 2, k = 3. (B) n1 = n2 = 25, μ = 2, k = 5. (C) n1 = n2 = 25, μ = 2, k = 9. (D) n1 = n2 = 50, μ = 2, k = 4. (E) n1 = n2 = 50, μ = 2, k = 9. (F) n1 = n2 = 25, μ = 1, k = 6. (G) n1 = n2 = 25, μ = 1, k = 9. (H) n1 = n2 = 50, μ = 1, k = 9. (I) n1 = n2 = 50, μ = 1, k = 14. The x-axis is FPR, and the y-axis is TPR. The significance level α = 0.01 for NPCPS. Larger area under ROC curves indicates better sensitivity and specificity. An ROC curve along the diagonal line indicates random-guess.

https://doi.org/10.1371/journal.pone.0020060.g004

The AUC values of the ROC curves are summarized in Tables 3 and 4.

Table 3. AUC of ROC curves of the simulation on data in normal distribution.

https://doi.org/10.1371/journal.pone.0020060.t003

Table 4. AUC of ROC curves of the simulation on data in skew-normal distribution.

https://doi.org/10.1371/journal.pone.0020060.t004
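The area under an ROC curve can be computed directly from per-gene scores and the known DGE labels of a simulated dataset. The sketch below uses the rank-based (Mann-Whitney) form of the AUC; it is a generic illustration, not the authors' evaluation code, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve.

    scores: one test statistic per gene (larger means 'more likely DGE').
    labels: True for genes simulated with DGE, False otherwise.
    Equals the probability that a random DGE gene outscores a random
    non-DGE gene, counting ties as one half.
    """
    pos = [s for s, is_dge in zip(scores, labels) if is_dge]
    neg = [s for s, is_dge in zip(scores, labels) if not is_dge]
    if not pos or not neg:
        raise ValueError("need at least one gene of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [True, True, False, False]
auc_perfect = roc_auc([0.9, 0.8, 0.2, 0.1], labels)
auc_inverted = roc_auc([0.1, 0.2, 0.8, 0.9], labels)
```

An AUC of 0.5 corresponds to the diagonal random-guess line mentioned in the figure captions, and an AUC near 0 means the scores would rank well if their sign were flipped.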

DGE detection in breast-cancer microarray data

The microarray data used in the experiment are provided by West [22]. In their experiment, primary breast tumors (between 1.5 and 5 cm in maximal dimension) from the Duke Breast Cancer SPORE frozen tissue bank were selected and diagnosed as invasive ductal carcinoma. In each case, a diagnostic axillary lymph node dissection was performed. The final dataset includes 49 samples, of which 25 have negative lymph nodes and 24 have positive lymph nodes, used here as normal samples and cancer samples, respectively. The gene expression profile of 7129 genes was obtained through the annotation package hu6800 [23]. The original gene expression values ranged from 34 to 43053 and were initialized to the range from 3.5 to 10.7. Seven detection methods (t-statistic, COPA, OS, ORT, MOST, PPST and LRS) were applied to the initialized gene expression profile, while NPCPS was applied to the original data. The test statistics calculated for these 7129 genes by each method were sorted in descending order.
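The text does not name the transformation used to initialize the data. As a quick consistency check, and purely as an assumption, a natural-log transform reproduces the reported range:

```python
import math

raw_min, raw_max = 34, 43053               # original range reported in the text
log_min = math.log(raw_min)                # natural log (assumed transform)
log_max = math.log(raw_max)
print(f"{log_min:.1f} to {log_max:.1f}")   # prints: 3.5 to 10.7
```

Other common choices do not match: log2 would give roughly 5.1 to 15.4 and log10 roughly 1.5 to 4.6.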

For NPCPS, C(0.01) = 1.628 was selected, which yields a detecting result of 1978 DGE genes. Fig. 5 shows the distribution of the estimated change-point positions in the expression values of these genes. We selected the first 30 genes ranked by NPCPS and searched PubMed and other databases to check whether these genes were relevant to breast cancer or other known cancers. Out of the first 30 genes identified by NPCPS, 16 have been reported as relevant to breast cancer or other cancers (as shown in Table 5 and Table 6, separated according to Dn value). The gene expression values and the change-point (CP) positions of the cancer-relevant genes are illustrated in Fig. 6 and Fig. 7. From Fig. 6 and 7, it can be seen that the estimated change-point positions successfully locate the change in the trend of the gene expression values.

Figure 5. Change-point distribution of DGE genes when C(0.01) = 1.628.

(A) CP-position distribution of 989 genes with positive Dn>1.628. (B) CP-position distribution of 989 genes with negative Dn<−1.628.

https://doi.org/10.1371/journal.pone.0020060.g005

Figure 6. Expression value and Change-point of top-ranked DGE genes with positive Dn.

(A) PDE4B, change-point at sample 27. (B) C9, change-point at sample 27. (C) ITK, change-point at sample 27. (D) TCF3, change-point at sample 41. (E) JAG1, change-point at sample 31. (F) HMGA2, change-point at sample 16. (G) RARRES1, change-point at sample 26. (H) PRKCB, change-point at sample 15. CP position correctly locates the change in the trend of the gene expression value.

https://doi.org/10.1371/journal.pone.0020060.g006

Figure 7. Expression value and Change-point of top-ranked DGE genes with negative Dn.

(A) AGER, change-point at sample 27. (B) MAPK14, change-point at sample 27. (C) SLC5A5, change-point at sample 26. (D) MMP11, change-point at sample 32. (E) NCSTN, change-point at sample 43. (F) WRN, change-point at sample 27. (G) ENG, change-point at sample 27. (H) TRADD, change-point at sample 27. CP position correctly locates the change in the trend of the gene expression value.

https://doi.org/10.1371/journal.pone.0020060.g007

Table 5. Results and description of top-ranked genes with positive test statistic Dn.

https://doi.org/10.1371/journal.pone.0020060.t005

Table 6. Results and description of top-ranked genes with negative test statistic Dn.

https://doi.org/10.1371/journal.pone.0020060.t006

Moreover, comparing the ranking results of all eight methods based on the test statistic showed that genes top-ranked by NPCPS were ranked considerably lower by the other methods, most of which are mean- and median-based parametric methods. When inspecting all 7129 genes, the overall trend in ranking difference between NPCPS and the other methods became more obvious. Table 7 shows the pair-wise linear correlation of gene ranking among the six methods. For NPCPS, the positive correlation with OS is below 0.007, and the negative correlations with the other methods are below 0.047 in magnitude. This indicated that NPCPS had much less correlation with the other five methods, among which the correlations were all positive and valued around 0.5 with each other. This will be further discussed in the following section. If NPCPS is combined with other methods, it would help to identify genes that are considered less significant by the other seven methods.

Table 7. Ranking relevance between DGE detecting methods.

https://doi.org/10.1371/journal.pone.0020060.t007
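A pair-wise linear correlation of gene rankings can be computed by ranking the genes under each method and taking the Pearson correlation of the two rank vectors. The sketch below is a generic illustration with hypothetical function names; it assumes genes are ranked in descending order of the test statistic and that there are no ties.

```python
import math
from statistics import mean

def ranks_descending(stats):
    """Rank 1 goes to the largest statistic (ties not handled specially)."""
    order = sorted(range(len(stats)), key=lambda i: -stats[i])
    ranks = [0] * len(stats)
    for position, gene_index in enumerate(order, start=1):
        ranks[gene_index] = position
    return ranks

def pearson(a, b):
    """Pearson linear correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranking_correlation(stats_a, stats_b):
    """Correlation between the gene rankings induced by two methods."""
    return pearson(ranks_descending(stats_a), ranks_descending(stats_b))

same = ranking_correlation([5, 3, 9, 1], [50, 30, 90, 10])
opposite = ranking_correlation([5, 3, 9, 1], [-5, -3, -9, -1])
```

Values near 0, as reported for NPCPS against the other methods, mean that knowing a gene's rank under one method says almost nothing about its rank under the other.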

Discussions on the biological significance of NPCPS

Non-parametric statistics.

As a non-parametric statistic, NPCPS does not rely on the assumption that the data are drawn from a given probability distribution. It is applicable to input data derived from various types of distributions and does not require data pre-processing. As such it is the opposite of parametric statistics, which perform worse when the input data do not follow the assumed distribution, as in the ROC simulation on normal and skew-normal datasets (Fig. 4).

No restriction on both over expression and under expression.

The gene expression profile generated from microarray data usually contains samples of thousands of genes. Genes in the cancer samples might be over- or under-expressed. The majority of DGE detecting methods have different formulas for under-expressed and over-expressed genes. For example, OS and ORT use different percentile values for over-expression and under-expression, and apply both formulas to the same microarray data. If the over-expression formula is applied to under-expressed data, the DGE cannot be correctly recognized. However, the detected results might contain false alarms, since both the over-expression and under-expression formulas are applied to the same gene, which might be detected as DGE significant twice. Unlike the other methods, NPCPS works for both types of DGE by using the same formula, which would reduce the FDR and does not require further analysis and computation to clean the false alarms. When the over-expression formula was applied to under-expressed gene data (Fig. 8A), and vice versa (Fig. 8B), NPCPS presented stable performance in both situations, while the other compared methods gave inferior ROC curves. According to the characteristics of ROC, T and MOST could have good ROC curves if the prediction result was inverted. The ROC curves of LRS were in the zone of random guessing, close to the line of no discrimination. To use LRS for under-expression, the user could turn under-expression into over-expression by inverting the dataset. This indicated that when the over-expression formula of LRS was applied to under-expression, a random detecting result would be given.

Figure 8. ROC curves of NPCPS and other methods when inappropriate formula applied.

(A) Over-expression formula applied to simulated under-expressed genes, n1 = n2 = 25, μ = −2, k = 8. (B) Under-expression formula applied to simulated over-expressed genes, n1 = n2 = 25, μ = 2, k = 8. The x-axis is FPR, and the y-axis is TPR. ▽ is T, × is COPA, ○ is OS, • is ORT, ◊ is MOST, dotted line is LRS, dashed line is PPST, and solid line is NPCPS. The significance level α = 0.01 for NPCPS. NPCPS maintained the same level of sensitivity when applied to both types of simulated DGE. The other methods were not able to give results as good as when the appropriate functions were applied, as in Fig. 2 and Fig. 3.

https://doi.org/10.1371/journal.pone.0020060.g008

Estimated change point position:

The biological meaning of the estimated change-point position lies in the fact that once the position of the change point is estimated or located, we can identify which samples contain DGE. Then, rather than identifying DGE existence in n = n1+n2 samples on the single-gene level, we can learn, for one sample containing thousands of genes, how many genes were over-expressed or under-expressed. This statistical information can be used to analyze the features of each sample, and the results could be applied to estimating the degree of differentiation of cancer in different development stages.

Distance between two distribution function: Dn

NPCPS results showed that, among the 7129 genes, 3608 had negative Dn, while the remaining 3521 had positive Dn. NPCPS uses Dn to evaluate the change in distribution between normal and cancer samples, and directly indicates the DGE type as either over-expressed or under-expressed. This feature is supported by the expression values in Fig. 6 and 7, where Fig. 6 (positive Dn) shows typical under-expression and Fig. 7 (negative Dn) shows typical over-expression. Fig. 9 and Fig. 10 illustrate the relationship between Dn and DGE in a more intuitive manner, giving the cumulative data distributions of several typically ranked genes.

Figure 9. Data distributions of genes top-ranked by NPCPS.

(A) IQGAP1: rank 19, positive Dn. (B) PIP5K1B: rank 20, positive Dn. (C) UBB: rank 21, negative Dn. (D) RFC1: rank 22, negative Dn. Genes top-ranked by NPCPS had significant differences between the data distributions of the cancer and normal groups. Comparing the empirical distributions of cancer and normal samples, in (A) and (B) the distribution of the cancer group was significantly to the left of the distribution of the normal group, which demonstrated under-expression; in (C) and (D) the distribution of the cancer group was significantly to the right of the distribution of the normal group, which demonstrated over-expression. The distribution curves were consistent with the biological significance of the Dn value.

https://doi.org/10.1371/journal.pone.0020060.g009

Figure 10. Data distribution of genes bottom-ranked by NPCPS.

(A) SLC6A8: rank 500. (B) HLF: rank 1000. (C) ATP5F1: rank 2000. (D) HLA-H: rank 3000. (E) ODF3B: rank 4000. (F) SLC20A2: rank 5000. (G) SGSH: rank 6000. (H) CXCR2: rank 7000. From the empirical data distribution, the differences between cancer and normal groups in (A)–(D) were very small, which corresponded with the Dn value.

https://doi.org/10.1371/journal.pone.0020060.g010

Genes in Fig. 9 were top-ranked by NPCPS, where Fig. 9A and 9B correspond to positive Dn (under-expression) and Fig. 9C and 9D correspond to negative Dn (over-expression). Comparing the empirical distributions of cancer and normal samples, in Fig. 9A and 9B the cancer group was significantly to the left of the normal group, which demonstrated under-expression; in Fig. 9C and 9D, the cancer group was significantly to the right of the normal group, which demonstrated over-expression. The distribution graphs were consistent with the Dn values.

Genes in Fig. 10 were ranked lower by NPCPS. The cumulative distance between the data distributions of the normal and cancer groups is generally smaller than for the genes top-ranked by NPCPS. From the empirical data distributions, the differences between the cancer and normal groups were very small.

For comparison, Fig. 11A–11J shows the data distributions of the genes top-ranked by the parametric methods, and Fig. 12A–12D those top-ranked by LRS and PPST. The data distributions were more similar to those of genes bottom-ranked by NPCPS, in that a small percentage of the samples brought a significant increase to the data range. These few samples would greatly impact the cancer-group mean or median, which consequently results in a high test statistic for the parametric methods. For example, in Fig. 11B and 12B, 96% of the two curves were close to each other while 4% of the data points in the normal group had much greater values, which corresponds to one outlier sample out of the 25 normal samples. Considering that the outliers were in the normal group, it was reasonable to assume that these outliers might be caused by microarray noise. For the rest of Fig. 11 and 12, except for the T-statistic, the cancer group had one outlier. Fig. 11 and 12 indicate that the compared methods are sensitive to a significant change in mean and median, even when the change is introduced by a single sample that might be an outlier. NPCPS is less prone to report DGE in such cases, as so few outliers are not sufficient to produce a large Dn.
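The contrast drawn above, between statistics built on the mean and a distance between distribution functions, can be illustrated with a toy group of 25 samples in which one value is replaced by an extreme outlier. The numbers and helper names below are hypothetical; the point is only that one sample out of 25 can move the mean arbitrarily far while moving the empirical distribution function by at most 1/25 = 0.04.

```python
from statistics import mean

def edf(sample, x):
    """Empirical distribution function: fraction of sample values <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def max_edf_shift(a, b):
    """Largest vertical gap between the EDFs of two samples."""
    grid = sorted(set(a) | set(b))
    return max(abs(edf(a, x) - edf(b, x)) for x in grid)

clean = [float(v) for v in range(25)]   # 25 well-behaved samples
noisy = clean[:-1] + [1000.0]           # same group with one extreme outlier

mean_shift = mean(noisy) - mean(clean)  # 39.04: the mean moves a lot
edf_shift = max_edf_shift(clean, noisy) # 0.04: the EDF moves by one step
```

This is consistent with the observation that a single outlier sample accounts for 4% of a 25-sample curve and is not enough, on its own, to create a large distance between distributions.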

Figure 11. Data distributions of genes top-ranked by five parametric methods.

(A) Gene TFF1, COPA rank: 8, NPCPS rank: 5281. (B) Gene ERBB2, COPA rank: 18, NPCPS rank: 6256. (C) Gene ZNF44, OS rank: 6, NPCPS rank: 7057. (D) Gene RGS2, OS rank: 24, NPCPS rank: 6222. (E) Gene TCN1, ORT rank: 7, NPCPS rank: 7113. (F) Gene TRAPPC10, ORT rank: 9, NPCPS rank: 6732. (G) Gene CLDN10, T rank: 22, NPCPS rank: 1107. (H) Gene CSNK1E, T rank: 24, NPCPS rank: 1163. (I) Gene COL2A1, MOST rank: 10, NPCPS rank: 7126. (J) Gene SCG2, MOST rank: 10, NPCPS rank: 2585. Top-ranked genes by the five parametric methods did not have significant difference between the data distributions of cancer and normal groups.

https://doi.org/10.1371/journal.pone.0020060.g011

Figure 12. Data distributions of genes top-ranked by two non-parametric methods.

(A) Gene MGP, LRS rank: 11, NPCPS rank: 7049. (B) Gene IGF2, LRS rank: 21, NPCPS rank: 7094. (C) Gene TNNC1, PPST rank: 21, NPCPS rank: 3697. (D) Gene E4F1, PPST rank: 47, NPCPS rank: 2595. Top-ranked genes by the two non-parametric methods did not have significant difference between the data distributions of cancer and normal groups.

https://doi.org/10.1371/journal.pone.0020060.g012

In summary, NPCPS is less sensitive to right boundaries and tends to find genes that have a greater cumulative distance between the data distributions of the normal and cancer groups. For such genes, the samples in the normal and cancer groups may share the same data range but should have very different distributions. Therefore, the detection results of NPCPS differ from those of the compared methods, which are more sensitive to outliers that influence the data range than to the cumulative distance between distributions. In other words, NPCPS favors a continuous change in the data distribution over the whole data range, while the other methods look for a significant change in mean or median. This would explain the low correlation between NPCPS and the other methods.
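The idea that two groups can share the same data range and still differ greatly in distribution can be made concrete with a short sketch. The function `ecdf_area_gap` is a hypothetical measure of "cumulative distance" (the area between two empirical distribution functions, which equals the one-dimensional Wasserstein-1 distance); it is an assumption for illustration and not the statistic defined by NPCPS. The data are invented.

```python
import numpy as np

def ecdf_area_gap(a, b):
    """Area between the empirical distribution functions of `a` and `b`.

    Hypothetical 'cumulative distance' used only for illustration.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.sort(np.concatenate([a, b]))
    # ECDF values are constant on each interval [grid[i], grid[i+1]).
    fa = np.searchsorted(a, grid[:-1], side="right") / a.size
    fb = np.searchsorted(b, grid[:-1], side="right") / b.size
    return float(np.sum(np.abs(fa - fb) * np.diff(grid)))

# Invented toy data: both groups span exactly the range [0, 10].
normal = np.array([0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0])
cancer = np.array([0.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 10.0])

same_range = np.ptp(normal) == np.ptp(cancer)
area = ecdf_area_gap(cancer, normal)
print(same_range, area)
```

Both groups have the same minimum and maximum, but between 1 and 9 their empirical distributions differ by 0.75, giving an area of 0.75 × 8 = 6. A range-based comparison sees no difference here, while a distribution-based distance does.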

Conclusion

A non-parametric statistical method, NPCPS, was proposed for DGE detection based on change-point theory. NPCPS uses the data distributions of the normal and cancer samples as its only input to detect a change point that indicates DGE, in order to identify potential cancer genes. The distribution-based NPCPS does not require data pre-initialization and is computationally efficient compared with the other median-based parametric methods. In contrast to the compared methods, NPCPS handles both over-expression and under-expression with the same equation. Another unique feature is that NPCPS can estimate both the number and the location of the cancer samples with DGE. The simulation study and experiments showed that the proposed NPCPS method had better reliability and accuracy: NPCPS was more effective than the compared parametric methods; it gave ROC curves similar to those of LRS when the sample size was larger; and when the simulated DGE value was smaller, i.e. DGE was less significant, NPCPS had better sensitivity than the other seven methods. The simulations also indicated that, for a cancer subgroup with size greater than 8, NPCPS had an FPR of less than 0.01. In addition, the detection results of NPCPS had very low correlation with those of the compared methods, both parametric and non-parametric, which indicates that NPCPS provides meaningful detection results that differ from those of other methods. Since cancer samples could be categorized according to different stages of cancer development, DGE detection can also be considered a multi-class problem. Further effort could be focused on multiple change points in the distribution of microarray gene expression profiles.

Author Contributions

Conceived and designed the experiments: YW ZJ CW YL. Performed the experiments: YW ZJ. Analyzed the data: YW CW ZJ YL. Contributed reagents/materials/analysis tools: YW CW ZJ BW YL. Wrote the paper: YW CW ZJ YL.

References

  1. Tibshirani R, Hastie T (2007) Outlier sums for differential gene expression analysis. Biostatistics 8: 2–8.
  2. Magic Z, Radulovic S, Brankovic-Magic M (2007) cDNA microarrays: identification of gene signatures and their application in clinical practice. Journal of B.U.ON 12 Suppl 1: S39–44.
  3. Brent R (2000) Genomic biology. Cell 100: 169–183.
  4. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, et al. (2000) Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 24(3): 227–235.
  5. Sørlie T, Tibshirani R, Parker J, Hastie T, Marron JS, et al. (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 100: 8418–8423.
  6. Efron B, Tibshirani R, Storey J, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96: 1151–1160.
  7. Pan W, Lin J, Le C (2003) A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics 3(3): 117–124.
  8. Storey JD, Tibshirani R (2003) SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In: Parmigiani G, Garrett ES, Irizarry RA, Zeger SL, editors. The Analysis of Gene Expression Data: Methods and Software. New York: Springer.
  9. Lyons-Weiler J, Patel S, Becich MJ, Godfrey TE (2004) Tests for finding complex patterns of differential expression in cancers: towards individualized medicine. BMC Bioinformatics 5: 110.
  10. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310: 644–648.
  11. MacDonald JW, Ghosh D (2006) COPA-cancer outlier profile analysis. Bioinformatics 22: 2950–2951.
  12. Tibshirani R, Hastie T (2007) Outlier sums for differential gene expression analysis. Biostatistics 8: 2–8.
  13. Wu B (2007) Cancer outlier differential gene expression detection. Biostatistics 8(3): 566–575.
  14. Lian H (2008) MOST: detecting cancer differential gene expression. Biostatistics 9(3): 411–418.
  15. Basseville M, Nikiforov IV (1993) Detection of Abrupt Changes: Theory and Application. Englewood Cliffs: Prentice Hall.
  16. Muggeo V, Adelfio G (2011) Efficient change point detection for genomic sequences of continuous measurements. Bioinformatics 27: 161–166.
  17. Lèbre S, Becq J, Devaux F, Stumpf M, Lelandais G (2010) Statistical inference of the time-varying structure of gene-regulation networks. BMC Systems Biology 4: 130.
  18. Zhang Y, Hatch KA, Wernisch L, Bacon J (2008) A Bayesian change point model for differential gene expression patterns of the DosR regulon of Mycobacterium tuberculosis. BMC Genomics 9: 87.
  19. Gottardo R, Raftery AE, Yeung KY, Bumgarner RE (2006) Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 62: 10–18.
  20. Hu JH (2008) Cancer outlier detection based on likelihood ratio test. Bioinformatics 24(19): 2193–2199.
  21. Tan ZP, Miao BQ (2000) Nonparametric Statistical Inference for Distribution Change Point Problems. Journal of China University of Science and Technology 6(3): 270–276.
  22. West M, Blanchette C, Dressman H, Huang E, Ishida S, et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 98: 11462–11467.
  23. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5: R80.