
Rare Variants Association Analysis in Large-Scale Sequencing Studies at the Single Locus Level

  • Xinge Jessie Jeng ,

    Contributed equally to this work with: Xinge Jessie Jeng, Zhongyin John Daye

    Affiliation Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America

  • Zhongyin John Daye ,

    Contributed equally to this work with: Xinge Jessie Jeng, Zhongyin John Daye

    Affiliation Epidemiology and Biostatistics, University of Arizona, Tucson, Arizona, United States of America

  • Wenbin Lu,

    Affiliation Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America

  • Jung-Ying Tzeng

    jytzeng@stat.ncsu.edu

    Affiliations Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America, Department of Statistics, National Cheng-Kung University, Tainan, Taiwan

Abstract

Genetic association analyses of rare variants in next-generation sequencing (NGS) studies are fundamentally challenging due to the presence of a very large number of candidate variants at extremely low minor allele frequencies. Recent developments often focus on pooling multiple variants to provide association analysis at the gene instead of the locus level. Nonetheless, pinpointing individual variants is a critical goal for genomic research, as such information can facilitate the precise delineation of molecular mechanisms and functions of genetic factors on diseases. Due to the extreme rarity of mutations and high dimensionality, significances of causal variants cannot easily stand out from those of noncausal ones. Consequently, standard false-positive control procedures, such as the Bonferroni and false discovery rate (FDR), are often impractical to apply, as a majority of the causal variants can only be identified along with a small but unknown number of noncausal variants. To provide informative analysis of individual variants in large-scale sequencing studies, we propose the Adaptive False-Negative Control (AFNC) procedure that can include a large proportion of causal variants with high confidence by introducing a novel statistical inquiry to determine those variants that can be confidently dispatched as noncausal. The AFNC provides a general framework that can accommodate a variety of models and significance tests. The procedure is computationally efficient and can adapt to the underlying proportion of causal variants and quality of significance rankings. Extensive simulation studies across a plethora of scenarios demonstrate that the AFNC is advantageous for identifying individual rare variants, whereas the Bonferroni and FDR are exceedingly over-conservative for rare variants association studies. In the analyses of the CoLaus dataset, AFNC has identified individual variants most responsible for gene-level significances. Moreover, single-variant results using the AFNC have been successfully applied to infer related genes with annotation information.

Author Summary

Next-generation sequencing technologies have allowed genetic association studies of complex traits at the single base-pair resolution, where most genetic variants have extremely low mutation frequencies. These rare variants have been the focus of modern statistical-computational genomics due to their potential to explain missing disease heritability. The identification of individual rare variants associated with diseases can provide new biological insights and enable the precise delineation of disease mechanisms. However, due to the extreme rarity of mutations and large numbers of variants, significances of causative variants tend to be mixed inseparably with a few noncausative ones, and standard multiple testing procedures controlling for false positives fail to provide a meaningful way to include a large proportion of the causative variants. To address the challenge of detecting weak biological signals, we propose a novel statistical procedure, based on false-negative control, to provide a practical approach for variant inclusion in large-scale sequencing studies. By determining those variants that can be confidently dispatched as noncausative, the proposed procedure offers an objective selection of a modest number of potentially causative variants at the single-locus level. Results can be further prioritized or used to infer disease-associated genes with annotation information.

This is a PLOS Computational Biology Methods paper.

Introduction

Recent advances in next-generation sequencing (NGS) technologies have extended the focus of genetic studies of complex traits from common to rare variants. Having low minor allele frequencies (MAFs), usually defined to be less than 1% to 5%, rare variants are often evolved from recent mutations that have not yet been subjected to the pruning mechanism of natural selection and can potentially retain a larger proportion of inheritable variability than common variants. [1–5] Recent studies have already implicated the relevance of rare variants on several complex traits. [6–13]

Despite its potential to uncover genetic factors contributing to missing disease heritability, the analysis of rare variants association studies bears fundamental challenges. As only a small proportion of samples may carry variant alleles at each locus, associations of individual rare variants are often underpowered. [1, 14, 15] Moreover, the number of candidate variants can be extremely large in high-throughput sequencing studies, in which available multiple testing strategies may impose excessively severe corrections, preventing the selection of potentially causal variants. [16]

Recent proposals for rare variants association analysis often resort to collapsing or pooling multiple variants in a gene or pathway. Examples include the combined multivariate collapsing (CMC) [17], cohort allelic sum (CAST) [18], C-alpha [19], sum of squared scores [20–23], sequence kernel association (SKAT) [24], quality-weighted multivariate score association (qMSAT) [25], and similarity-based regression (simReg) [26] tests. The strategy increases power by aggregating effects of low-frequency variants and decreasing data dimension in multiple testing. It has been successfully applied in several applications that identified functional regions that may contain potentially relevant rare variants. [17–20, 23–26]

Nonetheless, variants-pooling tests that aggregate over a gene or pathway do not provide information at the individual locus and are ill-equipped to tap the full potential of NGS data in identifying causative mutations at the single-nucleotide resolution. Pinpointing potentially causal variants is a critical goal of genomic studies because such information would facilitate precise delineations of molecular mechanisms and functions of genetic factors on diseases. [27] Moreover, studies have shown that pooling over multiple variants may result in reduced power, as the inclusion of many noncausal variants may dilute the effects of relevant variants on a trait. [28–30] Thus, pooling over multiple variants can sometimes be inadequate for the identification of functional genomic regions.

On the other hand, analysis of individual rare variants can provide practical advantages. Information of single-variant association can be used to pinpoint a small number of potentially causal variants for follow-up studies to facilitate the precise characterization of functions via molecular modeling and genetic experimentation, which are often too expensive and time-consuming to conduct for all variants in a gene. [27] Further, single-variant results can be utilized a posteriori to objectively infer disease-related genes or pathways by comparing with annotation and functional databases. [31–34] This is useful as gene-level results can oftentimes be uninformative when the significance of a few causal variants is diluted by a large number of noncausal ones in the same gene. In the Results section, we will illustrate both strategies for applying single-variant results using the CoLaus data set.

Genome-wide association (GWA) studies, as the pre-eminent means for genetic discovery over the last decade, have largely relied on statistical genomic tools that can identify common variants at the individual single-nucleotide polymorphism (SNP) level. [35] Standard procedures for GWA studies evaluate each variant individually. [36, 37] Potentially causal variants are identified by multiple-testing control on significances at each locus. The simplest strategy for multiple testing utilizes the Bonferroni correction that controls the family-wise error rate, or the probability of having one or more false positives. [38] However, the Bonferroni correction can often be too conservative for GWA studies in the presence of thousands of SNPs. [39] To address this issue, the false discovery rate (FDR) is often used instead; it provides a more liberal criterion by controlling the expected proportion, rather than the presence, of false positives. [40–42]

Despite being extremely successful for common variants in GWA studies [43–46], procedures based on false-positive control are often underpowered in NGS studies involving rare variants (as illustrated in Fig 1). New approaches are needed to provide a meaningful way for powerful variants selection in large-scale sequencing studies. Fig 1 compares the statistical landscape of rare variants analysis in NGS studies with that of common variants in GWA studies. In GWA studies, we observe three regions of statistical inference: the Signals (“S”) region where strongly associated variants can be readily identified by controlling false positives, the Noise (“N”) region where noncausal variants can be identified by controlling false negatives, and the Indistinguishable (“I”) region where causal and noncausal variants are inextricably mixed. [47, 48] We have recently developed theoretical characterizations for the three regions in high-dimensional data analysis. [49] In NGS studies with rare variants, the Signals region tends to be very narrow and can often degenerate due to extremely low MAF and high dimensionality. Consequently, few causal variants can be identified by evaluating false positives, and results can be very unstable due to random perturbations of noncausal variants.

Fig 1. Illustrations of regions of statistical inference for GWA and NGS studies.

The Signals (“S”), Indistinguishable (“I”), and Noise (“N”) regions are shown. False-positive control allows the selection of variants in the Signals region, whereas false-negative control selects from both the Signals and Indistinguishable regions. In NGS studies with rare variants, the Signals region often degenerates due to extremely low MAF and high dimensionality.

https://doi.org/10.1371/journal.pcbi.1004993.g001

To address the challenge of rare variants association analysis at the single-locus level, we propose the Adaptive False-Negative Control (AFNC) procedure in order to allow a large proportion of causal variants to be retained with high probability. Specifically, the AFNC applies a novel metric called the signal missing rate (Eq 2), defined as the probability of having a nontrivial proportion of false negatives among all causal variants (i.e., FN/s in Table 1), to achieve informative variant selection by controlling the signal missing rate to be small (see Methods section). That is, AFNC seeks to determine those variants that can be confidently dispatched as noncausal and identifies variants from both the Signals and Indistinguishable regions. The results can provide informative inference in NGS studies where the Signals region is very small or degenerate (Fig 1).

Table 1. Classifications of variants under multiple testing control.

https://doi.org/10.1371/journal.pcbi.1004993.t001

We note that this is quite different from classical methods that control false positives. For example, the Bonferroni controls for the presence of any false positives (i.e., FP ≥ 1), whereas the FDR controls for the expectation of the proportion FP/R when R > 0 (see Table 1). Neither of these involves the number of causal variants s; thus, they cannot be used for controlling the proportion of causal variants selected. On the other hand, the AFNC, based on the proportion FN/s or 1 − TP/s, allows powerful variants selection by controlling the type II error or 1 − statistical power. Although there may exist a corresponding control level for the FDR (albeit very large) that can include the variants selected by the AFNC at a given false-negative control level (see Results section), this corresponding FDR control level is not known a priori and is expected to vary haphazardly across different studies. An arbitrarily assigned FDR control level would be inefficient for controlling false negatives in NGS studies, as it can over- or under-select uncontrollably depending on the size of the Noise region. A corresponding control level usually does not exist for the stringent Bonferroni selection in large-scale sequencing studies (see Results section).

The AFNC provides a general framework that can accommodate a wide spectrum of models and test statistics, which may include biological prior knowledge and global genotype information (see Methods section). Moreover, it readily adapts to the quality of statistical tests employed. With improved quality of statistical tests, the Indistinguishable region (see Fig 1) narrows, and the AFNC can, in turn, select a smaller set of potentially causal variants. Extensive studies (see Results section) demonstrate that the AFNC can identify a modest number of potentially causal variants while avoiding a deluge of noncausal ones for follow-up analyses that focus on targeted variants. Our proposal employs recent developments in ultra high-dimensional statistical inference to derive a data-driven procedure that can readily adapt to the underlying sparsity and effect sizes of the data. [50–53] It readily controls type I error rates (see Results section). In addition, it is computationally very efficient and is applicable to whole-genome sequencing (WGS) and whole-exome sequencing (WES) studies.

Results

The AFNC provides a general framework for including a high proportion of causal variants. It can accommodate a spectrum of models and significance tests. The procedure (detailed in the Methods section) consists of three major steps: (i) based on a given model and significance test, obtain the test statistics and their p-values for each of the d variants and order them, (ii) estimate the signal proportion among the d variants (denoted by π̂) using Eq 4, and (iii) compute the AFNC cut-off position by controlling the signal missing rate at level β using Eq 3 and report the top variants as potentially causal. The AFNC is designed to allow researchers to select a modest number of potential variants while encompassing the causal ones with high confidence. Below we use simulation studies and data applications to illustrate the utility of AFNC.

Simulation studies

Simulation designs.

We obtained 10,000 haplotypes for a 25Mb region simulated by COSI 1.2 (http://www.broadinstitute.org/~sfs/cosi) according to a coalescent model that emulates the linkage-disequilibrium (LD) pattern and history of the European population using default parameters. [54] For each subject i, i = 1, ⋯, n, we randomly drew two haplotypes with replacement from the 10,000 haplotypes to form its genotypes G_ij across variants j = 1, ⋯, d, where we assumed an additive genetic model such that G_ij ∈ {0, 1, 2} is the number of minor alleles at locus j. For an experiment with sample size n, we focused on evaluating rare variants with MAF < 1/√(2n), where the threshold was derived from statistical theory and has been employed in providing principled demarcations of rare and common variants in recent literature. [52, 53, 55] It incorporates sample-size information of individual experiments to determine if a variant is rare. For example, a variant with 1% MAF will be considered rare in an experiment when n = 2000 and common when n = 10,000. There were at least 250,000 rare variants with MAF < 1/√(2n) for randomly generated data at sample sizes n = 1000, 2500, 5000, 7500, and 10,000. These variants were truncated to obtain subsets of the data with different numbers of total variants d in various simulation scenarios. We randomly generated phenotypes in each experiment from the Normal distribution N(∑_{j=1}^{s} A_j G_ij, σ²), where s is the number of causal variants, A_j is the effect size of the jth locus, and σ is the noise level fixed at 1. We selected the first s variants as causal so that the causal variants in different simulation scenarios are nested. As in previous studies, we set the effect sizes A_j = C⋅|log10(MAF_j)| for variants j = 1, …, s and 0 otherwise. [24] Thus, a continuum of effect sizes can be shown by varying the effect-size multiplier C.
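The simulation design above can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the function names are hypothetical, the rare-variant cut-off is assumed to be 1/√(2n) (which agrees with the 1% MAF example in the text), and genotypes are drawn independently per locus rather than from COSI haplotypes, so LD is ignored.

```python
import math
import random


def rare_threshold(n):
    """Sample-size-dependent MAF cut-off (assumed here to be 1/sqrt(2n))."""
    return 1.0 / math.sqrt(2.0 * n)


def effect_sizes(mafs, s, C):
    """A_j = C * |log10(MAF_j)| for the first s variants and 0 otherwise."""
    return [C * abs(math.log10(m)) if j < s else 0.0
            for j, m in enumerate(mafs)]


def simulate(n, mafs, s, C, sigma=1.0, seed=0):
    """Draw additive genotypes in {0, 1, 2} and Normal phenotypes.

    Assumption: the two alleles at each locus are independent Bernoulli(MAF)
    draws, which ignores the LD structure of the haplotype-based design.
    """
    rng = random.Random(seed)
    A = effect_sizes(mafs, s, C)
    G, y = [], []
    for _ in range(n):
        # Number of minor alleles carried at each locus.
        g = [(rng.random() < m) + (rng.random() < m) for m in mafs]
        # Phenotype mean is the sum of effect size times genotype.
        mean = sum(a * gij for a, gij in zip(A, g))
        G.append(g)
        y.append(rng.gauss(mean, sigma))
    return G, y, A
```

Under this assumed cut-off, a variant with 1% MAF falls below `rare_threshold(2000)` and above `rare_threshold(10000)`, matching the example in the text.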

The AFNC was compared with the Bonferroni and FDR controls, which are the most commonly used procedures for adjusting multiplicity in genomic studies. Bonferroni controls the family-wise type I error [38], whereas FDR controls the expected proportion of false positives among all discoveries [41]. Both essentially focus on the control of false positives with FDR being less stringent than the Bonferroni. The Bonferroni and FDR threshold levels were both set at 0.05. The AFNC threshold levels were set at a false-positive rate of α = 0.05 and a false-negative rate of β = 0.1. When estimating π in Step (ii) of AFNC (Eq 4), the cd values, obtained from Eq 5, are 0.0488, 0.0305, 0.0150, and 0.0095 for d = 10,000, 25,000, 100,000, and 250,000, respectively, based on M = 10,000 randomly generated samples under the global null hypothesis of no causal variants.
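For reference, the two false-positive procedures used as comparators can be sketched as follows. This is a generic textbook implementation of the Bonferroni correction and the Benjamini-Hochberg step-up rule, not code from the paper; the function names are illustrative.

```python
def bonferroni_select(pvals, level=0.05):
    """Indices of variants with p <= level / d (family-wise error control)."""
    d = len(pvals)
    return [j for j, p in enumerate(pvals) if p <= level / d]


def bh_select(pvals, level=0.05):
    """Indices selected by the Benjamini-Hochberg step-up procedure.

    Finds the largest rank k with p_(k) <= k * level / d and selects the
    k variants with the smallest p-values.
    """
    d = len(pvals)
    order = sorted(range(d), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank * level / d:
            k = rank
    return sorted(order[:k])
```

Because the Benjamini-Hochberg threshold grows with the rank while the Bonferroni threshold stays at level/d, the FDR selection always contains the Bonferroni selection at the same level.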

For succinct presentation, we compared the AFNC with the Bonferroni and FDR using the Wald test. In the following, we illustrate that the AFNC can perform well, even though significance rankings based on the Wald test may not be optimal. Performances were comprehensively evaluated via sensitivity, specificity, and g-measure [56], and success rates of inclusion of a high proportion of causal rare variants. Sensitivity is defined as the proportion of causal variants that were correctly identified and provides the empirical power for s > 0 causal variants. Specificity is the proportion of noncausal variants that were correctly rejected. Under the global null hypothesis when all variants are noncausal (i.e., C = 0), the empirical type I error rate or false-positive rate is defined as 1 − specificity. The g-measure, defined as √(sensitivity × specificity), is a composite performance measure of overall variant selection. [56, 57] A g-measure close to 1 indicates accurate variant selection, and a g-measure close to 0 implies that few causal variants or too many noncausal ones are selected, or both. Each experimental scenario was randomly simulated 100 times. Median results are shown for sensitivity, specificity, and g-measure, whereas success rates of inclusion of at least a given proportion of causal variants were computed based on the 100 repetitions.
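The performance measures described above can be computed from a selected set and the known causal set, as in the following sketch. The g-measure is taken here to be the geometric mean of sensitivity and specificity, which is an assumption on our part; the function and key names are illustrative.

```python
import math


def selection_metrics(selected, causal, d):
    """Sensitivity, specificity, and g-measure for one simulated data set.

    selected, causal: iterables of variant indices; d: total number of variants.
    Assumption: g-measure = sqrt(sensitivity * specificity).
    """
    selected, causal = set(selected), set(causal)
    tp = len(selected & causal)   # causal variants that were selected
    fp = len(selected - causal)   # noncausal variants that were selected
    fn = len(causal - selected)   # causal variants that were missed
    tn = d - tp - fp - fn         # noncausal variants correctly left out
    sens = tp / len(causal) if causal else float("nan")
    spec = tn / (d - len(causal)) if d > len(causal) else float("nan")
    return {
        "sensitivity": sens,
        "specificity": spec,
        "false_positive_rate": 1.0 - spec,
        "g_measure": math.sqrt(sens * spec),
    }
```

Under the global null there are no causal variants, so sensitivity and the g-measure are undefined (returned as NaN) and only the false-positive rate is meaningful.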

Comparison across different effect sizes and numbers of variants.

We evaluated performances across varying numbers of total variants d and effect-size multipliers C. We considered s = 50 variants, which are causal when C ≠ 0. Experiments were conducted with n = 2000 samples.

Fig 2 presents results of sensitivity, specificity, and g-measure. The AFNC consistently dominates the FDR and Bonferroni across numbers of variants d and effect-size multipliers C in terms of sensitivity or empirical power for C ≠ 0. Success rates of including at least a given proportion of the s causal variants are presented in S1 Fig. AFNC successfully selects at least 75% of causal variants when C is relatively large, whereas FDR and Bonferroni usually cannot select a large proportion of causal variants, especially for large d. In fact, the Bonferroni fails to select more than 75% of causal variants in all scenarios. This suggests the advantage of considering false-negative control procedures over false-positive ones for including causal rare variants.

Fig 2. Comparisons across varying effect sizes and numbers of variants at s = 50.

Performance of AFNC, FDR, and Bonferroni is evaluated in terms of sensitivity, specificity, and g-measure. Results are shown for s = 50 number of causal variants when C ≠ 0 and n = 2000 number of samples.

https://doi.org/10.1371/journal.pcbi.1004993.g002

AFNC underperforms the Bonferroni and FDR in terms of specificity in Fig 2. Nonetheless, AFNC consistently dominates the Bonferroni and FDR in terms of overall performances with the g-measure, especially at d large. This suggests that the AFNC can improve overall variant-selection performance in large-scale sequencing studies. Specifically, the AFNC, at the cost of mildly increased but controlled false positives, provides dramatic reduction in the number of candidate variants while retaining a high proportion of causal ones for follow-up analysis. However, variant screening with the AFNC comes with a cost. Although AFNC selects a small proportion of variants, the actual number of selected variants can be large in high dimensions, which can result in severely lower precision (i.e., the proportion of true positives among those selected, TP/R) compared with the Bonferroni and FDR.

Table 2 presents empirical type I error rates at the global null hypothesis C = 0 when no variants are causal. The AFNC is shown to control type I error rates well at below α = 0.05. This is due to the adaptivity of the AFNC procedure that allows it to accommodate varying proportions of causal variants (see Methods section). On the other hand, Bonferroni and FDR have type I error rates at 0, suggesting that they are far too conservative for rare-variant association studies.

Table 2. Empirical type I error rates across varying numbers of variants.

https://doi.org/10.1371/journal.pcbi.1004993.t002

We repeated the same evaluation with s = 25 variants, which are causal when C ≠ 0. Results are presented in S2 Fig (for sensitivity, specificity, and g-measure) and S3 Fig (for success rates of inclusion). The relative performance among AFNC, FDR, and Bonferroni is similar to what is observed for s = 50.

Comparison across different sample sizes and numbers of causal variants.

We compared performances across different sample sizes n and numbers of causal variants s. An effect-size multiplier C = 0.5 is considered at d = 100,000 total number of variants.

Fig 3 shows that the AFNC consistently outperforms the FDR and Bonferroni across numbers of causal variants s and sample sizes n in terms of sensitivity or empirical power. Success rates of inclusion are shown in S4 Fig, where the AFNC can select at least 75% of causal variants for sample size n large. The FDR and Bonferroni usually select a small proportion of causal variants with the Bonferroni consistently selecting less than 50% of causal variants in nearly all scenarios. Due to low MAFs, selection of causal variants is more difficult for rare variants at small sample sizes. For example, at n ≤ 2500, the procedures usually cannot identify more than 90% of all causal variants. Fig 3 shows that the AFNC dominates the FDR and Bonferroni for overall variant selection in terms of g-measure with underperformance in terms of specificity. Moreover, S1 Table presents empirical type I error rates at varying sample sizes n, where the AFNC is shown to control type I error rates at 0.05 while the FDR and Bonferroni are overwhelmingly over-conservative with type I error rates at 0. S5 and S6 Figs further present results at C = 0.25, where the AFNC is shown to be even more advantageous at smaller effect sizes.

Fig 3. Comparisons across varying sample sizes and numbers of causal variants at C = 0.5.

Performance of AFNC, FDR, and Bonferroni is evaluated in terms of sensitivity, specificity, and g-measure. Results are shown for the effect-size multiplier C = 0.5 and d = 100,000 number of variants.

https://doi.org/10.1371/journal.pcbi.1004993.g003

Analysis of CoLaus cardiovascular diseases dataset

We considered the Cohorte Lausannoise (CoLaus) sequence study [58–61], where almost 6000 unrelated Caucasian residents of Lausanne, Switzerland were assessed for risk factors of cardiovascular diseases (CVD). Targeted sequencing genotypes on 202 drug-targeted genes (human genome build 36) were obtained for n = 1769 of these subjects. Cholesterol levels were collected for each subject to evaluate risk of CVD, along with 12 clinical factors: age, gender, and 10 ethnicity covariates using the first 10 principal components [62]. We considered d = 9665 autosomal rare variants from the sequencing study with MAF < 1/√(2n).

For each variant, a t-statistic was obtained by linear association with log cholesterol levels as the response while adjusting for the 12 clinical covariates. The AFNC, FDR, and Bonferroni were then applied on significances of t-statistics to identify potentially causal variants. At threshold levels of 0.05, Bonferroni and FDR only identified 4 variants. At α = 0.05 and β = 0.1, AFNC identified 56 candidate rare variants. The AFNC algorithm obtained cd = 0.0494, based on M = 10,000 randomly generated samples under the global null of no causal variants, and the corresponding estimate of the signal proportion (Eqs 4 and 5). As CVD tends to be influenced by multiple factors [63, 64] and the study focused on genes having clinical relevance, one expects a larger number of causal variants than those identified by the FDR and Bonferroni. Our estimated number of signals suggests that at least 18 variants need to be selected, and potentially many more due to signals dispersed in the Indistinguishable region, to encompass a high proportion of causal variants. That is, false-positive control procedures can be much too conservative in NGS studies, where the Signals region tends to be degenerate (see Fig 1). In the following, we illustrate potential applications of the AFNC for pinpointing individual variants in candidate genes and inferring disease-related genes with annotation information.
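The per-variant test described above, a linear regression of the response on one variant plus covariates, can be sketched in plain Python as follows. This is an illustrative ordinary least squares implementation and not the authors' code; the function names are hypothetical, and the two-sided p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples such as n = 1769.

```python
import math


def solve_inverse(M):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(M)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(M)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def variant_t_test(y, g, covariates):
    """t-statistic and approximate p-value for one variant.

    y: responses; g: genotypes in {0, 1, 2}; covariates: one list per subject.
    The design matrix is [intercept, genotype, covariates...].
    """
    n = len(y)
    X = [[1.0, float(g[i])] + [float(c) for c in covariates[i]]
         for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = solve_inverse(XtX)
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    rss = sum((y[i] - sum(X[i][a] * beta[a] for a in range(k))) ** 2
              for i in range(n))
    # Standard error of the genotype coefficient (index 1 in the design).
    se = math.sqrt(rss / (n - k) * inv[1][1])
    t = beta[1] / se
    # Normal approximation to the two-sided t-test p-value (assumption).
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p
```

Applying `variant_t_test` to each of the d variants in turn yields the p-values that are then ordered and passed to a multiple-testing procedure.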

Pinpointing individual variants in candidate genes for follow-up analysis.

To obtain a set of candidate genes, we conducted gene-based analysis using the SKAT with the linear kernel and variant weights 1/MAF. [24] The SKAT performs gene-level analyses via a variance component test. The SKAT with the linear kernel is equivalent to the SimReg [26] and the sum of squared scores [20–23] tests. Gene-based analysis did not identify any significant gene when controlling the FDR at the 0.05 threshold level. For illustrative purposes, we focused on the top 5 genes (APH1A, TRPM8, SLC10A2, SP110, SIRT6) with gene-set p-values <0.01. These genes have been related to CVD in the literature. [65–74]

Table 3 presents variants selected in the top 5 candidate genes by the AFNC, along with their p-values and annotation information. The Bonferroni and FDR only selected 2 variants, chr1_148504677 from APH1A and chr2_234559154 from TRPM8. They did not identify any variant from SP110 and SIRT6. Both are relevant genes, where SP110 has been associated with venous obstruction [67] and SIRT6 has been known for its therapeutic potential towards the prevention of CVD [72–74]. Moreover, TRPM8, from which the FDR and Bonferroni only identified a single variant, regulates functions of the pulmonary artery via complex systems. [68–70] No individual variants were selected from SLC10A2, whose most significant variant has a p-value of 6.32 × 10−3.

Table 3. Annotation of AFNC-selected variants of candidate genes in the analysis of CoLaus data.

https://doi.org/10.1371/journal.pcbi.1004993.t003

The AFNC, based on global hypothesis tests, provides an objective selection of a modest number of potentially causal variants at the single-locus level. Investigators may further prioritize variants using annotation information. For example, in Table 3, one may first target variants at non-synonymous coding and splice sites that can disrupt protein structures before analyzing 3’/5’ UTR and downstream/upstream variants that may regulate gene expression. [75] Synonymous coding and intron variants may also impact gene expression, protein folding, and fitness. [76–78] Nonetheless, they are usually considered as low-priority and may represent irrelevant variants that were mixed indistinguishably with the causal ones due to extremely low MAF and high dimensionality.

Inferring disease-related genes with single-variant results.

Gene-based analysis using variants pooling can sometimes result in limited power due to the inclusion of many noncausal variants. For example, gene-set analysis using the SKAT did not identify any candidate genes in this study when controlling the FDR at the 0.05 level on gene-set p-values for risk of CVD, a multifaceted disease. To provide an alternative approach, we consider the utilization of single-variant results to infer candidate genes. Specifically, among the 56 AFNC variants, we further focus on non-synonymous and splice-site variants that are often considered as prime candidates for causal variants due to their capacity to influence protein coding and structure. [75] Table 4 presents non-synonymous and splice-site variants selected. The Bonferroni and FDR only selected a single variant, chr2_234559154 from TRPM8, whereas the AFNC selected 16 variants from 14 genes. The number of non-synonymous and splice-site variants selected by AFNC is of the same magnitude as our estimated number of causal variants. SP110 and TRPM8, each of which contains 2 AFNC-selected non-synonymous and splice-site variants, have been related to venous obstruction [67] and pulmonary functions [68–70], respectively. Moreover, genes with an AFNC-selected non-synonymous or splice-site variant have been associated with CVD (BRD2 [79], CNR2 [80–82], KCNN4 [83–86], MME [87, 88], NLRP1 [89], SDHB [90], TACR3 [91, 92], TNNI3K [93–95]) or related conditions, such as diabetes (CLEC16A [96]), obesity (OPRM1 [97, 98]), chronic obstructive pulmonary disease (PDE4A [99–102]), and diabetic peripheral neuropathy (SCN9A [103, 104]). The full annotation of FDR- and AFNC-selected variants is shown in S2 Table.

Table 4. Annotation of AFNC-selected non-synonymous and splice-site variants in the analysis of CoLaus data.

https://doi.org/10.1371/journal.pcbi.1004993.t004

Comparison with Bonferroni and FDR at varying control levels.

Table 5 presents numbers of variants selected by the Bonferroni and FDR at different control levels. The Bonferroni, based on the stringent family-wise type I error rate, cannot select more than 10 variants even at the maximum control level of 1. That is, when more than 10 variants are selected, a false positive will almost surely be included with probability 1. In this particular analysis, FDR at the 0.55 control level can select the 56 variants obtained by the AFNC at α = 0.05 and β = 0.1. However, we note that the FDR control level corresponding to the AFNC is not invariant and can vary dramatically across different studies. Intuitively, a larger (or smaller) FDR control level would be needed when the Indistinguishable region is larger (or smaller) (see Fig 1), and this cannot be determined a priori.
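The relationship between a target selection size and the FDR control level needed to reach it can be made concrete with a short sketch. Assuming the FDR is implemented with the Benjamini-Hochberg step-up rule (an assumption, since the text does not name the variant used), the smallest level that selects at least the top k variants is the minimum of p_(i)·d/i over ranks i ≥ k. The function names below are illustrative.

```python
def bonferroni_count(pvals, level):
    """Number of variants with p <= level / d.

    At the maximum level of 1 the threshold is still 1/d, which is why the
    Bonferroni selection stays small no matter how the level is relaxed.
    """
    d = len(pvals)
    return sum(p <= level / d for p in pvals)


def min_bh_level_for_top_k(pvals, k):
    """Smallest Benjamini-Hochberg level that selects at least k variants.

    BH selects the top i variants whenever p_(i) <= i * level / d for some
    rank i, so at least k variants are selected once the level reaches
    min over i >= k of p_(i) * d / i.
    """
    d = len(pvals)
    ordered = sorted(pvals)
    return min(ordered[i - 1] * d / i for i in range(k, d + 1))
```

The returned level depends on the whole tail of ordered p-values beyond rank k, which illustrates why the FDR level matching a given AFNC selection cannot be fixed in advance.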

Table 5. Number of variants selected in the analysis of CoLaus data at different control levels.

https://doi.org/10.1371/journal.pcbi.1004993.t005

Discussion

We have proposed a novel bioinformatic approach that allows the identification of individual rare variants in large-scale sequencing association studies. Extensive studies based on simulated data generated with COSI at realistic population parameters have been used to compare our method with the Bonferroni and FDR across various scenarios. [54] Results have suggested that the AFNC can provide informative variant selection by including a large proportion of causal variants while avoiding a deluge of noncausal ones. On the other hand, the Bonferroni and FDR are shown to be excessively over-conservative under extremely low MAFs and high dimensionality. Analyses of the CoLaus dataset for cardiovascular diseases using the AFNC have pinpointed individual variants most responsible for explaining significances of genes identified in gene-level aggregation tests. Moreover, single-variant results have been successfully applied to objectively infer potentially relevant genes when cross-referenced with annotation information. The R package ‘AFNC’ for performing the AFNC is publicly and freely available at https://github.com/zjdaye/AFNC or http://sites.google.com/site/zhongyindaye/software.

The AFNC provides a unified framework that accommodates a wide spectrum of models, test statistics, and data scenarios. To achieve a succinct presentation, we focused on quantitative traits using p-values obtained from linear association tests in this paper. The AFNC can be easily adapted for case-control studies [23–25, 105], family-structured data [106, 107], and many other scenarios. Moreover, empirical p-values, such as those from permutation or bootstrap, can be employed for improved significance ranking. [108] Clearly, performance results of the AFNC using p-values based on associations with quantitative traits, shown in this paper, can be extended to those obtained under a spectrum of models and data scenarios. Moreover, the analysis of large-scale genomic data is a dynamic and fast-evolving field. The AFNC, which readily adapts to the quality of the statistical tests employed, will be able to provide increasingly efficient inclusion of causal variants as ever more accurate and computationally efficient means for assessing significance are developed.

A few very recent works have sought to identify individual rare variants by incorporating prior-knowledge information in statistical inference. [109, 110] These methods typically upweight individual variants predicted to be most likely causal based on prior GWA studies, functional annotation, sequence conservation, and other computational means. The AFNC can be readily utilized with models and test statistics that incorporate biological prior knowledge. In the Results section, we illustrated an alternative way to incorporate this bioinformatic knowledge. Specifically, we started with an agnostic interrogation of each variant and obtained a set of statistically promising variants using AFNC. We then compared the selected variants with prior-knowledge information to allow investigators to form educated hypotheses in designing follow-up studies. Statistically promising variants, which are selected objectively by AFNC, can also be explored in follow-up studies without comparison with annotation information, such as when prior knowledge is not available for novel variants or is believed to be inaccurate.

Due to extremely low MAFs, rare variants do not usually exhibit strong linkage disequilibrium. [1, 111] Thus, we designed the AFNC for rare variants association studies, in which dependence among test statistics is assumed to be weak. The AFNC procedure is also applicable in the situation when causal variants are dependent, but noncausal variants are independent. [112] In other applications where noncausal genetic factors are expected to be strongly dependent, the AFNC procedure can be adapted to account for arbitrary dependence using several recent techniques for multiple testing. [113, 114]

One potential limitation of AFNC is that it may underperform when the signal intensity of the causal variants is too low. The signal intensity of a causal variant depends on the effect sizes and sample size. As shown in Figs 2 and 3, the sensitivity of AFNC deteriorates as effect size or sample size becomes smaller. Indeed, low effect sizes and small sample size are fatal limitations to all methods. In single-variant analysis of rare variants, such challenges may arise from identifying the extremely rare causal variants (e.g., singletons in the data). Although effect size is believed to be high for rare causal variants, the overall signal intensity may still be low given the extremely low sample size. Under this scenario, gene-based tests coupled with functional annotation would have better potential to identify these causal variants. Therefore, gene-based tests, functional annotation and AFNC should be used in an integrated fashion in the detection of rare causal variants: as we have illustrated in our analysis of the CoLaus data, AFNC coupled with gene-based tests can help to pinpoint potential causal variants that lead to gene-level significance; AFNC coupled with functional annotation can help to identify causal genes that are insignificant at gene level due to a few causal variants mixed with a large number of noncausal variants; finally, gene-based tests coupled with functional annotation can facilitate the identification of extremely rare causal variants.

Recent developments in the multiple testing literature have introduced the false nondiscovery rate (Fndr). [115–117] We note that this is quite different from the AFNC control procedure. The Fndr controls the expectation of the proportion FN/(d − R), which does not involve the number of causal variants s (see Table 1). Moreover, this is not a sensitive measure and will be very close to zero in large-scale NGS studies, as the number of variants that are not selected, d − R, will be very large. On the other hand, the AFNC, based on the proportion FN/s, allows robust variant selection in large-scale sequencing studies, as the number of causal variants s is expected to be small and the proportion FN/s is responsive to changes in the number of false negatives. In S7 Fig, we compared the AFNC with the Fndr at a threshold level of β = 0.1. Results suggest that the AFNC dominates the Fndr in terms of the overall g-measure and that the Fndr performs poorly in terms of specificity.
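A small numerical illustration may help show why the two proportions behave so differently. The sketch below evaluates FN/(d − R) and FN/s for a single hypothetical selection; the function name and all numbers are assumptions chosen for illustration, and a single realization is shown rather than the expectation that defines the Fndr.

```python
def false_negative_proportions(d, s, selected, false_negatives):
    """Return (FN / (d - R), FN / s) for one hypothetical selection."""
    not_selected = d - selected
    return false_negatives / not_selected, false_negatives / s


# Hypothetical study: d variants, s causal variants, R variants selected.
d, s, R = 100_000, 50, 60
for fn in (5, 25, 45):
    fndr_like, missed_share = false_negative_proportions(d, s, R, fn)
    print(fn, round(fndr_like, 6), missed_share)
```

Even when 45 of the 50 causal variants are missed, the first proportion stays below 0.001, while FN/s moves from 0.1 to 0.9.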

Innovative technological advances have imposed new bioinformatic and statistical challenges by introducing genomic data at ever increasing resolution and dimensions. The proliferation of GWA studies in the last decade has largely led to the development and adaptation of the FDR as a conventional genomic tool. [42–46] In this paper, we introduced the AFNC to enable the identification of rare variants in large-scale sequencing studies. It is computationally efficient for applications in WGS and WES studies and can provide informative results for investigators charged with the task of analyzing large-scale sequencing studies.

Methods

Adaptive false-negative control of individual rare variants

The proposed procedure is general and can accommodate a spectrum of models and significance tests. Suppose that we have test statistics for each variant, $T_1(G, Z), T_2(G, Z), \ldots, T_d(G, Z)$, based on subjects $i = 1, 2, \ldots, n$, where $G = \{G_{ij}\}$ is the matrix of genotypes across all variants $j = 1, 2, \ldots, d$ and $Z = \{Z_{ik}\}$ is the matrix of additional covariates, covering clinical factors and prior biological knowledge, $k = 1, \ldots, K$. Examples of $T_j(G, Z)$ include the classical t-test statistic that depends only on genotypes of the jth variant and the local FDR statistic that utilizes genotypes across all variants in an empirical Bayes construction. [108] Further, prior knowledge from functional annotation can be incorporated, such as by using a generalized linear mixed-effects model. [110] We assume that the test statistic $T_j(G, Z)$ for $j = 1, 2, \ldots, d$ is drawn from the mixture distribution

$$T_j(G, Z) \sim (1 - \pi) F_0 + \pi F_1, \tag{1}$$

where $\pi = s/d$ is the signal proportion, $s$ is the number of causal variants, $F_0$ is the null distribution of $T_j(G, Z)$ when the jth variant is noncausal, and $F_1$ is the alternative distribution when the jth variant is causal. [52, 53, 118] Let $T_{(1)}(G, Z) \ge T_{(2)}(G, Z) \ge \ldots \ge T_{(d)}(G, Z)$ be the ordered test statistics in decreasing order of significance.

To evaluate false negatives in NGS studies, we introduce the signal missing rate (SMR) for selecting the top j ranked variants as

$$\mathrm{SMR}_\epsilon(j) = P\left(\mathrm{FN}(j)/s > \epsilon\right), \tag{2}$$

where FN(j) is the number of causal variants missed by selecting the top j ranked variants and ϵ > 0 is a small constant. The SMR can be interpreted as the probability of neglecting at least a small proportion of causal variants when the top j ranked variants are selected. By controlling the SMR, potentially causal variants can be included from both the Signals and Indistinguishable regions while dispensing with a very large number of irrelevant variants in the Noise region (see Fig 1). Compared to another possible measure of false negatives, P(FN(j) > 0), the SMR provides a more liberal control as it allows some, instead of zero, false negatives. The SMR is also substantially different from the control of the false nondiscovery rate (Fndr), which is an analog of the FDR in terms of false negatives. The Fndr is defined as the expectation of the proportion of false negatives among the accepted null hypotheses. [115, 119] In the analysis of data with extremely high dimensions and a relatively small number of causal variants, the Fndr is very close to zero and hence not an informative measure.
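As an illustration of the definition, the sketch below computes the missed share FN(j)/s for a given ranking and estimates the signal missing rate empirically as the fraction of replicate rankings in which that share exceeds ϵ. It assumes the SMR is read as P(FN(j)/s > ϵ); the function names and the toy rankings are hypothetical.

```python
def missed_fraction(ranking, causal, j):
    """FN(j) / s: share of causal variants not among the top j of the ranking."""
    top = set(ranking[:j])
    return sum(1 for v in causal if v not in top) / len(causal)


def empirical_smr(rankings, causal, j, eps):
    """Fraction of replicate rankings with FN(j) / s > eps."""
    hits = sum(1 for r in rankings if missed_fraction(r, causal, j) > eps)
    return hits / len(rankings)


# Toy example: variants 0-3 are causal, 4-7 are noncausal.
causal = {0, 1, 2, 3}
rankings = [
    [0, 1, 2, 3, 4, 5, 6, 7],  # all causal variants ranked first
    [0, 1, 4, 2, 5, 3, 6, 7],  # one causal variant falls to rank 6
    [4, 0, 1, 5, 6, 2, 7, 3],  # two causal variants fall below rank 5
]
print(empirical_smr(rankings, causal, j=5, eps=0.1))
```

With j = 5, two of the three toy rankings miss more than 10% of the causal variants, so the empirical rate is 2/3; selecting all 8 variants brings it to 0.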

To provide informative analysis of rare variants in NGS studies, we propose the adaptive false-negative control (AFNC) procedure as follows.

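  1. Obtain ordered p-values from the test statistics $T_{(1)}(G, Z) \ge T_{(2)}(G, Z) \ge \ldots \ge T_{(d)}(G, Z)$ such that $p_{(1)} \le p_{(2)} \le \ldots \le p_{(d)}$.
  2. Compute an estimate $\hat{\pi}$ of the signal proportion and compute $\hat{s} = \hat{\pi} d$.
  3. Retain the top $T_{\mathrm{fn}}$ variants with $$T_{\mathrm{fn}} = \begin{cases} \hat{s}, & \hat{s} \le t_\alpha, \\ \hat{s} + \min\{ j \ge 1 : p_{(\hat{s} + j)} \le F^{-1}_{(j)}(\beta) \}, & \hat{s} > t_\alpha, \end{cases} \tag{3}$$ where $F^{-1}_{(j)}$ is the inverse cumulative distribution function of the jth ordered p-value among the null (i.e., noncausal) variants, $p^0_{(j)}$; $p^0_{(j)}$ follows the Beta distribution with parameters $j$ and $d - \hat{s} - j + 1$; $t_\alpha$ is the cut-off position of the Bonferroni procedure at significance level α; and β is a pre-fixed level for controlling the signal missing rate. We set α and β at conventional levels of 0.05 and 0.1, respectively, in this paper. A smaller value of β corresponds to more stringent control of false negatives.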

Step 2. Estimating π.

To estimate the signal proportion π in Step 2, we employ the efficient estimator [50], based on empirical processes of p-values,

$$\hat{\pi} = \max_{1 \le j \le c_0 d} \frac{j/d - p_{(j)} - c_d \sqrt{p_{(j)}(1 - p_{(j)})}}{1 - p_{(j)}}, \tag{4}$$

where $0 < c_0 \le 1$ is pre-fixed to accelerate the algorithm for large d by searching through only a fraction $c_0$ of the ranked variants. Conceptually, Eq 4 seeks the largest difference between the observed, ordered p-value (i.e., $p_{(j)}$) and the expected quantile under the global null (i.e., $j/d$). The largest difference typically occurs among the top proportion of the ranked p-values, as causal variants tend to have small p-values. To ensure that a sufficient number of top-ranked variants is examined (so that the speed-up has little impact on the results), we set $c_0 d$ to be sufficiently large, i.e., at least 5000 or d/10, or equivalently, $c_0 d = \max\{5000, d/10\}$. The quantity $c_d > 0$ is pre-computed empirically to control the type I error rate under the global null hypothesis that no causal variants exist. Specifically, we randomly simulate M sets of p-values, $p^m_1, \ldots, p^m_d$, from the uniform distribution under the global null hypothesis for $m = 1, 2, \ldots, M$. For set m, we order the p-values to obtain $p^m_{(1)} \le \ldots \le p^m_{(d)}$, standardize them, and compute $V_m$ by taking the maximum, i.e.,

$$V_m = \max_{1 \le j \le d} \frac{\left| j/d - p^m_{(j)} \right|}{\sqrt{p^m_{(j)}(1 - p^m_{(j)})}}. \tag{5}$$

Then, $c_d$ is obtained as the (1 − α) quantile of the extreme values $V_m$. Estimation of the signal proportion has been rigorously evaluated in the statistical literature. [50, 51, 120] In particular, under high dimensionality, statistical consistency of the estimator in Eq 4 does not depend on strict normality assumptions, and the estimator can be expected to perform well even when the proportion of causal variants π is very small. [50] It readily adapts to the underlying sparsity of the data in large-scale association studies.
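The two computations in this step can be sketched as follows. This is a minimal illustration and not the code of the 'AFNC' R package: it assumes the estimator takes the form max over j of (j/d − p(j) − c_d·sqrt(p(j)(1 − p(j)))) / (1 − p(j)) and that V_m is the maximum of |j/d − p(j)| / sqrt(p(j)(1 − p(j))) over simulated uniform p-values. The function names, the truncation of the estimate at zero, the order-statistic quantile rule, and the toy data are all assumptions.

```python
import math
import random


def estimate_signal_proportion(pvalues, c_d, c0=1.0):
    """Largest standardized excess of j/d over the ordered p-value p_(j)."""
    d = len(pvalues)
    ordered = sorted(pvalues)
    limit = min(d, max(1, int(c0 * d)))  # search only the top c0 * d ranks
    best = 0.0  # assumption: a negative estimate is truncated at zero
    for j in range(1, limit + 1):
        p = ordered[j - 1]
        if not 0.0 < p < 1.0:
            continue  # skip degenerate p-values to avoid division by zero
        value = (j / d - p - c_d * math.sqrt(p * (1.0 - p))) / (1.0 - p)
        best = max(best, value)
    return best


def calibrate_c_d(d, n_sets=200, alpha=0.05, seed=1):
    """(1 - alpha) quantile of V_m over simulated global-null p-value sets."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sets):
        ordered = sorted(rng.random() for _ in range(d))
        v = max(abs(j / d - p) / math.sqrt(p * (1.0 - p))
                for j, p in enumerate(ordered, start=1) if 0.0 < p < 1.0)
        maxima.append(v)
    maxima.sort()
    index = min(n_sets - 1, math.ceil((1.0 - alpha) * n_sets) - 1)
    return maxima[index]


# Toy data: 50 very small "causal" p-values among 950 uniform null p-values.
rng = random.Random(7)
d, s = 1000, 50
pvals = [1e-8 * (i + 1) for i in range(s)] + [rng.random() for _ in range(d - s)]
c_d = calibrate_c_d(d)
pi_hat = estimate_signal_proportion(pvals, c_d)
print(round(pi_hat, 3), round(pi_hat * d))
```

In practice the calibration of c_d depends only on d and α, so it can be computed once and reused across analyses of the same dimension.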

Step 3. Obtaining the AFNC cut point $T_{\mathrm{fn}}$.

The AFNC procedure evaluates statistical significance along the ordered p-values and retains the top $T_{\mathrm{fn}}$ variants, with the cut-off position $T_{\mathrm{fn}}$ given by Eq 3, as important variants. When $\hat{s} \le t_\alpha$, Eq 3 simplifies to $T_{\mathrm{fn}} = \hat{s}$ (which is $\le t_\alpha$). In this case, if $\hat{s} > 0$, the Bonferroni cut-off position $t_\alpha$ already encompasses the estimated number of causal variants. Such scenarios occur when the effect sizes are so strong that the Indistinguishable region degenerates in Fig 1 and nearly all causal variants can be identified in the Signals region. If $\hat{s} = 0$, all variants are expected to be noncausal, which occurs under the global null hypothesis when both the Signals and Indistinguishable regions degenerate.

The more interesting scenario of $\hat{s} > t_\alpha$ occurs in NGS studies of rare variants when the Signals region is very small or degenerates and the Indistinguishable region may harbor causal variants. In this case, we need to search further along the ordered test statistics, bypass some of the noncausal variants in the Indistinguishable region, and then stop when the number of false negatives is small relative to the total number of causal variants. The search starts at position $\hat{s} + 1$ and ends at $\hat{s} + j^*$, where $j^*$ is the smallest $j$ such that the observed p-value, $p_{(\hat{s} + j)}$, is no greater than the β-th quantile, $F^{-1}_{(j)}(\beta)$, of the j-th ordered p-value, $p^0_{(j)}$, among the null variants. The rationale is that when not all causal variants rank before position $\hat{s} + j$, the number of noncausal variants among the top $\hat{s} + j$ variants, denoted by $j'$, would be greater than $j$. Then the observed $p_{(\hat{s} + j)}$, which is in essence $p^0_{(j')}$, would be greater than $p^0_{(j)}$. In other words, $p_{(\hat{s} + j)} > p^0_{(j)}$ is implied by the event that the top $\hat{s} + j$ variants still do not contain all causal variants. Therefore, our search should continue until the first time $p_{(\hat{s} + j)} \le p^0_{(j)}$. In the extremely ideal case, one would wish that no causal variant is missed. In real practice, we set the cut-off position by looking for the $j$ such that $p_{(\hat{s} + j)}$ is less than or equal to the β-th quantile of $p^0_{(j)}$, to achieve a better balance between a small false-negative proportion and a reasonable total number of variants selected. When this event occurs (i.e., $p_{(\hat{s} + j)} \le$ the β-th quantile of $p^0_{(j)}$), the AFNC threshold asymptotically controls $\mathrm{SMR}_\epsilon$ at level β for an arbitrarily small constant ϵ (i.e., ϵ does not change with the total number of variants d).
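The search described above can be sketched in a few lines. This is an illustrative reading and not the code of the 'AFNC' R package: it assumes the rule stops at the first j for which the (ŝ + j)-th ordered p-value is no greater than the β-th quantile of the j-th order statistic of d − ŝ uniform null p-values. Instead of inverting a Beta distribution, the sketch uses the standard identity that the j-th order statistic of m uniform variables is at most x exactly when at least j of them are at most x, so its distribution function equals a binomial upper tail. The function names and toy p-values are hypothetical, and the plain loop is quadratic in d rather than the O(d log d) stated for the procedure.

```python
import math


def binom_upper_tail(m, x, j):
    """P(Binomial(m, x) >= j), summed from log-scale probability terms."""
    if j <= 0:
        return 1.0
    if j > m or x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    total = 0.0
    for k in range(j, m + 1):
        log_pmf = (math.lgamma(m + 1) - math.lgamma(k + 1)
                   - math.lgamma(m - k + 1)
                   + k * math.log(x) + (m - k) * math.log1p(-x))
        total += math.exp(log_pmf)
    return min(1.0, total)


def afnc_cut(pvalues, s_hat, alpha=0.05, beta=0.1):
    """Cut-off position under the assumed reading of the search rule."""
    d = len(pvalues)
    ordered = sorted(pvalues)
    t_alpha = sum(1 for p in ordered if p <= alpha / d)  # Bonferroni position
    if s_hat <= t_alpha:
        return s_hat
    m = d - s_hat  # assumed number of null variants
    for j in range(1, m + 1):
        # p_(s_hat + j) <= beta-quantile of the j-th null order statistic
        # is equivalent to F_(j)(p_(s_hat + j)) <= beta.
        if binom_upper_tail(m, ordered[s_hat + j - 1], j) <= beta:
            return s_hat + j
    return d  # no stopping point found: retain all variants


# Hypothetical ordered p-values for d = 10 variants.
toy = [1e-6, 0.004, 0.006, 0.05, 0.06, 0.5, 0.6, 0.7, 0.8, 0.9]
print(afnc_cut(toy, s_hat=2), afnc_cut(toy, s_hat=3))
```

With these toy values the Bonferroni position is 2, so an estimate of ŝ = 2 is returned unchanged, whereas ŝ = 3 triggers the search, which stops at the second step.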

In summary, using the cut-off position $T_{\mathrm{fn}}$, AFNC can adaptively encompass a large proportion (1 − ϵ) of the causal variants with high probability (≈ 1 − β). In the case where the causal and noncausal variants are better separated, the cut-off position $T_{\mathrm{fn}}$ of AFNC will become closer to the Bonferroni cut-off position $t_\alpha$. The AFNC procedure controls the signal missing rate with any consistent estimator of π (in this paper, we employ the estimator of Eq 4). Finally, our procedure has a very low computational complexity, O(d log d), and can be applied under extremely high dimensionality for WGS and WES studies.

Supporting Information

S1 Text. Derivation of signal missing rate control.

We measure the false negatives using the signal missing rate (SMR) and show that the SMR at the AFNC cut-off position can be asymptotically controlled at level β.

https://doi.org/10.1371/journal.pcbi.1004993.s001

(PDF)

S1 Fig. Inclusion rate of causal variants across varying effect sizes and numbers of variants at s = 50.

Success rates of including at least 50%, 75%, 90%, and 95% of s variants are examined. Results are shown for s = 50 number of causal variants when C ≠ 0 and n = 2000 number of samples.

https://doi.org/10.1371/journal.pcbi.1004993.s002

(PDF)

S2 Fig. Comparisons across varying effect sizes and numbers of variants at s = 25.

Performance of AFNC, FDR, and Bonferroni is evaluated in terms of sensitivity, specificity, and g-measure. Results are shown for s = 25 number of causal variants when C ≠ 0 and n = 2000 number of samples.

https://doi.org/10.1371/journal.pcbi.1004993.s003

(PDF)

S3 Fig. Inclusion rate of causal variants across varying effect sizes and numbers of variants at s = 25.

Success rates of including at least 50%, 75%, 90%, and 95% of s variants are examined. Results are shown for s = 25 number of causal variants when C ≠ 0 and n = 2000 number of samples.

https://doi.org/10.1371/journal.pcbi.1004993.s004

(PDF)

S4 Fig. Inclusion rate of causal variants across sample sizes and numbers of causal variants at C = 0.5.

Success rates of including at least 50%, 75%, 90%, and 95% of s variants are examined. Results are shown for the effect-size multiplier C = 0.5 and d = 100,000 number of variants.

https://doi.org/10.1371/journal.pcbi.1004993.s005

(PDF)

S5 Fig. Comparisons across varying sample sizes and numbers of causal variants at C = 0.25.

Performance of AFNC, FDR, and Bonferroni is evaluated in terms of sensitivity, specificity, and g-measure. Results are shown for the effect-size multiplier C = 0.25 and d = 100,000 number of variants.

https://doi.org/10.1371/journal.pcbi.1004993.s006

(PDF)

S6 Fig. Inclusion rate of causal variants across sample sizes and numbers of causal variants at C = 0.25.

Success rates of including at least 50%, 75%, 90%, and 95% of s variants are examined. Results are shown for the effect-size multiplier C = 0.25 and d = 100,000 number of variants.

https://doi.org/10.1371/journal.pcbi.1004993.s007

(PDF)

S7 Fig. Comparisons across varying effect sizes and numbers of variants at s = 50 with the Fndr.

Performance of AFNC, FDR, and Bonferroni is compared with that of the Fndr in terms of sensitivity, specificity, and g-measure. Results are shown for s = 50 number of causal variants when C ≠ 0 and n = 2000 number of samples.

https://doi.org/10.1371/journal.pcbi.1004993.s008

(PDF)

S1 Table. Empirical type I error rates across varying sample sizes.

Standard errors are included in parentheses. Results are shown for d = 100,000 number of variants.

https://doi.org/10.1371/journal.pcbi.1004993.s009

(PDF)

S2 Table. Full annotation of AFNC-selected variants in the analysis of CoLaus data.

Gene-set p-values are computed using the SKAT. Genes are sorted in alphabetic order, and variants are sorted by their individual p-values among each gene. Variants marked with (*) are also selected by the FDR.

https://doi.org/10.1371/journal.pcbi.1004993.s010

(PDF)

S1 File. Files for simulations and analysis of CoLaus data.

File “simulation_code.R” contains R code for simulations. SNPs used to generate phenotypes at n = 2000 are included as “snp.n2000.RData”. File “CoLaus_code.R” contains R code for the analysis of CoLaus data.

https://doi.org/10.1371/journal.pcbi.1004993.s011

(ZIP)

S1 Dataset. Single-locus and gene-level p-values used in the analysis of CoLaus data.

Dataset “single_locus_pvalues.txt” contains variant-level p-values used in the analysis of the CoLaus data. Dataset “gene_level_pvalues.txt” contains gene-level p-values computed from the SKAT.

https://doi.org/10.1371/journal.pcbi.1004993.s012

(ZIP)

Acknowledgments

The authors thank Drs. Peter Vollenweider and Gerard Waeber, PIs of the CoLaus study, and Drs. Meg Ehm and Matthew Nelson, collaborators at GlaxoSmithKline for providing the CoLaus phenotype and sequence data.

Author Contributions

Conceived and designed the experiments: XJJ ZJD WL JYT. Performed the experiments: XJJ ZJD JYT. Analyzed the data: XJJ ZJD JYT. Contributed reagents/materials/analysis tools: XJJ ZJD JYT. Wrote the paper: XJJ ZJD WL JYT.

References

  1. 1. Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet. 2001;69:124–137. pmid:11404818
  2. 2. Kryukov GV, Pennacchio LA, Sunyaev SR. Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am J Hum Genet. 2007;80:727–739. pmid:17357078
  3. 3. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008;40:695–701. pmid:18509313
  4. 4. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21. pmid:18987709
  5. 5. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. pmid:19812666
  6. 6. Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004;305:869–872. pmid:15297675
  7. 7. Cohen JC, Boerwinkle E, Mosley TH Jr, Hobbs HH. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N Engl J Med. 2006;354:1264–1272. pmid:16554528
  8. 8. Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, et al. Medical sequencing at the extremes of human body mass. Am J Hum Genet. 2007;80:779–791. pmid:17357083
  9. 9. Romeo S, Pennacchio LA, Fu Y, Boerwinkle E, Tybjaerg-Hansen A, Hobbs HH, et al. Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat Genet. 2007;39:513–516. pmid:17322881
  10. 10. Ji W, Foo JN, O’Roak BJ, Zhao H, Larson MG, Simon DB, et al. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat Genet. 2008;40:592–599. pmid:18391953
  11. 11. Romeo S, Yin W, Kozlitina J, Pennacchio LA, Boerwinkle E, Hobbs HH, et al. Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. J Clin Invest. 2009;119:70–79. pmid:19075393
  12. 12. Nejentsev S, Walker N, Riches D, Egholm M, Todd JA. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009;324:387–389. pmid:19264985
  13. 13. Holm H, Gudbjartsson DF, Sulem P, Masson G, Helgadottir HT, Zanon C, et al. A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nat Genet. 2011;43:316–320. pmid:21378987
  14. 14. McClellan J, King MC. Genetic heterogeneity in human disease. Cell. 2010;141:210–217. pmid:20403315
  15. 15. Ionita-Laza I, Cho MH, Laird NM. Statistical challenges in sequence-based association studies with population- and family-based designs. Statistics in Biosciences. 2013;5:54–70.
  16. 16. Cooper GM, Shendure J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet. 2011;12:628–640. pmid:21850043
  17. 17. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–21. pmid:18691683
  18. 18. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615:28–56. pmid:17101154
  19. 19. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, et al. Testing for an Unusual Distribution of Rare Variants. PLoS Genetics. 2011;7(3):e1001322. pmid:21408211
  20. 20. Chapman J, Whittaker J. Analysis of multiple SNPs in a candidate gene or region. Genet Epidemiol. 2008;32:560–566. pmid:18428428
  21. 21. Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol. 2009;33:497–507. pmid:19170135
  22. 22. Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol. 2011;35:606–19.
  23. 23. Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet. 2011;89:354–67. pmid:21885029
  24. 24. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare Variant Association Testing for Sequencing Data Using the Sequence Kernel Association Test (SKAT). Am J Hum Genet. 2011;89:82–93. pmid:21737059
  25. 25. Daye ZJ, Li H, Wei Z. A powerful test for multiple rare variants association studies that incorporates sequencing qualities. Nucleic Acids Res. 2012;40:e60. pmid:22262732
  26. 26. Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, et al. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet. 2011;89:277–288.
  27. 27. Sunyaev SR. Inferring causality and functional significance of human coding DNA variants. Hum Mol Genet. 2012;21(R1):R10–17. pmid:22990389
  28. 28. Kinnamon DD, Hershberger RE, Martin ER. Reconsidering association testing methods using single-variant test statistics as alternatives to pooling tests for sequence data with rare variants. PLoS One. 2012;7:e30238. pmid:22363423
  29. 29. Barnett I. SNP-set Tests for Sequencing and Genome-Wide Association Studies. Harvard University; 2014.
  30. 30. Pan W, Kim J, Zhang Y, Shen X, Wei P. A powerful and adaptive association test for rare variants. Genetics. 2014;197:1081–95. pmid:24831820
  31. 31. Yuan HY, Chiou JJ, Tseng WH, Liu CH, Liu CK, Lin YJ, et al. FASTSNP: an always up-to-date and extendable service for SNP function analysis and prioritization. Nucleic Acids Res. 2006;34:W635–W641. pmid:16845089
  32. 32. Johnson AD, Handsaker RE, Pulit SL, Nizzari MM, O’Donnell CJ, de Bakker PI. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics. 2008;24:2938–2939. pmid:18974171
  33. 33. Lee PH, Shatkay H. F-SNP: computationally predicted functional SNPs for disease association studies. Nucleic Acids Res. 2008;36:D820–D824. pmid:17986460
  34. 34. Zhang K, Chang S, Cui S, Guo L, Zhang L, Wang J. ICSNPathway: identify candidate causal SNPs and pathways from genome-wide association study by one analytical framework. Nucleic Acids Res. 2011;39:W437–43. pmid:21622953
  35. 35. Hindorff LA, Junkins HA, Hall PN, Mehta JP, Manolio TA. A Catalog of Published Genome-Wide Association Studies; 2011. Available at: www.genome.gov/gwastudies. Accessed July 15, 2011.
  36. 36. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75. pmid:17701901
  37. 37. Agresti A. Categorical Data Analysis. 2nd ed. Gainesville, FL: John Wiley & Sons; 2002.
  38. 38. Dunn OJ. Multiple Comparisons Among Means. J American Statistical Association. 1961;56:52–64.
  39. 39. Bush WS, Moore JH. Chapter 11: Genome-wide association studies. PLoS Comput Biol. 2012;8:e1002822. pmid:23300413
  40. 40. Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med. 1990;9:811–8. pmid:2218183
  41. 41. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: a practical and powerful approach to multiple testing. J Royal Stat Soc B. 1995;57:289–300.
  42. 42. Storey J. A direct approach to false discovery rates. J Royal Stat Soc B. 2002;64:479–498.
  43. 43. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100:9440–9445. pmid:12883005
  44. 44. Dudbridge F, Gusnanto A, Koeleman BPC. Detecting multiple associations in genome-wide studies. Hum Genomics. 2006;2:310–7.
  45. 45. Balding DJ. A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006;7:781–91. pmid:16983374
  46. 46. van den Oord EJ. Controlling false discoveries in genetic studies. American journal of medical genetics, Part B, Neuropsychiatric genetics. 2008;147B:637–644.
  47. 47. Jeske D, Liu Z, Bent E, Borneman J. Classification rules that include neutral zones and their application to microbial community profiling. Communication in Statistics—Theory and Methods. 2007;36:1965–1980.
  48. 48. Drton M, Perlman MD. A SINful approach to Gaussian graphical model selection. J Statistical Planning and Inference. 2008;138:1179–1200.
  49. 49. Jeng XJ. Identification of signal, noise, and indistinguishable subsets in high-dimensional data analysis. arXiv. 2013;stat.ME:1305.0220.
  50. 50. Meinshausen N, Rice J. Estimating the proportion of false null hypotheses among a large number of independent tested hypotheses. Ann Statist. 2006;34:373–393.
  51. 51. Jin J, Cai T. Estimating the null and the proportion of non-null effects in large-scale multiple comparisons. J American Statistical Association. 2007;102:495–506.
  52. 52. Cai T, Jeng XJ, Jin J. Optimal detection of heterogeneous and heteroscedastic mixtures. J Royal Stat Soc B. 2011;73:629–662.
  53. 53. Jeng XJ, Cai T, Li H. Simultaneous Discovery of Rare and Common Segment Variants. Biometrika. 2013;100:157–172. pmid:23825436
  54. 54. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–83. pmid:16251467
  55. 55. Ionita-Laza I, Lee S, Makarov V, Buxbaum J, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet. 2013;92:841–853. pmid:23684009
  56. 56. Powers DMW. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J Machine Learning Technologies. 2011;2:37–63.
  57. 57. Sokolova M, Japkowicz N, Szpakowicz S. Beyond Accuracy, F-score and ROC: a Family of Discriminant Measures for Performance Evaluation. In: Sattar A, Kang BH, editors. AI 2006: Advances in Artificial Intelligence. Berlin: Springer-Verlag; 2006.
  58. 58. Firmann M, Mayor V, Vidal PM, Bochud M, Pecoud A, Hayoz D, et al. The CoLaus study: a population-based study to investigate the epidemiology and genetic determinants of cardiovascular risk factors and metabolic syndrome. BMC Cardiovasc Disord. 2008;17:8:6.
  59. 59. Nelson MR, Wegmann D, Ehm MG, Kessner D, Jean PS, Verzilli C, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337:100–104. pmid:22604722
  60. 60. Song K, Nelson MR, Aponte J, Manas ES, Bacanu SA, Yuan X, et al. Sequencing of Lp-PLA2-encoding PLA2G7 gene in 2000 Europeans reveals several rare loss-of-function mutations. Pharmacogenomics J. 2012;12:425–31. pmid:21606947
  61. 61. Warren LL, Li L, Nelson MR, Ehm MG, Shen J, Fraser DJ, et al. Deep resequencing unveils genetic architecture of ADIPOQ and identifies a novel low-frequency variant strongly associated with adiponectin variation. Diabetes. 2012;61:1297–301. pmid:22403302
  62. 62. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association. Nat Genet. 2006;38:904–909. pmid:16862161
  63. 63. Durrington P. Dyslipidaemia. Lancet. 2003;362:717–731. pmid:12957096
  64. 64. Kelly M, Semsarian C. Multiple mutations in genetic cardiovascular disease: a marker of disease severity? Circ Cardiovasc Genet. 2009;2:182–190.
  65. 65. van Loo KM, Dejaegere T, van Zweeden M, van Schijndel JE, Wijmenga C, Trip MD, et al. Male-specific association between a gamma-secretase polymorphism and premature coronary atherosclerosis. PLoS One. 2008;3(11):e3662. pmid:18987747
  66. 66. Serneels L, Dejaegere T, Craessaerts K, Horre K, Jorissen E, Tousseyn T, et al. Differential contribution of the three Aph1 genes to gamma-secretase activity in vivo. Proc Natl Acad Sci U S A. 2005;102:1719–24. pmid:15665098
  67. 67. Roscioli T, Cliffe ST, Bloch DB, Bell CG, Mullan G, Taylor PJ, et al. Mutations in the gene encoding the PML nuclear body protein Sp110 are associated with immunodeficiency and hepatic veno-occlusive disease. Nat Genet. 2006;38:620–2. pmid:16648851
  68. 68. Liu XR, Liu Q, Chen GY, Hu Y, Sham JS, Lin MJ. Down-regulation of TRPM8 in pulmonary arteries of pulmonary hypertensive rats. Cell Physiol Biochem. 2013;31:892–904. pmid:23817166
  69. 69. Fernandez JA, Skryma R, Bidaux G, Magleby KL, Scholfield CN, McGeown JG, et al. Short isoforms of the cold receptor TRPM8 inhibit channel gating by mimicking heat action rather than chemical inhibitors. J Biol Chem. 2012;287:2963–70. pmid:22128172
  70. 70. Yang XR, Lin MJ, McIntosh LS, Sham JS. Functional expression of transient receptor potential melastatin- and vanilloid-related channels in pulmonary arterial and aortic smooth muscle. Am J Physiol Lung Cell Mol Physiol. 2006;290:L1267–76. pmid:16399784
  71. 71. Out C, Dikkers A, Laskewitz A, Boverhof R, van der Ley C, Kema IP, et al. Prednisolone increases enterohepatic cycling of bile acids by induction of Asbt and promotes reverse cholesterol transport. J Hepatol. 2014;61:351–7. pmid:24681341
  72. 72. Beauharnois JM, Bolivar BE, Welch JT. Sirtuin 6: a review of biological effects and potential therapeutic properties. Mol Biosyst. 2013;9:1789–806. pmid:23592245
  73. 73. Webster KA. A sirtuin link between metabolism and heart disease. Nat Med. 2012;18:1617–9. pmid:23135512
  74. 74. Sundaresan NR, Vasudevan P, Zhong L, Kim G, Samant S, Parekh V, et al. The sirtuin SIRT6 blocks IGF-Akt signaling and development of cardiac hypertrophy by targeting c-Jun. Nat Med. 2012;18:1643–50. pmid:23086477
  75. 75. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. Molecular Biology of the Cell. 5th ed. New York: Garland Science; 2007.
  76. 76. Bailey SF, Hinz A, Kassen R. Adaptive synonymous mutations in an experimentally evolved Pseudomonas fluorescens population. Nat Commun. 2014;5:4076. pmid:24912567
  77. Hunt RC, Simhadri VL, Iandoli M, Sauna ZE, Kimchi-Sarfaty C. Exposing synonymous mutations. Trends Genet. 2014;30:308–21. pmid:24954581
  78. Goebels C, Thonn A, Gonzalez-Hilarion S, Rolland O, Moyrand F, Beilharz TH, et al. Introns regulate gene expression in Cryptococcus neoformans in a Pab2p dependent pathway. PLoS Genet. 2013;9:e1003686. pmid:23966870
  79. Spiltoir JI, Stratton MS, Cavasin MA, Demos-Davies K, Reid BG, Qi J, et al. BET acetyl-lysine binding proteins control pathological cardiac hypertrophy. J Mol Cell Cardiol. 2013;63:175–9. pmid:23939492
  80. Duerr GD, Heinemann JC, Suchan G, Kolobara E, Wenzel D, Geisen C, et al. The endocannabinoid-CB2 receptor axis protects the ischemic heart at the early stage of cardiomyopathy. Basic Res Cardiol. 2014;109:425. pmid:24980781
  81. Gonzalez C, Herradon E, Abalo R, Vera G, Perez-Nievas BG, Leza JC, et al. Cannabinoid/agonist WIN 55,212-2 reduces cardiac ischaemia-reperfusion injury in Zucker diabetic fatty rats: role of CB2 receptors and iNOS/eNOS. Diabetes Metab Res Rev. 2011;1:244–54.
  82. Ford WR, Honan SA, White R, Hiley CR. Evidence of a novel site mediating anandamide-induced negative inotropic and coronary vasodilatator responses in rat isolated hearts. Br J Pharmacol. 2002;1:244–54.
  83. Bi D, Toyama K, Lemaitre V, Takai J, Fan F, Jenkins DP, et al. The intermediate conductance calcium-activated potassium channel KCa3.1 regulates vascular smooth muscle cell proliferation via controlling calcium-dependent signaling. J Biol Chem. 2013;288:15843–53. pmid:23609438
  84. Kohler R. Single-nucleotide polymorphisms in vascular Ca2+-activated K+-channel genes and cardiovascular disease. Pflugers Arch. 2010;460:343–51. pmid:20043229
  85. Toyama K, Wulff H, Chandy KG, Azam P, Raman G, Saito T, et al. The intermediate-conductance calcium-activated potassium channel KCa3.1 contributes to atherogenesis in mice and humans. J Clin Invest. 2008;118:3025–37. pmid:18688283
  86. Yamaguchi M, Nakayama T, Fu Z, Naganuma T, Sato N, Soma M, et al. Relationship between haplotypes of KCNN4 gene and susceptibility to human vascular diseases in Japanese. Med Sci Monit. 2009;15:CR389–97. pmid:19644414
  87. Pereira NL, Aksoy P, Moon I, Peng Y, Redfield MM, Burnett JC, et al. Natriuretic peptide pharmacogenetics: membrane metallo-endopeptidase (MME): common gene sequence variation, functional characterization and degradation. J Mol Cell Cardiol. 2010;49:864–74. pmid:20692264
  88. Munagala VK, Burnett JC, Redfield MM. The natriuretic peptides in cardiovascular medicine. Curr Probl Cardiol. 2004;29:707–69. pmid:15550914
  89. Garg NJ. Inflammasomes in cardiovascular diseases. Am J Cardiovasc Dis. 2011;1:244–54. pmid:22254202
  90. Tang Y, Mi C, Liu J, Gao F, Long J. Compromised mitochondrial remodeling in compensatory hypertrophied myocardium of spontaneously hypertensive rat. Cardiovasc Pathol. 2014;23:101–6. pmid:24388463
  91. Walsh DA, McWilliams DF. Tachykinins and the cardiovascular system. Curr Drug Targets. 2006;7:1031–42. pmid:16918331
  92. Hoover DB, Chang Y, Hancock JC, Zhang L. Actions of tachykinins within the heart and their relevance to cardiovascular disease. Jpn J Pharmacol. 2000;84:367–73. pmid:11202607
  93. Tang H, Xiao K, Mao L, Rockman HA, Marchuk DA. Overexpression of TNNI3K, a cardiac-specific MAPKKK, promotes cardiac dysfunction. J Mol Cell Cardiol. 2013;54:101–11. pmid:23085512
  94. Wheeler FC, Tang H, Marks OA, Hadnott TN, Chu PL, Mao L, et al. Tnni3k modifies disease progression in murine models of cardiomyopathy. PLoS Genet. 2009;5:e1000647. pmid:19763165
  95. Theis JL, Zimmermann MT, Larsen BT, Rybakova IN, Long PA, Evans JM, et al. TNNI3K mutation in familial syndrome of conduction system disease, atrial tachyarrhythmia and dilated cardiomyopathy. Hum Mol Genet. 2014;23:5793–804. pmid:24925317
  96. Zoledziewska M, Costa G, Pitzalis M, Cocco E, Melis C, Moi L, et al. Variation within the CLEC16A gene shows consistent disease association with both multiple sclerosis and type 1 diabetes in Sardinia. Genes Immun. 2009;10:15–7. pmid:18946483
  97. Fox CS, Heard-Costa NL, Wilson PW, Levy D, D’Agostino RB, Atwood LD. Genome-wide linkage to chromosome 6 for waist circumference in the Framingham Heart Study. Diabetes. 2004;53:1399–402. pmid:15111512
  98. Lee KW, Abrahamowicz M, Leonard GT, Richer L, Perron M, Veillette S, et al. Prenatal exposure to cigarette smoke interacts with OPRM1 to modulate dietary preference for fat. J Psychiatry Neurosci. 2015;40:38–45. pmid:25266401
  99. Decramer M, Janssens W, Miravitlles M. Chronic obstructive pulmonary disease. Lancet. 2012;379:1341–51. pmid:22314182
  100. Currie GP, Butler CA, Anderson WJ, Skinner C. Phosphodiesterase 4 inhibitors in chronic obstructive pulmonary disease: a new approach to oral treatment. Br J Clin Pharmacol. 2008;65:803–10. pmid:18341675
  101. Giembycz MA. Phosphodiesterase-4: selective and dual-specificity inhibitors for the therapy of chronic obstructive pulmonary disease. Proc Am Thorac Soc. 2005;2:326–33. pmid:16267357
  102. Giembycz MA. Cilomilast: a second generation phosphodiesterase 4 inhibitor for asthma and chronic obstructive pulmonary disease. Expert Opin Investig Drugs. 2001;10:1361–79. pmid:11772257
  103. Li QS, Cheng P, Favis R, Wickenden A, Romano G, Wang H. SCN9A Variants may be Implicated in Neuropathic Pain Associated with Diabetic Peripheral Neuropathy and Pain Severity. Clin J Pain. 2015;
  104. Huang Y, Zang Y, Zhou L, Gui W, Liu X, Zhong Y. The role of TNF-alpha/NF-kappa B pathway on the up-regulation of voltage-gated sodium channel Nav1.7 in DRG neurons of rats with diabetic neuropathy. Neurochem Int. 2014;75:112–9. pmid:24893330
  105. Liu DJ, Leal SM. A Novel Adaptive Method for the Analysis of Next-Generation Sequencing Data to Detect Complex Trait Associations with Rare Variants Due to Gene Main Effects and Interactions. PLoS Genet. 2011;6:e1001156.
  106. Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser PA, et al. SNP Set Association Analysis for Familial Data. Genet Epidemiol. 2012;36:797–810. pmid:22968922
  107. Oualkacha K, Dastani Z, Li R, Cingolani PE, Spector TD, Hammond CJ, et al. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet Epidemiol. 2013;37:366–376. pmid:23529756
  108. Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc. 2004;99:96–104.
  109. Long N, Dickson SP, Maia JM, Kim HS, Zhu Q, Allen AS. Leveraging prior information to detect causal variants via multi-variant regression. PLoS Comput Biol. 2013;9(6):e1003093. pmid:23762022
  110. Ionita-Laza I, Capanu M, De Rubeis S, McCallum K, Buxbaum JD. Identification of rare causal variants in sequence-based studies: methods and applications to VPS13B, a gene involved in Cohen syndrome and autism. PLoS Genet. 2014;10(12):e1004729. pmid:25502226
  111. Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant…or not? Hum Mol Genet. 2002;11:2417–2423. pmid:12351577
  112. Logan BR, Geliazkova MP, Rowe DB. An evaluation of spatial thresholding techniques in fMRI analysis. Hum Brain Mapp. 2008;29:1379–1389. pmid:18064589
  113. Fan J, Han X, Gu W. Control of the false discovery rate under arbitrary covariance dependence. J Am Stat Assoc. 2012;107:1019–1045.
  114. Friguet C, Kloareg M, Causeur D. A Factor Model Approach to Multiple Testing Under Dependence. J Am Stat Assoc. 2009;104:1406–15.
  115. Genovese C, Wasserman L. Operating characteristics and extensions of the false discovery rate. J Royal Stat Soc B. 2002;64:499–517.
  116. Sarkar SK. FDR-controlling stepwise procedure and their false negatives rates. J Stat Plan Inference. 2004;125:119–137.
  117. Strimmer K. A unified approach to false discovery rate estimation. BMC Bioinformatics. 2008;9. pmid:18613966
  118. Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J Royal Stat Soc B. 2004;66:187–205.
  119. Sarkar SK. False discovery and false nondiscovery rates in single-step multiple testing procedures. Ann Stat. 2006;34:394–415.
  120. Cai T, Jin J, Low M. Estimation and Confidence Sets For Sparse Normal Mixtures. Ann Stat. 2007;35:2421–2449.