Case-control studies of gene-environment interactions. When a case might not be the case

  • Iryna Lobach,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Validation, Writing – original draft, Writing – review & editing

    Iryna.lobach@ucsf.edu

    Affiliation Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California, United States of America

  • Joshua Sampson,

    Roles Conceptualization, Formal analysis, Methodology, Writing – review & editing

    Affiliation National Cancer Institute, National Institutes of Health, Bethesda, MD, United States of America

  • Alexander Alekseyenko,

    Roles Formal analysis, Writing – review & editing

    Affiliation Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America

  • Siarhei Lobach,

    Roles Formal analysis, Methodology, Writing – review & editing

    Affiliation Applied Mathematics and Computer Science Department, Belarusian State University, Minsk, Belarus

  • Li Zhang

    Roles Data curation, Formal analysis, Investigation, Writing – review & editing

    Affiliations Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California, United States of America, Department of Medicine, University of California San Francisco, San Francisco, California, United States of America, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, United States of America

Abstract

Case-control Genome-Wide Association Studies (GWAS) provide a rich resource for studying the genetic architecture of complex diseases. A key goal is to elucidate how genetic effects vary by environment, which is traditionally described in terms of Gene-Environment interactions (GxE). An overlooked complication is that multiple, distinct pathophysiologic mechanisms may lead to the same clinical diagnosis, and these mechanisms often have distinct genetic bases. In this paper, we first show that using the clinically diagnosed status can lead to severely biased estimates of GxE interactions when the frequency of the pathologic diagnosis of interest, relative to other diagnoses, depends on the environment. We then propose a pseudo-likelihood solution to correct the bias. Finally, we demonstrate our method in extensive simulations and in a GWAS of Alzheimer’s disease.

Introduction

We are interested in using data from a case-control Genome-Wide Association Study (GWAS) to estimate how an “environmental variable” modifies the effect of a genetic variant on a specific, pathologically defined disease state. The complication is that in many GWAS the cases are a heterogeneous group, in which multiple distinct pathologically defined disease states have led to a common set of symptoms and a shared clinical diagnosis. In these scenarios, a genetic variant will appear to interact with the environmental variable if the genetic variant affects the pathologically defined disease state of interest and the environmental variable is related to the proportion of cases with that disease state.

The issue of heterogeneity among cases is, perhaps, most pronounced in neurologic and psychiatric disorders, where the clinically defined status is based primarily on descriptive criteria and is typically assigned in the absence of biomarker measurements, imaging data, and biopsies. Our specific motivating study is a GWAS of late-onset Alzheimer’s disease (AD), a neurodegenerative disorder that is clinically characterized by progressive mental decline. Here, we are interested in identifying genetic variants specifically associated with a high abundance of amyloid deposits and neurofibrillary tangles in the brain, which we refer to as “histopathologically defined AD” [1]. Specifically, we are interested in whether carrying the ApoE ε4 variant, which in this study is considered the “environmental variable”, modifies the effect of SNPs residing in Toll-Like Receptors (TLR) and the Receptor for advanced glycation end products (RAGE) on histopathologically defined AD. Importantly, ApoE ε4 status is likely to be associated with the proportion of the GWAS cases who have histopathologically defined AD. Recent biomarker studies of AD [2] reported that 36% of ApoE ε4 non-carriers and 6% of ApoE ε4 carriers clinically diagnosed with AD do not have evidence of amyloid deposition. We provide a more detailed description of ApoE ε4, other risk factors for AD, and the heterogeneity of the disease in the Discussion section.

We are interested in testing for an association between single nucleotide polymorphisms (SNPs) residing in Toll-Like Receptors (TLR) and the true AD diagnosis, i.e. our goal is to identify the genetic variants that might have led to amyloid plaques with associated cognitive decline. TLRs play a key role in the innate immune response to invading pathogens and are also important for triggering adaptive immune responses. Dysregulation of human toll-like receptor function has been shown in aging [3]. With respect to the etiology of AD specifically, TLRs act through modification of the inflammatory state of microglia/macrophages [1]. The Receptor for advanced glycation end products (RAGE) has been identified as a receptor for amyloid-beta peptide [4].

There is an extensive literature on how estimates of the main genetic effect can be biased when disease status is misclassified, i.e. when the clinical and pathologic diagnoses do not correspond [5]. We extend the literature by investigating the impact of misdiagnosis on estimates of the Gene-Environment interaction (GxE). In case-control studies, the effects of covariates have traditionally been assessed using logistic regression analysis [6]. More recently, Chatterjee and Carroll [7] showed that the assumptions of Hardy-Weinberg Equilibrium and gene-environment independence can be leveraged in appropriate retrospective analyses to gain statistical efficiency. We adopt the principles derived by Chatterjee and Carroll [7] and develop a pseudo-likelihood model for settings in which a case defined by the clinical diagnosis might not be a case in terms of the true, pathophysiologically defined diagnosis.

Our paper is organized as follows. First, in the Materials and Methods section we present the setting, notation, and proposed pseudo-likelihood approach. Next, the Simulation Experiments section describes the simulations conducted to compare the performance of the proposed method with that of standard logistic regression using clinically defined disease. In the same section, we apply our method to the motivating study of AD. The Discussion section concludes the paper.

Materials and methods

Let G be the genotype, e.g. SNPs measured at multiple locations. Let X be the environmental variables that interact with G and let Z be other environmental variables. We assume that the genotype is independent of all environmental variables and that the genotype follows Hardy-Weinberg Equilibrium: G~Q(g,θ). If θ is the frequency of the minor allele a when the major allele is A, then the Hardy-Weinberg Equilibrium model [8] according to the number of minor alleles is pr(G = 0) = (1−θ)², pr(G = 1) = 2θ(1−θ), pr(G = 2) = θ².
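Under Hardy-Weinberg Equilibrium with minor allele frequency θ, the probabilities of carrying 0, 1 or 2 minor alleles are (1−θ)², 2θ(1−θ) and θ². A minimal Python sketch of this calculation follows; the function name `hwe_genotype_probs` is illustrative and is not part of the original analysis.

```python
def hwe_genotype_probs(theta):
    """Return [pr(G=0), pr(G=1), pr(G=2)] under Hardy-Weinberg Equilibrium.

    theta is the minor allele frequency; G counts minor alleles.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return [(1.0 - theta) ** 2, 2.0 * theta * (1.0 - theta), theta ** 2]


probs = hwe_genotype_probs(0.10)
# Probability of carrying at least one minor allele (dominant coding).
carrier_prob = probs[1] + probs[2]
```

With θ = 0.10 the three probabilities sum to one, and the dominant-coding carrier probability is the sum of the last two entries.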

Let DCL ∈ {0, 1} be the observed clinical disease status, defined based on a set of symptoms. Suppose that the same set of symptoms can be caused by two distinct pathophysiologic mechanisms. Let D be the true disease status defined based on the underlying pathology, where D = 1 indicates the disease of interest, while D = 1* is the nuisance disease. For ethical and/or budgetary reasons it might not be possible to measure the underlying pathology, hence D is latent. Instead, an evaluation is performed on a subset of patients or in an external reliability study. We define τ(X) = pr(D = 1|DCL = 1,X) to be the frequency of the true diagnosis of interest within the clinically diagnosed set, which may vary with the environment X. We denote the population probability of the true diagnosis by πd = pr(D = d); the population probability of the clinical diagnosis is pr(DCL = dcl).

The clinical and true diagnoses are related by pr(DCL = dcl) = Σd pr(DCL = dcl|D = d) πd, which indicates that the probabilities of the clinical diagnosis are weighted sums of the frequencies of the true diagnoses. If pr(DCL = dcl|D = d,X = x,G = g) = pr(DCL = dcl|D = d), then DCL is a surrogate of D. In this setting, pr(DCL = dcl|X = x,G = g) = Σd pr(DCL = dcl|D = d) pr(D = d|X = x,G = g); hence if there is no relationship between (X,G) and D, there is none between (X,G) and DCL either.

We first consider a binary setting where the risk parameters are defined in terms of D = 1 vs. D = 1* and D = 0 combined. Then the risk model is defined in terms of coefficients B = (β0,βG,βX,βZ,βG×X) by (1)

In the second setting that we consider, the risk model is defined separately for D = 1 vs. D = 0 in terms of B = (β0,βG,βX,βZ,βG×X) and for D = 1* vs. D = 0 in terms of a second coefficient set B* by (2)
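Assuming logistic links, risk models of the two kinds described can be sketched as follows. The logistic form is an assumption made here for illustration, and the starred coefficient names are hypothetical labels for the entries of B*.

```latex
% Sketch only: the logistic links and the starred symbols are assumptions.
% Binary setting: disease of interest vs. nuisance disease and controls combined
\operatorname{logit}\,\mathrm{pr}(D = 1 \mid G, X, Z)
  = \beta_0 + \beta_G G + \beta_X X + \beta_Z Z + \beta_{G\times X}\, G X

% Second setting: each disease state compared with D = 0
\log\frac{\mathrm{pr}(D = 1 \mid G, X, Z)}{\mathrm{pr}(D = 0 \mid G, X, Z)}
  = \beta_0 + \beta_G G + \beta_X X + \beta_Z Z + \beta_{G\times X}\, G X

\log\frac{\mathrm{pr}(D = 1^{*} \mid G, X, Z)}{\mathrm{pr}(D = 0 \mid G, X, Z)}
  = \beta_0^{*} + \beta_G^{*} G + \beta_X^{*} X + \beta_Z^{*} Z + \beta_{G\times X}^{*}\, G X
```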

In Eq (2) B and B* might share coefficients, e.g. if .
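A small numerical sketch can show how case heterogeneity alone may produce an apparent GxE interaction. It assumes a polytomous logistic model in which every clinically diagnosed case has either D = 1 or D = 1*, the genotype affects only the disease of interest, and there is no true interaction. Under these assumptions the odds ratio of G for the clinical diagnosis within a stratum of X equals τ0·exp(βG) + (1 − τ0), where τ0 is the share of the disease of interest among clinically diagnosed cases with G = 0. The values 0.64 and 0.94 are borrowed from the 36% and 6% figures quoted in the Introduction, and βG = log(1.5) is an illustrative choice; all names in the code are hypothetical.

```python
import math


def clinical_log_or(beta_g, tau_g0):
    """Log odds ratio of G for the clinical diagnosis within one stratum of X.

    beta_g : log odds ratio of G for the disease of interest (D = 1 vs. D = 0)
    tau_g0 : pr(D = 1 | DCL = 1, G = 0, X = x), the share of the disease of
             interest among clinically diagnosed cases with G = 0
    Assumes G has no effect on the nuisance disease (illustrative assumption).
    """
    return math.log(tau_g0 * math.exp(beta_g) + (1.0 - tau_g0))


beta_g = math.log(1.5)  # illustrative genetic effect; the true GxE is zero
log_or_noncarrier = clinical_log_or(beta_g, tau_g0=0.64)  # 36% nuisance disease
log_or_carrier = clinical_log_or(beta_g, tau_g0=0.94)     # 6% nuisance disease

# Difference of stratum-specific log odds ratios = apparent interaction.
apparent_interaction = log_or_carrier - log_or_noncarrier
```

In this toy setting both stratum-specific log odds ratios are attenuated relative to βG, more so among ε4 non-carriers, so the clinical diagnosis shows a non-zero interaction (about 0.11 on the log scale) even though none was specified. The sign and size of such an artifact depend on the coefficients chosen.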

The observed data are collected using a case-control design where genetic and environmental variables are measured after the disease status is ascertained. However, the data will be analyzed as a random sample. To facilitate this analysis, we let δ = 1 be an indicator of selection into the study and consider the imaginary Bernoulli sampling with . Define and with a parameter set Ω = (κ0,β0,βG,βX,βZ,βG×X,θ) For model (1) we define and for model (2) we define

In addition we let

Consider probability, Pr(DCL,G|X,Z,δ = 1) and define a function L(dCL,g,x,z;Ω) as follows.

(3)

The pseudo-likelihood (4) can be used in place of the likelihood function based on arguments provided in the Appendix. Define Ψ(dcl,g,x,z;Ω) to be the derivative of log{L(dcl,g,x,z;Ω)} with respect to Ω and where all expectations are taken with respect to the actual retrospective sampling scheme. Derivations shown in the Appendix demonstrate that under suitable regularity conditions there is a consistent sequence of solutions to with the following property

Remark 1: The intercept parameter is a function of the probability of disease in the population. Hence, if the probability of clinical diagnosis in the population is known or a good bound can be specified, this information can be used while estimating parameters. This cannot be done in the usual logistic regression setting.

Results

Simulation experiments

The goal of the simulation study is to examine potential differences in the effect estimates of the genetic and environmental variables in relation to 1) the observed clinical diagnosis, using the usual logistic regression model (uLR) and the pseudo-likelihood model (pMLE) [7]; and 2) the true disease status, using our pseudo-likelihood approach (pMLE-DX), which takes into account that only a proportion of the clinically diagnosed cases have the true disease. In pMLE-DX, parameters are estimated based on Eq (4). Estimates are compared by their bias and root mean squared error (RMSE). Simulations are performed using MATLAB version R2017a.
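Bias and RMSE across simulated datasets can be computed as in the following Python sketch. It is a generic illustration, not the MATLAB code used for the simulations; the function name and the toy estimates are hypothetical.

```python
import math


def bias_and_rmse(estimates, true_value):
    """Bias and root mean squared error of replicate estimates of one parameter."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, rmse


# Toy example: three replicate estimates of a parameter whose true value is 1.0.
bias, rmse = bias_and_rmse([0.9, 1.1, 1.3], true_value=1.0)
```

In a simulation study, `estimates` would hold one estimate per simulated dataset (500 per setting here), and the calculation would be repeated for each coefficient and each method.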

In each setting we simulate 500 datasets with n0 = n1 ∈ {1000,3000,5000,10000,50000}. We let the genotype (G) be a Bernoulli random variable with frequency 0.10 to mimic a SNP and allow its effect to follow a recessive or dominant model. We set our other parameters to be similar to the values observed in our GWAS of AD. The binary variable X ∈ {ε4+,ε4−} represents ApoE ε4 status, i.e. the presence or absence of the ε4 allele, which occurs in approximately 14% of the population.

The proportion of the nuisance disease within the clinical diagnosis is defined as pr(D = 1*|DCL = 1,ε4−) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06. The clinical diagnosis of late-onset AD is defined for ages 65 and older. We simulated age (Z1) to be Bernoulli with frequency 0.50, e.g. corresponding to a median split. Sex (Z2) is Bernoulli with frequency 0.52 to reflect what we observed in the motivating data example of AD.
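The covariate distributions and the latent status of a clinically diagnosed case can be drawn as in the sketch below. It uses only the frequencies quoted in the text and does not implement the risk model or the case-control sampling. The function names, the seed and the sample size are arbitrary choices made for illustration.

```python
import random


def draw_covariates(n, rng):
    """Draw binary covariates with the frequencies quoted in the text."""
    return [
        {
            "G": int(rng.random() < 0.10),    # genotype indicator
            "e4": int(rng.random() < 0.14),   # ApoE e4 carrier status (X)
            "age": int(rng.random() < 0.50),  # Z1, median split
            "sex": int(rng.random() < 0.52),  # Z2
        }
        for _ in range(n)
    ]


def draw_latent_status(e4, rng):
    """Latent diagnosis of one clinically diagnosed case: '1' or '1*'."""
    p_nuisance = 0.06 if e4 else 0.36
    return "1*" if rng.random() < p_nuisance else "1"


rng = random.Random(2017)  # arbitrary seed for reproducibility
subjects = draw_covariates(20000, rng)
g_frequency = sum(s["G"] for s in subjects) / len(subjects)

# Latent statuses for hypothetical clinically diagnosed e4 non-carriers.
noncarrier_cases = [draw_latent_status(0, rng) for _ in range(20000)]
nuisance_share = noncarrier_cases.count("1*") / len(noncarrier_cases)
```

The empirical share of the nuisance disease among the simulated non-carrier cases should be close to the specified 0.36.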

Setting A.

We first examine a setting when the nuisance disease and controls are equivalent in that the risk parameters are defined for the disease of interest vs. the combination of controls and nuisance disease as in Eq (1). The risk coefficients are . In this setting, the frequency of the true disease status is pr(D = 1) = 46%, pr(D = 1|ε4−) = 40%, pr(D = 1|ε4+) = 82%. Table 1 presents properties of the risk parameter estimates in the datasets with n0 = n1 = 3,000. Additionally, shown in S1 Table are studies with n0 = n1 ∈ {1000,5000,10000,50000}. When the presence of the nuisance disease is ignored (uLR, pMLE), and are biased with elevated RMSE. For example, in a study with n0 = n1 = 3,000, the bias in is -0.31 in uLR and pMLE, while the bias is reduced to 0.005 by pMLE-DX. RMSE is 0.33 in uLR and pMLE, while it is reduced to 0.12 by pMLE-DX. Similarly, bias in is 0.56 in uLR and pMLE, while pMLE-DX reduces the bias by more than half. RMSE of is 2.5x larger when the presence of the nuisance disease is ignored. Notably, estimates of and are biased in uLR and pMLE. When sample size increased, the uLR bias in decreased, e.g. the bias is 0.08 in a study with n0 = n1 = 10,000; while the bias in persisted. Across all sample sizes, is biased by approximately -0.13, whereas considering the nuisance disease nearly eliminated the bias, e.g. to -0.01 in a study with n0 = n1 = 1000.

Table 1. Bias and RMSE in parameter estimates when βG×ε4 ≠ 0.

https://doi.org/10.1371/journal.pone.0201140.t001

We next examine if the presence of the nuisance disease could lead us to erroneously conclude that there was a significant when βG×ε4 = 0. Here, we simulated datasets with βG×ε4 = 0. Table 2 presents estimates in a study with n0 = n1 = 3000 and S2 Table is based on studies with n0 = n1 ∈ {1000,5000,10000,50000}. Estimates of and βG×ε4 are clearly biased when the presence of the nuisance disease is ignored. For example, in a study with n0 = n1 = 3,000, pMLE-DX decreased the bias in from 0.12 in uLR and pMLE to 0.04, while RMSE remained approximately the same 0.41 vs. 0.43. Similarly, pMLE-DX reduced the bias in from -0.26 in uLR to 0.007. At the same time, the RMSE of went from 0.28 (uLR, pMLE) to 0.12 (pMLE-DX). Increasing the sample size reduced the uLR bias for e.g. the bias is 0.09 in a study with n0 = n1 = 10,000 but did not alleviate the substantial uLR bias in βε4. Across all sample sizes considered, the uLR estimates of βG are biased by approximately -0.12, while pMLE-DX reduced the bias to e.g. 0.01 in a study with 1,000 cases and 1,000 controls.

Table 2. Bias and RMSE in parameter estimates when βG×ε4 = 0.

https://doi.org/10.1371/journal.pone.0201140.t002

We next consider the effect of underestimating pr(D = 1*|DCL = 1,ε4+) and pr(D = 1*|DCL = 1,ε4−) in the pseudo-likelihood. Here, we simulate data using the parameters specified above but, when fitting the pseudo-likelihood (S3 Table), set pr(D = 1*|DCL = 1,ε4−) = 0.30 and pr(D = 1*|DCL = 1,ε4+) = 0, i.e. each underestimated by 0.06. Naturally, this misspecification introduced bias in some of the estimates and hence increased RMSE. Estimates of βε4 were generally affected more than the estimates of the other parameters. For example, in a study with 3,000 cases and 3,000 controls, bias in increased from 0.005 to -0.66 in pMLE-DX, while RMSE went from 0.12 to 0.67. In estimates of βG×ε4, the bias increased from 0.22 to 0.32, while RMSE went up from 0.93 to 0.94. The bias in increased to -0.10 in a study with 3,000 cases and 3,000 controls, which did not reach the level of uLR, where the bias is -0.12. Estimates of remained nearly unbiased with the same RMSE.

We next consider the effect of overestimating pr(D = 1*|DCL = 1,ε4+) and pr(D = 1*|DCL = 1,ε4−) in the pseudo-likelihood (S4 Table). Here, we simulate data using the parameters specified above, but, when fitting the pseudo-likelihood, set pr(D = 1*|DCL = 1,ε4−) = 0.42 and pr(D = 1*|DCL = 1,ε4+) = 0.16, i.e. overestimated by 6%. As expected, this misspecification inflated the bias in the risk estimates. For example, in a study of 3,000 cases and 3,000 controls, bias in increased from 0.005 to -0.43, while RMSE went from 0.12 to 0.44. Bias in decreased from 0.22 to 0.17, while RMSE remained the same. Estimates of βG and remained nearly unbiased.

Setting B.

We next examine a setting when two sets of parameters define the risk of disease, i.e. for D = 1 vs. D = 0 and D = 1* vs. D = 0 according to the risk model (2). Table 3 (n0 = n1 = 3,000) and S5 Table present parameter estimates in the setting when With these parameters, the frequencies of the disease of interest and the nuisance disease are pr(D = 1) = 25.1%, pr(D = 1*) = 12.5%, pr(D = 1|ε4+) = 45.4%, pr(D = 1*|ε4+) = 16.1%, pr(D = 1|ε4−) =20%, pr(D = 1*|ε4−) = 16.1%. When presence of the nuisance disease is ignored (uLR, pMLE), estimates of β0,βε4,βG×ε4,βG are substantially biased.For example, in a study with 3,000 cases and 3,000 controls, in the bias of uLR for is -0.22, while pMLE-DX reduced this bias to -0.006; the bias of uLR for is -0.13, while pMLE-DX reduced this bias to 0.01; the bias of uLR bias for is 0.30, while pMLE-DX reduced it to 0.005. Biases in uLR persisted for larger sample sizes. If a priori evidence is sufficient to set parameters and to 0, when in fact these coefficients are zero, then RMSE of pMLE-DX are further reduced by at least 2-fold (data not shown).

Table 3. Bias and RMSE in parameter estimates when = 0 and = 0.

https://doi.org/10.1371/journal.pone.0201140.t003

Table 4 and S6 Table present the results in a setting similar to that of Table 3 but with no interaction between the genotype and ApoE ε4 status, i.e. βG×ε4 = 0. Ignoring the nuisance disease in uLR resulted in a bias of -0.23 in the estimate of βG×ε4, which could lead to the erroneous conclusion that the genotype and ApoE ε4 status interact. The bias persisted for larger sample sizes.

Table 4. Bias and RMSE in parameter estimates when = 0, βG×ε4 = 0 and = 0.

https://doi.org/10.1371/journal.pone.0201140.t004

Setting C.

We next conducted a simulation study to better understand the nature of the biases in the estimates observed when the presence of the nuisance disease is ignored (uLR). For clarity, we simulated all variables to be binary. Variables G, Z1 and Z2 are Bernoulli with frequencies 0.10, 0.50 and 0.52, respectively. Risk coefficients are Then we varied values of , and βG×ε4. The relationship between the clinical and pathophysiological diagnoses is set to be pr(D = 1*|DCL = 1,ε4−) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06. We simulated 500 datasets with 3,000 cases and 3,000 controls.

Fig 1 presents a study where βε4 varies as log(1),log(1.5),log(2),log(2.5),…log(8) across the x-axis and is color-coded to be 0, 0.5, 1, 1.5. We show in panels A, B, C, D, and E, the biases of , and , respectively. With increasing value of βε4, the biases in the main effect estimates of and βG increase. For example, the bias in reaches -0.10 when βε4 is log(5). The bias in and is even more sensitive to value of βε4. For example, when βε4 = 0, the bias in is 0.8; while when βε4 = log(8) the bias is -0.7. Similarly, when βε4 = 0, the bias in is -0.18; while when βε4 = log(8) the bias becomes 0.6. Bias in the estimates of increases with the increase in the true value. Bias in the other estimates is nearly not affected by values of .

Fig 1.

The bias in estimates of (βAge) (A), (βSex) (B), βε4 (C), βG (D), and βG×ε4 (E) obtained using the usual logistic regression with clinical diagnosis as the outcome across values of βε4. Simulated are datasets with 3,000 cases and 3,000 controls. Values of βApoE4 are listed along the x-axis and the true values of are indicated by color. The parameters are set as follows: β0 = −1, βG = log(1.5), βG×ε4 = log(3); the relationship between the clinical and true disease statuses is pr(D = 1*|DCL = 1,ε4-) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06. Variables G,Z1 and Z2 are Bernoulli with frequencies 0.10, 0.50 and 0.52, respectively.

https://doi.org/10.1371/journal.pone.0201140.g001

Fig 2 presents a study where βG×ε4 varies as log(1),log(1.5),log(2),log(2.5),…log(8) across the x-axis and is color-coded to be 0, 0.5, 1, 1.5. We show in panels A, B, C, D, and E, the bias of , respectively. In this setting, the biases in the main effects and were approximately the same for all values of βG×ε4, while the biases in the estimates of and were more sensitive to the value of βG×ε4. For example, when the interaction coefficient is 0, the bias of is nearly -2, while when βG×ε4 = log(8) = 2.08, the bias goes up to 3. When βG×ε4 = 0, the bias in the estimate is nearly zero, while the bias goes to almost 6 when the true value is log(8).

Fig 2.

The bias in estimates of (βAge) (A), (βSex) (B), βε4 (C), βG (D), and βG×ε4 (E) obtained using the usual logistic regression with clinical diagnosis as the outcome across values of βG×ε4. Simulated are datasets with 3,000 cases and 3,000 controls. Values of βG×ApoE4 are listed along the x-axis and the true values of are indicated by color. The parameters are set as follows: β0 = −1, βG = log(1.5), βG×ε4 = log(3); the relationship between the clinical and true disease statuses is pr(D = 1*|DCL = 1,ε4-) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06. Variables G,Z1 and Z2 are Bernoulli with frequencies 0.10, 0.50 and 0.52, respectively.

https://doi.org/10.1371/journal.pone.0201140.g002

Analyses of genetic variants serving toll-like receptors and receptor for advanced glycation end products in Alzheimer’s disease

We applied the proposed analyses to a dataset collected as part of the Alzheimer’s Disease Genetics Consortium. The data were anonymized prior to access by the authors. The data consist of 1,245 controls and 2,785 cases. The mean ages (SD) of cases and controls are 72.1 (9.1) and 70.9 (8.8) years, respectively. Among cases, 1,458 (52.4%) are men; among controls, 678 (63.9%) are men. At least one ApoE ε4 allele is present in 64.5% of cases and in 365 (29.1%) of controls.

Illumina Human 660K markers were mapped onto human chromosomes using the NCBI dbSNP database (https://www.ncbi.nlm.nih.gov/projects/SNP/). Chromosome location, proximal gene or genes, and gene structure location (e.g. intron, exon, intergenic, UTR) were recorded for all SNPs. From these data, we inferred 111 SNPs to reside in genes serving Toll-Like Receptors (TLR). Similarly, we inferred 3 SNPs to reside in the Receptor for advanced glycation end products (AGER).

It is of interest to examine the relationship between the pathologic diagnosis and each of the 111 TLR SNPs (G), ApoE ε4 status (X), age (Z1), and sex (Z2). The effect of a SNP might vary by ApoE ε4 status, hence we included an interaction between the genotype and ApoE ε4 status. The genetic variables are modeled using a binary indicator of the presence or absence of a minor allele.

We estimate parameters using the standard logistic model (uLR), which uses the clinical diagnosis as a surrogate of the pathophysiologic diagnosis, and the pseudo-likelihood model (pMLE-DX), where we assume that the relationship between the clinical and pathophysiologic diagnoses is as estimated in the Salloway study [2], i.e. the proportion of the nuisance disease within the clinically diagnosed set is 36% in ApoE ε4 non-carriers and 6% in ApoE ε4 carriers. The pseudo-likelihood model pMLE-DX estimates the coefficients in a model that treats the nuisance disease and controls equivalently, as in Eq (1). pMLE-DX*, however, estimates two sets of risk coefficients, as in Eq (2). Data analyses are performed using MATLAB version R2017a. When optimizing the pseudo-likelihood function, we bounded the estimates to lie in the interval [−5, 5].
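As a back-of-the-envelope check of how much case heterogeneity these assumptions imply, the expected share of the nuisance disease among the clinically diagnosed cases can be obtained by weighting the 6% and 36% figures by the ApoE ε4 carrier frequency among cases (64.5%, from the sample description). The calculation below is illustrative only and is not part of the reported analysis; the variable names are hypothetical.

```python
# Figures quoted in the text; the calculation itself is only illustrative.
n_cases = 2785
carrier_share_cases = 0.645  # cases with at least one ApoE e4 allele
p_nuisance = {"carrier": 0.06, "noncarrier": 0.36}

# Law of total probability over carrier status among cases.
expected_nuisance_share = (
    carrier_share_cases * p_nuisance["carrier"]
    + (1.0 - carrier_share_cases) * p_nuisance["noncarrier"]
)
expected_nuisance_cases = n_cases * expected_nuisance_share
```

Under these inputs, roughly one in six clinically diagnosed cases would be expected to have the nuisance disease.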

We first examine the results when statistical significance is assessed according to p-value <0.05. We next correct for the false discovery rate using the Benjamini-Hochberg method [9].
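The Benjamini-Hochberg step-up adjustment can be sketched as follows. This is a generic textbook implementation in Python, not the code used for the reported analysis; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


example_p = [0.01, 0.04, 0.03, 0.5]
adjusted_p = benjamini_hochberg(example_p)
# A test is declared significant at FDR level 0.05 if its adjusted p-value <= 0.05.
significant = [q <= 0.05 for q in adjusted_p]
```

In this toy example only the smallest p-value survives the correction at the 0.05 level, which illustrates why associations that are nominally significant at p < 0.05 may not remain so after adjustment.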

TLR.

Shown in Table 5 are estimates of the risk coefficients for 53 SNPs with permutation-based p-values for or that are <0.05 in either of the analyses. Of these 53 SNPs, 28 are within 500k up- or downstream of SNPs previously reported in GWAS of Alzheimer’s disease, dementia, tauopathy, and/or vascular disease (S6 Table).

Table 5. Parameter estimates in Alzheimer’s disease study.

https://doi.org/10.1371/journal.pone.0201140.t005

Estimates of βG or βG×ε4 differ numerically between the three approaches. For 14 of these 53 SNPs, have p-values <0.05 in uLR, while in pMLE-DX and pMLE-DX* the corresponding p-values are >0.05. These associations detected by uLR might be spurious, arising because the relationship between the clinical and pathophysiological diagnoses varies by ApoE ε4 status.

One SNP, rs830832, has significant both in uLR () and . This SNP is located in the intergenic region between SORBS2 and TLR3 on chromosome 4 and is 72k downstream of SNP rs75718659, which was reported to be associated with Alzheimer’s disease in a family-based GWAS [10].

Among the seven SNPs that appear to have significant in pMLE-DX* but not in uLR, two, rs4862611 ( = -2.8, p = 0.03) and rs1706143 ( = 2.9, p = 0.03), are also located in the intergenic region between SORBS2 and TLR3 on chromosome 4 and are 80k and 20k downstream of SNP rs75718659, respectively.

Nine of the SNPs appear to have significant in pMLE-DX* but not in uLR or pMLE-DX. Two SNPs, rs7676342 () and rs13113778 (), again are located in the intergenic region between SORBS2 and TLR3 at Chromosome 4 and are 80k and 100k downstream of SNP rs75718659, respectively. Three SNPs, rs955302 (), rs4837254 () and rs12342331 (), are located at the intergenic region between ASTN2 and TLR4 at Chromosome 9 and are 400k, 430k, and 492k downstream of rs1360695 associated with Schizophrenia [11].

Estimates of βG, however, are generally larger in magnitude when estimated in the pMLE-DX and pMLE-DX* models.

Two SNPs appear to be associated with the diagnosis both in uLR and pMLE-DX. SNP rs7656500 (uLR and pMLE-DX ) is located in the intergenic region between KIAA0922 and TLR2 on chromosome 4, and is 163k upstream of rs727153 and 144k downstream of rs1466662, which were reported to be associated with Alzheimer’s disease in two studies [12,13]. It is also 54k upstream of rs7654093, associated with thrombosis [14]; 30k upstream of rs7659024, associated with venous thromboembolism [15]; 34k upstream of rs2066865, associated with venous thromboembolism [16, 17]; 52k upstream of rs6536024, associated with venous thromboembolism [18]; and 360k downstream of rs11099942, associated with type 2 diabetes [19].

Among six SNPs which appear to be significantly associated with the nuisance diagnosis in absence of an interactive effect, three SNPs rs1869617 (at the intergenic region between SORBS2 and TLR3 at Chromosome 4, pMLE- ), rs3775296 (at the UTR region of TLR3 at Chromosome 4, pMLE- ), rs7668666 (at the INTRON region of TLR3 at Chromosome 4, pMLE- ) locate 110k, 176k and 179k, respectively, downstream of rs75718659 reported associated with Alzheimer’s disease [10] and another two SNPs rs16905625 (pMLE- ) and rs1890047 (pMLE- ) locate at the intergenic region between ASTN2 and TLR4 at Chromosome 9, 412k and 428k, respectively downstream of rs1360695 reported associated with Schizophrenia [11].

Estimates of βε4 in the absence of interaction are generally larger in magnitude for the diagnosis of interest in pMLE-DX. For example, in a model with SNP rs1816702 (uLR and pMLE- ).

AGER.

All three SNPs in the AGER gene measured in the data are associated with susceptibility to AD as inferred by uLR, and are also associated with susceptibility to the nuisance disease when assessed by pMLE-DX*. rs3134940 has been previously reported in association with breast cancer, type I diabetes and other phenotypes (https://www.gwascentral.org/marker/HGVM1600838/results?t=ZERO); rs1035798 and rs2070600 have been previously reported in association with rheumatoid arthritis (https://www.gwascentral.org/marker/HGVM275161/results?t=ZERO and https://www.gwascentral.org/marker/HGVM571318/results?t=ZERO).

Discussion

We investigated whether disease heterogeneity among clinically diagnosed cases could introduce bias into the estimates of GxE interactions. We showed that when there is a strong association between the environmental variable and the relative risk of the disease of interest, as compared to the nuisance disease, there could be bias in either direction. We base our developments on the method of Chatterjee and Carroll [7], which is fully efficient when the genetic and environmental variables are distributed independently in the population, a population-based genetics model is assumed for the genetic factors, and the environmental variables are treated non-parametrically.

Interestingly, in our analyses, the estimates of regression coefficients differed qualitatively between the analyses that used the clinical diagnosis as a surrogate of the pathologic diagnosis and the analyses that used our newly proposed pseudo-likelihood approach, which incorporates the uncertainty of the clinical diagnosis. Specifically, in the TLR set, for 13% of the SNPs examined, GxE was found to be significant in relation to the clinical diagnosis, while the pseudo-likelihood analyses inferred these GxE to be not significant. On the other hand, for 14% of the SNPs that we examined, GxE was found to be statistically significant only when we incorporated the uncertainty in the relationship between the clinical and pathological diagnoses. This finding is consistent with the conclusion reached by a study of phenotypic misclassification among cases [20] in situations when the misclassification is non-differential, i.e. is not a function of the environmental variables. That study concluded that the presence of “non-cases” greatly decreased the estimates of risk attributed to the genetic variation.

One of the major concerns in the analysis of genetic studies has been the missing heritability: the genetic markers identified thus far explain only a small portion of the inter-person variability in familial clustering of complex diseases [21]. The downward biases in the estimates associating GxE with the clinically diagnosed disease status might in part explain the missing heritability. On the other hand, the upward biases in these estimates might in part account for the conclusion reached by [22] that only 1% of the associations found are likely to be true.

We examined estimates of the effects of the genetic variants, ApoE ε4 status, age, and sex, consistent with the original publication on this dataset [23]. Epidemiologic evidence [24] suggests that the following factors play an important role in AD risk: education/cognitive reserve, racial and ethnic differences, gender, smoking, drinking, head injury, diabetes, cardiovascular disease, obesity, social engagement, etc. However, not all of these factors have been consistently confirmed by subsequent studies, and considerable inconsistencies exist. For example, nicotine intake has been observed to decrease the risk of dementia, owing to the demonstrated ability of nicotine to stimulate neurotransmitter systems that are compromised in dementia [25]. More recent studies have suggested that nicotine intake may increase the risk of AD and also bring forward the age of onset, with an APOE interactive effect [26].

The main conclusion reached in this paper is that using the clinically diagnosed status can lead to severely biased estimates of GxE interactions when the frequency of the pathologic diagnosis of interest, as compared to other diagnoses, depends on the environment, and we aim to correct such biases by proposing a pseudo-likelihood method. The AD dataset is used mainly for illustration; therefore, for clarity, we restricted the variables to the minimum necessary instead of considering full risk prediction models, which might better describe the inter-patient variability in susceptibility to AD. Although other factors are potentially important in predicting the risk of AD, this relatively simple model was able to achieve the main goals of the current manuscript. By recognizing and accounting for potential case heterogeneity, which biases the gene-environment interaction estimates, our newly proposed method has the ability to remove this bias.

Define E to be the set of variables in the model, i.e., age and sex. Let O denote a set of key environmental variables omitted from the model. Addition of variables O would not modify the effect estimates of GxE beyond what is expected purely by chance if O does not interact with either G or E. Also, if conditional on the diagnosis of AD, GxE is independent of O, then omission of O does not change the effect estimate of GxE [27]. If, however, O interacts with GxE, then addition of these variables would change the effect estimate of GxE in the direction that is consistent with the direction of the GxE effect. Further studies that incorporate environmental variables, such as medical history, tobacco use, and infections, are needed to assess their potential to modify the risk and the estimates of GxE in particular.

Epigenetic mechanisms are well-recognized in the mediation of GxE and analysis of epigenetic changes at the genome scale can offer new insights into the relationship between brain epigenomes and AD. Further, candidate genes from epigenome-wide association studies interact with those from GWAS that can undergo epigenetic changes in their upstream gene regulatory elements [28]. However, an active conundrum is how the epigenetic mechanisms influence gene-environment interactions.

Appendix

Derivation of pseudo-likelihood (2) and covariance matrix

Derivation of the pseudo-likelihood (2) is straightforward.

Next we demonstrate that the derivative of the pseudo-likelihood (2) has zero mean when evaluated at the true parameters. The derivative of (2) with respect to Ω is

Let p(x,z|η) be the density of the environment.

Note the conditional probabilities

Hence

Therefore the derivative of the pseudo-likelihood has zero mean when evaluated at the true parameters. Evaluated at the true parameters the estimating function (2) takes the following form

Covariance matrix is then

Define

The covariance matrix can then be represented in the form Σ = Σ12−Λ.

Define and , then I = I1I2.

We note that and . Hence Σ = I1I2−Λ = I−Λ.

Supporting information

S1 Table. βG×ε4 ≠ 0.

The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included n0 controls and n1 cases. Frequency of ApoE ε4 allele in the population is 14%. Variables Z1 and Z2 are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population; and is 40% among the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele. Frequency of nuisance disease within the clinical diagnosis varies by ApoE4 status pr(D = 1*|DCL = 1,ε4−) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06.

https://doi.org/10.1371/journal.pone.0201140.s001

(DOCX)

S2 Table. βG×ε4 = 0.

The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included n0 controls and n1 cases. Frequency of ApoE ε4 allele in the population is 14%. Variables Z1 and Z2 are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population; and is 40% among the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele. Frequency of nuisance disease within the clinical diagnosis varies by ApoE4 status pr(D = 1*|DCL = 1,ε4−) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06.

https://doi.org/10.1371/journal.pone.0201140.s002

(DOCX)

S3 Table. Frequency of the nuisance disease is underestimated.

The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included n0 = 3000 controls and n1 = 3000 cases. Frequency of ApoE ε4 allele in the population is 14%. Variables Z1 and Z2 are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population; and is 40% among the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele. Frequency of nuisance disease within the clinical diagnosis varies by ApoE4 status pr(D = 1*|DCL = 1,ε4−) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06. The clinical-pathophysiological diagnoses relationship is misspecified to be pr(D = 1*|DCL = 1,ε4−) = 0.30 and pr(D = 1*|DCL = 1,ε4+) = 0.

https://doi.org/10.1371/journal.pone.0201140.s003

(DOCX)

S4 Table. Frequency of the nuisance disease is overestimated.

Bias and Root Mean Squared Error (RMSE) for parameter estimates based on a study of 500 simulated datasets with n0 controls and n1 cases with clinical phenotype. Analyses are based on the usual logistic regression model that ignores nuisance disease and on the pseudo-likelihood with (pMLE-DX) and without (pMLE) consideration of the clinical-pathological diagnoses relationship. Frequency of ApoE ε4 allele in the population is 14%. Variables Z1 and Z2 are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population; and is 40% among the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele. Frequency of nuisance disease within the clinical diagnosis varies by ApoE4 status pr(D = 1′|DCL = 1,ε4−) = 0.36 and pr(D = 1′|DCL = 1,ε4+) = 0.06. The clinical-pathological diagnoses relationship is misspecified to be pr(D = 1′|DCL = 1,ε4−) = 0.42 and pr(D = 1′|DCL = 1,ε4+) = 0.12.

https://doi.org/10.1371/journal.pone.0201140.s004

(DOCX)

S5 Table. and .

The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included n0 controls and n1 cases. Risk of the disease of interest is defined in a set of parameters ; while the risk of the nuisance disease is parametrized by . Frequency of ApoE ε4 allele in the population is 14%. Variables Z1 and Z2 are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequencies of the disease of interest and the nuisance disease are pr(D = 1) = 24.8%, pr(D = 1*) = 12.5%, pr(D = 1|ε4+) = 43%, pr(D = 1*|ε4+) = 16.1%, pr(D = 1|ε4−) = 20%, pr(D = 1*|ε4−) = 11.6%. Frequency of the nuisance disease within the clinical diagnosis varies by ApoE4 status pr(D = 1*|DCL = 1,ε4−) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06.

https://doi.org/10.1371/journal.pone.0201140.s005

(DOCX)

S6 Table. βG×ε4 = 0, .

The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included n0 controls and n1 cases. Risk of the disease of interest is defined in a set of parameters ; while the risk of the nuisance disease is parametrized by . Frequency of ApoE ε4 allele in the population is 14%. Variables Z1 and Z2 are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequencies of the disease of interest and the nuisance disease are pr(D = 1) = 24.8%, pr(D = 1*) = 12.5%, pr(D = 1|ε4+) = 43%, pr(D = 1*|ε4+) = 16.1%, pr(D = 1|ε4−) = 20%, pr(D = 1*|ε4−) = 11.6%. Frequency of the nuisance disease within the clinical diagnosis varies by ApoE4 status pr(D = 1*|DCL = 1,ε4−) = 0.36 and pr(D = 1*|DCL = 1,ε4+) = 0.06.

https://doi.org/10.1371/journal.pone.0201140.s006

(DOCX)

S7 Table. Parameter estimates in Alzheimer’s disease study.

Analyses are performed using the usual logistic regression (uLR), which uses the clinical diagnosis as the outcome, and using the pseudo-likelihood method, which assumes that the proportion of nuisance disease within clinically diagnosed AD is 36% for ε4 non-carriers and 6% for ε4 carriers. The pseudo-likelihood analysis pMLE-DX estimates parameters for D = 1 vs. D = 0 and D = 1* combined. The pseudo-likelihood analysis pMLE-DX*, however, estimates two sets of risk coefficients, i.e., βs for D = 0 vs. D = 1 and β*s for D = 0 vs. D = 1*.

https://doi.org/10.1371/journal.pone.0201140.s007

(DOCX)

S8 Table. SNPs previously reported in GWAS that are within 500k up- or downstream of SNPs that we inferred in Alzheimer’s disease study.

(Table 5, SNPs whose effect estimates of βG and/or βG×ε4 have a permutation-based p-value <0.05).

https://doi.org/10.1371/journal.pone.0201140.s008

(DOCX)

Acknowledgments

Dr. Lobach and Dr. Zhang are supported by 5R21AG043710-02.

The Alzheimer’s disease dataset is available in the Database of Genotypes and Phenotypes study accession number phs000372.v1.p1

(https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000372.v1.p1).

Genotyping is performed by Alzheimer's Disease Genetics Consortium (ADGC), U01 AG032984, RC2 AG036528. Phenotypic collection is coordinated by the National Alzheimer's Coordinating Center (NACC), U01 AG016976.

Samples from the National Cell Repository for Alzheimer’s Disease (NCRAD), which receives government support under a cooperative agreement grant (U24 AG21886) awarded by the National Institute on Aging (NIA), were used in this study. We thank contributors who collected samples used in this study, as well as patients and their families, whose help and participation made this work possible.

Data for this study were prepared, archived, and distributed by the National Institute on Aging Alzheimer’s Disease Data Storage Site (NIAGADS) at the University of Pennsylvania (U24-AG041689-01).

References

  1. Potter H, Wisniewski T. Apolipoprotein E: essential catalyst of the Alzheimer amyloid cascade. International Journal of Alzheimer’s Disease. 2012; http://dx.doi.org/10.1155/2012/489428
  2. Salloway S, Sperling R. Understanding conflicting neurological findings in patients clinically diagnosed as having Alzheimer Dementia. JAMA Neurology. 2015; 72(10): 1106–1108. pmid:26302229
  3. Shaw AC, Panda A, Joshi SR, Qian F, Allore HG, Montgomery RR. Dysregulation of human toll-like receptor function in aging. Ageing Research Reviews. 2011 Jul; 10(3): 346–53. pmid:21074638
  4. Ramasamy R, Vannucci SJ, Yan SSD, Herold K, Yan SF, Schmidt AM. Advanced glycation end products and RAGE: a common thread in aging, diabetes, neurodegeneration, and inflammation. Glycobiology. 2005; 15(7): 16R–28R. pmid:15764591
  5. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models: a modern perspective. 2nd ed. Chapman and Hall/CRC; 2006.
  6. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979; 66(3): 403–411.
  7. Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005; 92(2): 399–418.
  8. Hardy GH. Mendelian Proportions in a Mixed Population. Science. 1908; 28(706): 49–50. pmid:17779291
  9. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B. 1995; 57: 289–300.
  10. Herold C, Hooli BV, Mullin K, Liu T, Roehr JT, Mattheisen M, et al. Family-based association analyses of imputed genotypes reveal genome-wide significant association of Alzheimer’s disease with OSBPL6, PTPRG and PDCL3. Molecular Psychiatry. 2016; 21(11): 1608–1612. pmid:26830138
  11. Goes FS, McGrath J, Avramopoulos D, Wolyniec P, Pirooznia M, Ruczinki I, et al. Genome-wide association study of schizophrenia in Ashkenazi Jews. American Journal of Medical Genetics. 2015; 168(8): 649–659. pmid:26198764
  12. Abraham R, Moskvina V, Sims R, Hollingworth P, Morgan A, Georgieva L, et al. A genome-wide association study for late-onset Alzheimer’s disease using DNA pooling. BMC Medical Genomics. 2008; 29: 1–44.
  13. Kamboh MI, Barmada MM, Demirci FY, Minster RL, Carrasquillo MM, Pankratz VS, et al. Genome-wide association analysis of age-at-onset in Alzheimer's disease. Molecular Psychiatry. 2012 Dec; 17(12): 1340–6. pmid:22005931
  14. Hinds DA, Buil A, Ziemek D, Martinez-Perez A, Malik R, Folkersen L, et al. Genome-wide association analysis of self-reported events in 6135 individuals and 252 827 controls identifies 8 loci associated with thrombosis. Human Molecular Genetics. 2016; 25(9): 1867–1874. pmid:26908601
  15. Germain M, Saut N, Greliche N, Dina C, Lambert JC, Perret C, et al. Genetics of venous thrombosis: insights from a new genome wide association study. PLOS One. 2011; 6(9): e25581. pmid:21980494
  16. Germain M, Chasman DI, de Haan H, Tang W, Lindström S, Weng LC, et al. Meta-analysis of 65,734 individuals identifies TSPAN15 and SLC44A2 as two susceptibility loci for venous thromboembolism. American Journal of Human Genetics. 2015 Apr 2; 96(4): 532–42. pmid:25772935
  17. Klarin D, Emdin CA, Natarajan P, Conrad MF, INVENT Consortium, Kathiresan S. Genetic Analysis of Venous Thromboembolism in UK Biobank Identifies the ZFPM2 Locus and Implicates Obesity as a Causal Risk Factor. Circulation Cardiovascular Genetics. 2017 Apr; 10(2). pii: e001643. pmid:28373160
  18. Tang W, Teichert M, Chasman DI, Heit JA, Morange PE, Li G, et al. A genome-wide association study for venous thromboembolism: the extended Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium. Genetic Epidemiology. 2013; 37(5): 512–521.
  19. Hamet P, Haloui M, Harvey F, Marois-Blanchet FC, Sylvestre MP, Tahir MR, et al. PROX1 gene CC genotype as a major determinant of early onset of type 2 diabetes in slavic study participants from Action in Diabetes and Vascular Disease: Preterax and Diamicron MR Controlled Evaluation study. Journal of Hypertension. 2017 May; 35 Suppl 1: S24–S32. pmid:28060188
  20. Manchia M, Cullis J, Gustavo T, Rouleau GY, Uher R, Alda M. The impact of phenotypic and genetic heterogeneity on results of genome-wide association studies of complex diseases. PLOS One. 2013; 8(10): e76295. pmid:24146854
  21. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009 Oct 8; 461(7265): 747–53. pmid:19812666
  22. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet Med. 2002; 2: 45–61.
  23. Naj AC, Jun G, Beecham GW, Wang LS, Vardarajan BN, Buros J, et al. Common variants at MS4A4/MS4A6E, CD2AP, CD33 and EPHA1 are associated with late-onset Alzheimer’s. Nature Genetics. 2011 May; 43(5): 436–41. pmid:21460841
  24. Ritchie K, Carriere I, Ritchie CW, Berr C, Artero S, Ancelin ML. Designing prevention programs to reduce incidence of dementia: prospective cohort study of modifiable risk factors. British Medical Journal. 2010 Aug 5; 341: c3885. pmid:20688841
  25. Van Duijn CM, Hofman A. Relation between nicotine intake and Alzheimer’s disease. British Medical Journal. 1991; 22: 1491–1494.
  26. Lee C, Alekseenko A, Brown T. Exploring the future of bioinformatics data sharing and mining with Pygr and Worldbase. Proceedings of the 8th Python in Science Conference (SciPy 2009). 2009.
  27. Hauck WW, Neuhaus JM, Kalbfleisch JD, Anderson S. A consequence of omitted covariates when estimating odds ratios. J Clin Epidemiol. 1991; 44(1): 77–81. pmid:1986061
  28. Hoffman A, Sportelli V, Ziller M, Spengler D. Driver or Passenger: Epigenomes in Alzheimer’s disease. Epigenomes. 2017; 1(1): 5.