Likelihood Ratio Test for Excess Homozygosity at Marker Loci on X Chromosome

  • Xiao-Ping You,

    Affiliation State Key Laboratory of Organ Failure Research and Guangdong Provincial Key Laboratory of Tropical Research, School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China

  • Qi-Lei Zou,

    Affiliation State Key Laboratory of Organ Failure Research and Guangdong Provincial Key Laboratory of Tropical Research, School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China

  • Jian-Long Li,

    Affiliation State Key Laboratory of Organ Failure Research and Guangdong Provincial Key Laboratory of Tropical Research, School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China

  • Ji-Yuan Zhou

    zhoujiyuan5460@hotmail.com

    Affiliation State Key Laboratory of Organ Failure Research and Guangdong Provincial Key Laboratory of Tropical Research, School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China

Abstract

The assumption of Hardy-Weinberg equilibrium (HWE) is generally required for association analysis using a case-control design on autosomes; otherwise, the size of the test may be inflated. There has been increasing interest in exploring the association between diseases and markers on the X chromosome and the effect of departure from HWE on association analysis on the X chromosome. There are two hypotheses of interest regarding the X chromosome: (i) the frequencies of the same allele at a locus in males and females are equal and (ii) the inbreeding coefficient in females is zero (no excess homozygosity). Thus, excess homozygosity and significantly different minor allele frequencies between males and females are used to filter X-linked variants. There are two existing methods to test for (i) and (ii), respectively. However, their size and power have not yet been studied. Further, no existing method tests both hypotheses simultaneously. Therefore, in this article, we propose a novel likelihood ratio test for both (i) and (ii) on the X chromosome. To further investigate why the null hypothesis is statistically rejected, we also develop two likelihood ratio tests for detecting (i) and (ii), respectively. Moreover, we explore the effect of population stratification on the proposed tests. In our simulation study, the size of the test for (i) is close to the nominal significance level. However, the excess homozygosity test and the test for both (i) and (ii) are conservative. So, we propose parametric bootstrap techniques to evaluate their validity and performance. Simulation results show that the proposed methods with bootstrap techniques control the size well under the respective null hypotheses. Power comparison demonstrates that the methods with bootstrap techniques are more powerful than those without the bootstrap procedure and than the existing methods. The application of the proposed methods to a rheumatoid arthritis dataset indicates their utility.

Introduction

Association analysis is a useful tool to map disease loci by using markers on autosomes based on family data and case-control data [1–9]. There has been increasing interest in exploring the association between diseases and markers on the X chromosome and the effect of departure from Hardy-Weinberg equilibrium (HWE) on association analysis on the X chromosome [10–17]. In X-specific quality control, there are two hypotheses of interest regarding the X chromosome: (i) the frequencies of the same allele at a locus in males and females are equal and (ii) the inbreeding coefficient in females is zero (no excess homozygosity) [18, 19]. As such, excess homozygosity in females and significantly different minor allele frequencies between males and females are used to filter X-linked variants [20, 21]. The inbreeding coefficient is generally estimated by functions of excess homozygosity [22, 23], which may be caused by population substructure, consanguineous mating or factors such as null alleles [24, 25]. Overall and Nichols developed an approach to distinguish population substructure from consanguinity by using multilocus genotype data [24]. On the other hand, Zheng et al. proposed two test statistics to test for the equality of the frequencies of the same allele in males and females and for a zero inbreeding coefficient in females on the X chromosome, respectively [14]. However, they focused only on association analysis on the X chromosome, and the type I error rates and powers of these two test statistics have not yet been studied. Further, no existing method tests both hypotheses simultaneously.

Therefore, in this article, we first combine the two test statistics proposed in Zheng et al. [14] and suggest Z0 to simultaneously test for (i) the equality of the frequencies of the same allele in males and females and (ii) the zero inbreeding coefficient on the X chromosome based on the collected sample. To improve the power of the test for both (i) and (ii), a novel likelihood ratio test on the X chromosome is proposed. We write out the likelihood functions of the collected sample under the null hypothesis and the alternative hypothesis at a single locus on the X chromosome, respectively. Next, we obtain the maximum likelihood estimates (MLEs) of the unknown parameters by expectation-maximization (EM) algorithms [26] and construct the corresponding likelihood ratio test (LRT0) statistic to test for both (i) and (ii). If the null hypothesis is statistically rejected, we further carry out two hypothesis tests to find out why the null hypothesis is violated, by proposing two more likelihood ratio tests: LRT1 (for the equality of the frequencies of the same allele in males and females) and LRT2 (for excess homozygosity). In our simulation study, LRT0 and LRT2 are conservative. As such, we use parametric bootstrap techniques to evaluate the validity and performance of LRT0 and LRT2; the bootstrap versions are denoted by LRT0b and LRT2b, respectively. Moreover, we explore the effect of population stratification on the proposed tests. In addition, the root mean squared error (RMSE) and bias are used to assess the accuracy of the MLEs of the unknown parameters. Finally, the application of the proposed methods to a rheumatoid arthritis (RA) dataset indicates their utility.

Materials and Methods

Background and notations

Consider a biallelic marker locus on the X chromosome with alleles M1 and M2. Let pm and pf be the frequencies of M1 in males and females, respectively. As such, the frequencies of M2 in males and females are qm = 1 − pm and qf = 1 − pf, respectively. In females, let ρ be the inbreeding coefficient, which is generally nonnegative [27–29]. Thus, the frequencies of the three genotypes M1M1, M1M2 and M2M2 in females can be expressed as follows:

$$P(M_1M_1) = p_f^2 + \rho p_f q_f, \qquad P(M_1M_2) = 2(1-\rho) p_f q_f, \qquad P(M_2M_2) = q_f^2 + \rho p_f q_f.$$

Hence, there is no excess homozygosity in females when ρ = 0; excess homozygosity exists when ρ > 0. Note that pm ≠ pf may be true on the X chromosome. So, we construct the null hypothesis H0: pm = pf and ρ = 0 to test for both of the hypotheses (i) and (ii). If the null hypothesis is violated, we need to investigate which of pm ≠ pf and ρ > 0 is true. As such, we have two further hypothesis tests with the null hypotheses H01: pm = pf and H02: ρ = 0, respectively. It should be noted that the X chromosome is subject to X chromosome inactivation and dosage compensation [30], but we do not consider them in this section. The corresponding discussion can be found later (see the Discussion section).
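As a small illustration of this parameterization, the sketch below evaluates the three female genotype frequencies for a given pf and ρ, assuming the standard inbreeding model in which each homozygote frequency is raised by ρ·pf·qf and the heterozygote frequency is reduced by the factor (1 − ρ). The function name `female_genotype_freqs` is hypothetical and only used here.

```python
def female_genotype_freqs(p_f, rho):
    """Return (P(M1M1), P(M1M2), P(M2M2)) in females.

    Assumes the standard inbreeding parameterization: each homozygote
    frequency gains rho * p_f * q_f relative to Hardy-Weinberg proportions.
    """
    q_f = 1.0 - p_f
    excess = rho * p_f * q_f
    return (p_f ** 2 + excess,
            2.0 * (1.0 - rho) * p_f * q_f,
            q_f ** 2 + excess)


# rho = 0 gives Hardy-Weinberg proportions; rho > 0 shifts mass to homozygotes.
hwe = female_genotype_freqs(0.3, 0.0)      # approximately (0.09, 0.42, 0.49)
inbred = female_genotype_freqs(0.3, 0.1)   # approximately (0.111, 0.378, 0.511)
```

With pf = 0.3, moving from ρ = 0 to ρ = 0.1 lowers the heterozygote frequency from 0.42 to 0.378 while the three frequencies still sum to one.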

Assume that n1m and n0m represent the numbers of males with alleles M1 and M2 in a collected sample, respectively; n2f, n1f and n0f denote the numbers of females with genotypes M1M1, M1M2 and M2M2, respectively. Then, Nm = n1m + n0m and Nf = n2f + n1f + n0f are respectively the numbers of males and females in the sample, and N = Nm + Nf is the sample size.

Existing methods Z1 and Z2 for H01 (equality of the frequencies of the same allele in males and females) and H02 (zero inbreeding coefficient), respectively

Zheng et al. [14] proposed the test statistic Z1 to test for H01: pm = pf. It is built from the estimates $\hat{p}_m = n_{1m}/N_m$ and $\hat{p}_f = (2n_{2f} + n_{1f})/(2N_f)$ of pm and pf and from the estimates of the variances of $\hat{p}_m$ and $\hat{p}_f$ under H01; the explicit expressions are given in [14]. Under H01, Z1 asymptotically follows the chi-square distribution with one degree of freedom when the sample size is large enough.

Weir and Cockerham [31] introduced the disequilibrium coefficient in females, $\Delta_f = P(M_1M_1) - p_f^2 = \rho p_f q_f$. In other words, testing for Δf = 0 is equivalent to testing for ρ = 0. Hence, Zheng et al. [14] further developed the test statistic Z2 to test for H02: ρ = 0, which is based on the estimate of Δf obtained from the female genotype counts (see [14] for the explicit expression). Under H02, Z2 approximately follows the chi-square distribution with one degree of freedom when Nf is large enough. It should be noted that Z2 does not involve male individuals and thus only needs female individuals.

Z0 test for both hypotheses (i) and (ii) of interest regarding the X chromosome

Zheng et al. [14] showed that, under H0: pm = pf and ρ = 0, Z1 and Z2 are independent. However, they did not propose the corresponding test statistic for H0. As such, we suggest the test statistic Z0 = Z1 + Z2 to test for H0: pm = pf and ρ = 0. Under H0, Z0 asymptotically follows the chi-square distribution with two degrees of freedom. Moreover, since Δf = ρ pf qf, the estimate of Δf divided by the estimate of pf qf can be used to estimate the inbreeding coefficient ρ.
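The combination step can be sketched as follows, assuming the values of Z1 and Z2 have already been computed elsewhere. The chi-square tail probabilities for one and two degrees of freedom have simple closed forms, so only the standard library is needed. The function names and the input values 3.0 and 4.0 are hypothetical.

```python
import math


def chi2_sf_df1(x):
    """Upper tail P(X > x) of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_df2(x):
    """Upper tail P(X > x) of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def combine_z(z1, z2):
    """Return Z0 = Z1 + Z2 and its asymptotic P-value (2 degrees of freedom)."""
    z0 = z1 + z2
    return z0, chi2_sf_df2(z0)


# Hypothetical statistic values, for illustration only.
z0, p0 = combine_z(3.0, 4.0)   # z0 = 7.0, p0 = exp(-3.5), about 0.030
p1 = chi2_sf_df1(3.0)          # marginal P-value of Z1 alone
```

In this made-up case neither the sum nor Z1 alone needs to be significant for the other to be; the combined test simply refers Z0 to a distribution with one more degree of freedom.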

Likelihood ratio test for both hypotheses (i) and (ii) of interest regarding the X chromosome

To construct a likelihood ratio test (LRT) for H0: pm = pf and ρ = 0, we give the likelihood function of the sample as follows:

$$L(\theta) \propto p_m^{n_{1m}} q_m^{n_{0m}} \left(p_f^2 + \rho p_f q_f\right)^{n_{2f}} \left[2(1-\rho) p_f q_f\right]^{n_{1f}} \left(q_f^2 + \rho p_f q_f\right)^{n_{0f}}, \tag{1}$$

where θ = (pm, pf, ρ). First, we use the following EM algorithm to estimate the unknown parameters pm, pf and ρ under the alternative hypothesis (H1: pm ≠ pf or ρ > 0). Suppose that Y = (Y1, Y2, Y3, Y4, Y5) = (n1m, n0m, n2f, n1f, n0f) denotes the observed data. The data can be augmented by splitting the third cell into two cells W1 and W2, which are unobservable random variables such that Y3 = W1 + W2 for the female homozygote M1M1, and by splitting the fifth cell into two cells W3 and W4 such that Y5 = W3 + W4 for the female homozygote M2M2. Given Y3, W1 and W2 follow binomial distributions, as do W3 and W4 given Y5. The likelihood function of the complete data (n1m, n0m, w1, w2, n1f, w3, w4) is then written in terms of these seven cells, with the normalizing constant omitted for brevity.

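At the E-step, the Q function at iteration (k + 1) is constructed as the conditional expectation of the complete-data log-likelihood given the observed data Y, evaluated at θ(k), where θ(k) is the estimate of θ at iteration k.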

At the M-step, the estimated value θ(k+1) of θ at iteration (k + 1) is obtained by maximizing the Q function with respect to θ, which yields explicit updates for the MLEs of pm, pf and ρ at iteration (k + 1).

Note that the MLE of pm is the same for all the iterations and coincides with the estimate in Zheng et al. [14]. The updates for pf and ρ depend on the conditional expectations Eθ(k)(w1|n2f), Eθ(k)(w2|n2f), Eθ(k)(w3|n0f) and Eθ(k)(w4|n0f), which are given in Eqs (2)–(5). Given the initial value θ(0) of θ, the above-mentioned two steps are repeated until the convergence criterion is satisfied, for example, until the absolute differences between the estimates of the parameters at two consecutive iterations are all less than $10^{-7}$. The value of θ obtained at the last iteration is taken as the MLE of θ under H1.
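The following is a minimal sketch of an EM iteration of this kind, not the XHWE implementation. It assumes one particular augmentation: each female homozygote cell is split into an "identical by descent" part (probability ρ·pf or ρ·qf) and a remaining part (probability (1 − ρ)·pf² or (1 − ρ)·qf²), which reproduces the genotype frequencies of the inbreeding model and gives explicit M-step updates. The function name, the default starting value ρ = 0.02, the tolerance and the iteration cap are illustrative choices, and the example counts are made up.

```python
def em_alternative(n1m, n0m, n2f, n1f, n0f, rho0=0.02, tol=1e-7, max_iter=1000):
    """EM estimates of (p_m, p_f, rho) under H1, with rho kept in [0, 1].

    Assumption of this sketch: homozygote cells are split into an
    identical-by-descent part and a non-identical-by-descent part.
    Requires at least one male and one female in the sample.
    """
    n_m = n1m + n0m
    n_f = n2f + n1f + n0f
    p_m = n1m / n_m                      # explicit MLE, unchanged by the iterations
    p_f = (2 * n2f + n1f) / (2 * n_f)    # starting value: female allele frequency
    rho = rho0
    for _ in range(max_iter):
        q_f = 1.0 - p_f
        # E-step: expected identical-by-descent counts within each homozygote cell.
        w1 = n2f * rho / (rho + (1.0 - rho) * p_f) if n2f else 0.0
        w3 = n0f * rho / (rho + (1.0 - rho) * q_f) if n0f else 0.0
        w2 = n2f - w1
        w4 = n0f - w3
        # M-step: explicit maximizers of the complete-data log-likelihood.
        new_rho = (w1 + w3) / n_f
        m1_alleles = w1 + 2.0 * w2 + n1f   # independent copies of M1
        m2_alleles = w3 + 2.0 * w4 + n1f   # independent copies of M2
        new_p_f = m1_alleles / (m1_alleles + m2_alleles)
        converged = abs(new_rho - rho) < tol and abs(new_p_f - p_f) < tol
        p_f, rho = new_p_f, new_rho
        if converged:
            break
    return p_m, p_f, rho


# Made-up counts that match p_f = 0.3 and rho = 0.1 exactly in 1000 females.
p_m_hat, p_f_hat, rho_hat = em_alternative(300, 700, 111, 378, 511)
```

For these counts the iteration settles near pf = 0.3 and ρ = 0.1, the values used to construct the female genotype counts, while the male estimate is simply 300/1000.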

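Note that pm = pf and ρ = 0 under H0. Let p = pm = pf, the pooled allele frequency of M1, and q = 1 − p. Then, L(θ) in Eq (1) can be rewritten as

$$L(p) \propto p^{\,n_{1m} + 2n_{2f} + n_{1f}}\, q^{\,n_{0m} + n_{1f} + 2n_{0f}}.$$

Thus, the MLE of p under H0 is $\hat{p} = (n_{1m} + 2n_{2f} + n_{1f})/(N_m + 2N_f)$, the estimated pooled allele frequency of M1. Let $\hat{\theta}$ denote the MLE of θ under H1 and let $\hat{\theta}_0 = (\hat{p}, \hat{p}, 0)$. Then, we can construct the following LRT to test for H0:

$$\mathrm{LRT}_0 = -2 \log \frac{L(\hat{\theta}_0)}{L(\hat{\theta})}, \tag{6}$$

which asymptotically follows a chi-square distribution with two degrees of freedom when the null hypothesis holds.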

Likelihood ratio test for equality of frequencies of the same allele in males and females

Once the null hypothesis (H0: pm = pf and ρ = 0) is rejected based on the result of Eq (6), we further need to consider the following two tests: H01: pm = pf and H02: ρ = 0. Note that under the null hypothesis H01: pm = pf = p, ρ may not be zero and we need to estimate it. Let ϕ = (p, ρ) and q = 1 − p. The corresponding likelihood function of the complete data is obtained from that under H1 by setting pm = pf = p.

We use an EM algorithm of the same form to estimate ϕ under H01, which yields the MLEs of p and ρ at iteration (k + 1). The conditional expectations Eϕ(k)(w1|n2f), Eϕ(k)(w2|n2f), Eϕ(k)(w3|n0f) and Eϕ(k)(w4|n0f) are similar to Eθ(k)(w1|n2f), Eθ(k)(w2|n2f), Eθ(k)(w3|n0f) and Eθ(k)(w4|n0f) in Eqs (2)–(5), with the iteration-k estimates under H1 replaced by the corresponding iteration-k estimates under H01. Let $\tilde{p}$ and $\tilde{\rho}$ denote the resulting MLEs of p and ρ under H01, and let $\hat{\theta}_1 = (\tilde{p}, \tilde{p}, \tilde{\rho})$. Then, we propose the following test statistic LRT1 to test for the null hypothesis H01: pm = pf,

$$\mathrm{LRT}_1 = -2 \log \frac{L(\hat{\theta}_1)}{L(\hat{\theta})}, \tag{7}$$

where $\hat{\theta}$ is the MLE of θ under H1. LRT1 approximately follows a chi-square distribution with one degree of freedom under H01.

Likelihood ratio test for inbreeding coefficient being zero

Note that under the null hypothesis H02 : ρ = 0, pm and pf may be different from each other and we need to estimate them separately. Let ψ = (pm, pf); then L(θ) in Eq (1) can be rewritten as

$$L(\psi) \propto p_m^{n_{1m}} q_m^{n_{0m}}\, p_f^{\,2n_{2f} + n_{1f}}\, q_f^{\,n_{1f} + 2n_{0f}}.$$

Then, the MLEs of pm and pf are $\hat{p}_m = n_{1m}/N_m$ and $\hat{p}_f = (2n_{2f} + n_{1f})/(2N_f)$, respectively, which are the same as those in Zheng et al. [14]. Let $\hat{\theta}_2 = (\hat{p}_m, \hat{p}_f, 0)$. As such, we develop the following test statistic to test for H02 : ρ = 0:

$$\mathrm{LRT}_2 = -2 \log \frac{L(\hat{\theta}_2)}{L(\hat{\theta})}, \tag{8}$$

where $\hat{\theta}$ is the MLE of θ under H1. LRT2 asymptotically follows a chi-square distribution with one degree of freedom under H02. Just like the Z2 test statistic, LRT2 only uses female individuals in the sample, because the terms based on male individuals in the numerator and the denominator of the fraction are the same and cancel.

Likelihood ratio tests via parametric bootstrap for H0 and H02

Our simulation results (see the Results section) show that the simulated type I error rates of LRT0 and LRT2, for H0 and H02 respectively, are too conservative. On the other hand, several studies have shown that likelihood ratio statistics may not follow a chi-square distribution asymptotically [31, 32], and hence their distributions can be approximated by Monte Carlo simulation [33]. Accordingly, we make use of parametric bootstrap techniques to evaluate the size and power of these two methods. For convenience, we denote the parametric bootstrap versions by LRT0b and LRT2b, respectively. The implementation steps for LRT0b are as follows:

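  1. For a collected sample of size N with Nm males and Nf females, calculate the value of LRT0;
  2. Compute the estimated pooled allele frequency based on the sample as follows: $\hat{p} = (n_{1m} + 2n_{2f} + n_{1f})/(N_m + 2N_f)$;
  3. Based on $\hat{p}$, calculate the frequencies of the three genotypes M1M1, M1M2 and M2M2 in females under H0 as $\hat{p}^2$, $2\hat{p}\hat{q}$ and $\hat{q}^2$, respectively, where $\hat{q} = 1 - \hat{p}$;
  4. According to $\hat{p}$ and $\hat{q}$, regenerate the alleles of the Nm males; based on $\hat{p}^2$, $2\hat{p}\hat{q}$ and $\hat{q}^2$, regenerate the genotypes of the Nf females;
  5. Calculate the value of LRT0 based on the new Nm males and Nf females, denoted by $\mathrm{LRT}_0^{(1)}$;
  6. Repeat Steps 4 and 5 B times, which results in B test statistics $\mathrm{LRT}_0^{(1)}$, $\mathrm{LRT}_0^{(2)}$, …, $\mathrm{LRT}_0^{(B)}$;
  7. The P-value of the original LRT0 can be estimated as the proportion of bootstrap statistics that are at least as large as the observed one, $\#\{b : \mathrm{LRT}_0^{(b)} \ge \mathrm{LRT}_0\}/B$.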

For LRT2b, the steps are similar to those above. First, after obtaining the value of LRT2, calculate the frequencies of the three genotypes M1M1, M1M2 and M2M2 in females under H02 as $\hat{p}_f^2$, $2\hat{p}_f\hat{q}_f$ and $\hat{q}_f^2$, respectively, with $\hat{q}_f = 1 - \hat{p}_f$. The alleles of the Nm males stay the same as in the original sample; only the genotypes of the Nf females are regenerated according to $\hat{p}_f^2$, $2\hat{p}_f\hat{q}_f$ and $\hat{q}_f^2$. Then, carrying out procedures analogous to Steps 4–7 gives the estimated P-value of LRT2.
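The LRT0b steps can be sketched in a few lines, under two simplifying assumptions that are not taken from the XHWE software: the fit under H1 uses a shortcut (male and female allele frequencies estimated separately, and ρ estimated as one minus the ratio of observed to expected female heterozygosity, truncated at zero) instead of the EM algorithm, and the P-value counts bootstrap statistics that are greater than or equal to the observed one. All function names and the example counts are hypothetical.

```python
import math
import random


def _xlogy(count, prob):
    """count * log(prob), with the convention 0 * log(0) = 0."""
    return count * math.log(prob) if count > 0 else 0.0


def loglik(counts, p_m, p_f, rho):
    """Log-likelihood of (n1m, n0m, n2f, n1f, n0f), up to an additive constant."""
    n1m, n0m, n2f, n1f, n0f = counts
    q_m, q_f = 1.0 - p_m, 1.0 - p_f
    return (_xlogy(n1m, p_m) + _xlogy(n0m, q_m)
            + _xlogy(n2f, p_f ** 2 + rho * p_f * q_f)
            + _xlogy(n1f, 2.0 * (1.0 - rho) * p_f * q_f)
            + _xlogy(n0f, q_f ** 2 + rho * p_f * q_f))


def lrt0(counts):
    """LRT statistic for H0: pm = pf and rho = 0 (shortcut fit under H1)."""
    n1m, n0m, n2f, n1f, n0f = counts
    n_m, n_f = n1m + n0m, n2f + n1f + n0f
    p_m = n1m / n_m
    p_f = (2 * n2f + n1f) / (2 * n_f)
    expected_het = 2.0 * p_f * (1.0 - p_f)
    # rho restricted to be nonnegative: truncate the moment-type estimate at 0.
    rho = max(0.0, 1.0 - (n1f / n_f) / expected_het) if expected_het > 0 else 0.0
    p = (n1m + 2 * n2f + n1f) / (n_m + 2 * n_f)   # pooled estimate under H0
    stat = 2.0 * (loglik(counts, p_m, p_f, rho) - loglik(counts, p, p, 0.0))
    return max(0.0, stat)   # guard against tiny negative rounding errors


def simulate_null(p, n_m, n_f, rng):
    """Steps 3-4: regenerate male alleles and female genotypes under H0."""
    n1m = sum(rng.random() < p for _ in range(n_m))
    n2f = n1f = 0
    for _ in range(n_f):
        u = rng.random()
        if u < p * p:
            n2f += 1
        elif u < p * p + 2.0 * p * (1.0 - p):
            n1f += 1
    return (n1m, n_m - n1m, n2f, n1f, n_f - n2f - n1f)


def lrt0_bootstrap_pvalue(counts, n_boot=1000, seed=0):
    """Steps 1-7: observed LRT0 and its parametric bootstrap P-value."""
    rng = random.Random(seed)
    n1m, n0m, n2f, n1f, n0f = counts
    n_m, n_f = n1m + n0m, n2f + n1f + n0f
    observed = lrt0(counts)
    p = (n1m + 2 * n2f + n1f) / (n_m + 2 * n_f)
    exceed = sum(lrt0(simulate_null(p, n_m, n_f, rng)) >= observed
                 for _ in range(n_boot))
    return observed, exceed / n_boot


# Made-up sample with a clear heterozygote deficit in females.
stat, p_value = lrt0_bootstrap_pvalue((60, 140, 60, 40, 100), n_boot=200)
```

A female-only variant for LRT2b would follow the same pattern, keeping the male counts fixed and regenerating only the female genotypes from the female allele frequency.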

Software implementation

We have written the XHWE software in R (http://www.r-project.org), which includes the eight test statistics: LRT0, LRT0b, LRT1, LRT2, LRT2b, Z0, Z1 and Z2. The R package named XHWE is available on CRAN (http://cran.r-project.org/web/packages/XHWE/). The initial values of pm, pf, p and ρ in the EM algorithms are taken to be n1m/Nm, (2n2f + n1f)/(2Nf), (n1m + 2n2f + n1f)/(Nm + 2Nf) and 0.02, respectively. The convergence criterion for the LRT-type statistics is that the absolute differences between the estimates of the parameters at two consecutive iterations are all less than $10^{-7}$. The default maximum number of iterations is 1000. The input data file is in the standard pedigree format. The XHWE software only uses the founders with available genotypes and analyzes marker loci one by one. The software outputs the values of all the test statistics and the corresponding P-values. It also outputs the estimates of all the parameters under both the null and alternative hypotheses for each test statistic. The parameter estimates under the alternative hypothesis are the same for all the LRT-type test statistics. However, under the respective null hypotheses of the LRT-type test statistics, the estimates may be different. It should be noted that the estimates of pm and pf under the null hypothesis H02 in this article are the same as those in Zheng et al. [14]. The output results are automatically saved in the text file named “results.txt”.

Simulation settings

A simulation study is conducted to assess the performance of the proposed LRT0, LRT0b, LRT1, LRT2, LRT2b and Z0 test statistics and to compare them with the existing Z1 and Z2 under various simulation settings, which are similar to those in Zheng et al. [14]. The allele frequency pm in males takes two values: 0.3 and 0.5. When pm is fixed, the value of pf in females is taken as pf = pm + ϵ, where ϵ = 0, ±0.04 and ±0.05. The inbreeding coefficient ρ in females takes the values 0, 0.05 and 0.1. The sample size is taken as 800 and 1200 with the ratio r = Nm : Nf being 2:1, 1.5:1, 1:1, 1:1.5 and 1:2. When pm = pf and ρ = 0, the size of all the eight test statistics is simulated; when pm = pf and ρ > 0, the size of LRT1 and Z1 is obtained; when pm ≠ pf and ρ = 0, the size of LRT2, LRT2b and Z2 is obtained. Otherwise, we simulate the corresponding powers. In addition, for the fixed sample sizes (800 or 1200) above, the powers of the three test statistics LRT2, LRT2b and Z2 for H02 : ρ = 0 are not large, as the simulation results below show. On the other hand, these three test statistics only use female individuals. As such, we further obtain the sample size Nf required for LRT2b to reach 80% simulated power and then simulate the size and powers of LRT2, LRT2b and Z2 under this sample size. To investigate how population structure affects the proposed methods, we also consider the following population stratification model with two subpopulations. In the first (second) subpopulation, pm = 0.3 (0.5) and pf = pm + ϵ with ϵ = 0, ±0.04 and ±0.05; the ϵ values in the two subpopulations are denoted by ϵ1 and ϵ2, respectively. Assume that ρ = 0 in each subpopulation and that each subpopulation makes up half of the population. The sample size is taken to be 1800, where each individual is a female or a male with equal probability.
Note that under population stratification, the null hypothesis H0: pm = pf and ρ = 0 is generally not true. Thus, we use the population stratification model to study the powers of the proposed methods. The significance level is fixed at 5% and 10000 replications are simulated under each simulation setting. For LRT0b and LRT2b via parametric bootstrap, B is set to 1000. Finally, to compare the accuracy of the parameter estimates of the proposed EM algorithms with those in Zheng et al. [14] for each simulation setting, we use the RMSE and bias, where $\mathrm{RMSE} = \sqrt{\frac{1}{K}\sum_{i=1}^{K} (\hat{\beta}_i - \beta)^2}$ and $\mathrm{bias} = \frac{1}{K}\sum_{i=1}^{K} \hat{\beta}_i - \beta$, β is the parameter to be estimated, $\hat{\beta}_i$ is its estimate in the i-th replication and K is the number of replications.
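The two accuracy measures can be computed as in the sketch below, which assumes the usual definitions (bias as the mean estimate minus the true value, RMSE as the square root of the mean squared deviation from the true value). The function name and the four example estimates are made up for illustration.

```python
import math


def rmse_and_bias(estimates, true_value):
    """Return (RMSE, bias) of a list of replicate estimates of one parameter."""
    k = len(estimates)
    bias = sum(estimates) / k - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / k)
    return rmse, bias


# Four made-up estimates of a parameter whose true value is 0.30.
rmse, bias = rmse_and_bias([0.28, 0.31, 0.33, 0.30], 0.30)
# bias is about 0.005 and rmse is about 0.0187
```

An estimator can have a larger bias and still a smaller RMSE than a competitor if its variance is sufficiently smaller, since the squared RMSE equals the squared bias plus the variance.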

Results

Simulation results

Table 1 lists the simulated size of LRT0, LRT0b, LRT1, LRT2, LRT2b, Z0, Z1 and Z2 under H0 : pm = pf = p and ρ = 0 with N = 800 and 1200 and p = 0.3 and 0.5 for different values of r = Nm : Nf. According to the table, the size of LRT1, Z0, Z1 and Z2 is close to the nominal 5% level, while the size of LRT0 and LRT2 is too conservative. However, after the parametric bootstrap technique, LRT0b and LRT2b stay close to the nominal 5% level.

Table 1. Simulated size (in %) of LRT0, LRT0b, LRT1, LRT2, LRT2b, Z0, Z1 and Z2 under H0 : pm = pf = p and ρ = 0 with N = 800 and 1200 for different values of r and p.

https://doi.org/10.1371/journal.pone.0145032.t001

Fig 1 gives the simulated powers of the eight test statistics against r under H1 : pm ≠ pf and ρ > 0 for different values of ρ (0.05 and 0.1) and N (800 and 1200), with pm = 0.3 and pf = 0.35. The figure shows that LRT0b is more powerful than LRT0 and Z0, and that LRT0 and Z0 have similar power (Fig 1a-1d in the first row), regardless of the inbreeding coefficient ρ, the sample size N and the ratio r. LRT1 and Z1 have almost the same power (Fig 1e-1h in the second row). LRT2b has much more power than LRT2 and Z2, and LRT2 is a little less powerful than Z2 (Fig 1i-1l in the third row). The powers of LRT1 and Z1 are barely affected by r, while LRT0, LRT0b, Z0, LRT2b, LRT2 and Z2 become more powerful as the number of female individuals increases (r changing from 2:1 to 1:2) with the other parameters fixed. The powers of LRT0, LRT0b, Z0, LRT2, LRT2b and Z2 also respond strongly to ρ when N is fixed. Specifically, their powers under ρ = 0.1 (subplots in the second and fourth columns) are much larger than those under ρ = 0.05 (subplots in the first and third columns). However, the powers of LRT1 and Z1 are almost unaffected by ρ. Further, Fig 1 shows that LRT0, LRT0b and Z0 with two degrees of freedom (subplots in the first row) are much more powerful than LRT1, Z1, LRT2, LRT2b and Z2 with one degree of freedom (subplots in the second and third rows). This is because the true model has pm ≠ pf and ρ > 0. In addition, when the sample size changes from 800 (subplots in the first and second columns) to 1200 (subplots in the third and fourth columns), all the test statistics become much more powerful.

Fig 1. Simulated powers of LRT0, LRT0b, LRT1, LRT2, LRT2b, Z0, Z1 and Z2 against r = Nm : Nf under H1 : pm ≠ pf and ρ > 0 based on 10000 replicates with pm = 0.3 and pf = 0.35.

In the first column: ρ = 0.05 and N = 800; in the second column: ρ = 0.1 and N = 800; in the third column: ρ = 0.05 and N = 1200; in the fourth column: ρ = 0.1 and N = 1200. In the first row, the powers of LRT0, LRT0b and Z0 for H0 : pm = pf and ρ = 0; in the second row, the powers of LRT1 and Z1 for H01 : pm = pf; in the third row, the powers of LRT2, LRT2b and Z2 for H02 : ρ = 0.

https://doi.org/10.1371/journal.pone.0145032.g001

Fig 2 displays the simulated size/powers of the eight test statistics against r under H02 : ρ = 0 for different values of pf, with pm = 0.3 and N = 1200. The results in the third row of the figure are the size of LRT2, LRT2b and Z2, while those in the first and second rows are the powers of LRT0, LRT0b and Z0, and those of LRT1 and Z1, respectively. The figure shows that the size of LRT2b and Z2 stays close to the nominal 5% level, while LRT2 is too conservative. As for the tests for H01 : pm = pf, LRT1 and Z1 have almost the same simulated power, just as in Fig 1. On the other hand, the powers of LRT0, LRT0b, Z0, LRT1 and Z1 are barely affected by the ratio r. However, their powers are greatly influenced by the absolute difference |ϵ| = |pm − pf|. Specifically, their powers under pf = 0.25 and pf = 0.35 are much larger than those under pf = 0.26 and pf = 0.34. In addition, when the simulation setting is fixed, LRT1 and Z1 with one degree of freedom are a little more powerful than LRT0, LRT0b and Z0 with two degrees of freedom, because the true model has pm ≠ pf and ρ = 0. Comparing Fig 2d (ρ = 0), Fig 1c (ρ = 0.05) and Fig 1d (ρ = 0.1) under N = 1200, pm = 0.3 and pf = 0.35 shows that LRT0, LRT0b and Z0 become more powerful as ρ increases.

Fig 2. Simulated size/powers of LRT0, LRT0b, LRT1, LRT2, LRT2b, Z0, Z1 and Z2 against r = Nm : Nf under H02 : ρ = 0 based on 10000 replicates with pm = 0.3 and N = 1200.

In the first column: pf = 0.25; in the second column: pf = 0.26; in the third column: pf = 0.34; in the fourth column: pf = 0.35. In the first row, the powers of LRT0, LRT0b and Z0 for H0 : pm = pf and ρ = 0; in the second row, the powers of LRT1 and Z1 for H01 : pm = pf; in the third row, the size of LRT2, LRT2b and Z2 for H02 : ρ = 0.

https://doi.org/10.1371/journal.pone.0145032.g002

Figs A–G in S1 File show the corresponding results under other simulation settings with pm ≠ pf and ρ > 0, which are similar to those in Fig 1. Figs H and I in S1 File plot the corresponding results under pm = pf and ρ > 0, and Figs J–L in S1 File give the corresponding results under pm ≠ pf and ρ = 0. More details can be found in S1 File.

Table 2 shows the simulated size of LRT2, LRT2b and Z2 for H02 : ρ = 0 for different values of pf, with Nm = 0 and with the sample sizes Nf required for LRT2b to reach 80% simulated power. Table 3 lists the simulated powers under these sample sizes for different values of pf, with ρ = 0.05 and 0.1. Table 2 shows that the type I error rates of LRT2, LRT2b and Z2 are close to the nominal significance level of 5%. Table 3 shows that the power of LRT2b reaches about 80%, and that the difference in power between LRT2b and Z2 is about 10%.

Table 2. Simulated size (in %) of LRT2, LRT2b and Z2, having Nm = 0 and ρ = 0.

https://doi.org/10.1371/journal.pone.0145032.t002

Table 3. Simulated powers (in %) of LRT2, LRT2b and Z2, having Nm = 0.

https://doi.org/10.1371/journal.pone.0145032.t003

Tables A–J in S1 File list the RMSEs and biases of the estimates of pm, pf, the pooled allele frequency p and ρ for different values of pm, pf, ρ, r and N. It should be noted that the estimate of pm based on the EM algorithm is the same as that in Zheng et al. [14]. Further, the EM-based estimates of pf and p have RMSEs and biases similar to those of the corresponding estimates from Zheng et al. [14]. However, for the estimate of ρ, we find that although the biases of the two EM-based estimates of ρ (under H1 and under H01) are larger than that of the estimate in Zheng et al. [14] in some cases, their RMSEs are smaller in all the simulation settings.

Table 4 displays the simulated size/powers of LRT0, LRT0b, LRT1, LRT2, LRT2b, Z0, Z1 and Z2 under the population stratification model. When ϵ1 = ϵ2 = 0, the size of LRT1 and Z1 is obtained. Further, because the two subpopulations make up equal proportions of the whole population, ϵ1 = −ϵ2 also gives pm = pf in the pooled population, so the size of LRT1 and Z1 is obtained in this case as well. Under the other simulation settings, we get the powers of the eight test statistics. To investigate whether or not the population stratification model causes excess homozygosity, we save the 10000 estimates of ρ for each of the three estimation methods (the two EM-based estimates and the estimate from Zheng et al. [14]) and calculate the corresponding mean and standard deviation (SD), which are also listed in Table 4. The results show that the population stratification model indeed leads to a positive inbreeding coefficient (i.e., excess homozygosity), which is consistent with Overall and Nichols [24]. The mean values of the two EM-based estimates are a little larger than that of the estimate proposed in Zheng et al. [14], while the EM-based estimates have smaller standard deviations. On the other hand, the size of LRT1 and Z1 is close to the nominal significance level of 5%. The power of LRT0b is larger than that of LRT0 and Z0, and LRT0 and Z0 have similar powers, irrespective of the ϵ1 and ϵ2 values. LRT1 and Z1 have almost the same powers. LRT2b is much more powerful than LRT2 and Z2, and the power of LRT2 is a little smaller than that of Z2. If ϵ1 is fixed, the ρ estimate increases as ϵ2 increases, and hence LRT2, LRT2b and Z2 become more powerful; if ϵ2 is fixed, the ρ estimate decreases as ϵ1 increases, and hence LRT2, LRT2b and Z2 become less powerful. This may be because pm is taken to be 0.3 and 0.5 in the first and second subpopulations, respectively.
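A short calculation illustrates why pooling subpopulations with different allele frequencies produces apparent excess homozygosity even when ρ = 0 within each subpopulation. The sketch assumes Hardy-Weinberg proportions inside each subpopulation and expresses the pooled homozygote excess as an implied inbreeding coefficient; the function name `implied_inbreeding` is hypothetical.

```python
def implied_inbreeding(freqs, weights):
    """Inbreeding coefficient implied by pooling subpopulations in HWE.

    freqs: female frequency of M1 in each subpopulation.
    weights: proportion of each subpopulation (summing to 1).
    """
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    hom_m1 = sum(w * p * p for w, p in zip(weights, freqs))   # pooled P(M1M1)
    return (hom_m1 - p_bar ** 2) / (p_bar * (1.0 - p_bar))


# Female frequencies 0.3 and 0.5 (the setting with both epsilon values zero), mixed 1:1.
rho_mix = implied_inbreeding([0.3, 0.5], [0.5, 0.5])   # (0.17 - 0.16) / 0.24 = 1/24
```

For the equal mixture of 0.3 and 0.5 the implied value is 1/24, roughly 0.042, which is the size of effect the ρ = 0 tests are asked to detect in this setting.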

Table 4. Mean and standard deviation (SD) of ρ estimates over 10000 replications, and simulated size/powers (in %) of LRT0, LRT0b, LRT1, LRT2, LRT2b, Z0, Z1 and Z2 under population stratification model.

https://doi.org/10.1371/journal.pone.0145032.t004

Application to RA data

To study their practicability, we apply the proposed methods to the RA dataset from the North American Rheumatoid Arthritis Consortium, which is available from Genetic Analysis Workshop 15. This dataset contains 1217 families, and many individuals’ genotypes are missing. To obtain a sample in which all the individuals are independent, we only select the available founders in this dataset, which results in a sample of 369 founders (Nm = 112 and Nf = 257). For each founder, 293 SNP markers on the X chromosome are included in this application. The significance level is fixed at α = 5%. Table 5 gives the corresponding results based on the P-values of LRT0b, LRT1, LRT2b, Z0, Z1 and Z2. According to Table 5, LRT0b identifies 6 loci that Z0 does not identify, and Z0 identifies 4 additional loci. One locus is detected by LRT1 but not by Z1, and 4 additional loci are detected by Z1. There are 12 loci identified by LRT2b that cannot be identified by Z2, and only 2 additional loci are identified by Z2. However, there is a multiple testing problem because we simultaneously analyze 293 loci. So, the Bonferroni correction is used ($\alpha' = 0.05/293 = 1.71 \times 10^{-4}$), after which no result is statistically significant. More details can be found in Tables K–M in S1 File.

Table 5. LRT0b, LRT1, LRT2b, Z0, Z1 and Z2 results of application to rheumatoid arthritis data at 5% level.

https://doi.org/10.1371/journal.pone.0145032.t005

To investigate the computational efficiency of the XHWE software, we run the code with the default arguments on this dataset (1217 families and 293 SNPs) on an HP 2311f personal computer (Microsoft Windows 7 Enterprise (Service Pack 1), 4GB of RAM and 3.40 GHz Intel(R) Core(TM) i7 Duo processor) and record the computational time. This run takes 977 seconds. Therefore, on average, the running time for a single SNP is about 3.3 seconds. In a genome-wide setting with, for example, 200000 SNP markers on the X chromosome and a family sample of the type mentioned above, there would be 1600000 tests (eight per SNP), with a running time of about 7.6 days on a personal computer of this type.
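The extrapolation is simple arithmetic and can be checked as follows, assuming the running time scales linearly with the number of SNPs. The variable names are illustrative only.

```python
total_seconds = 977          # measured time for the RA dataset
n_snps = 293                 # SNPs analyzed in that run
n_statistics = 8             # test statistics computed per SNP

seconds_per_snp = total_seconds / n_snps            # about 3.33 s
genome_snps = 200_000
n_tests = genome_snps * n_statistics                # 1600000 tests
# Using the rounded figure of 3.3 s per SNP quoted in the text:
days = 3.3 * genome_snps / (24 * 60 * 60)           # about 7.6 days
```

Using the unrounded per-SNP time instead of 3.3 seconds gives a slightly longer estimate of about 7.7 days.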

Discussion

The existing Z1 and Z2 tests were respectively proposed to test for H01 : pm = pf and H02 : ρ = 0. However, no simulation study has been conducted to assess the validity and performance of Z1 and Z2 [14]. Further, there is no existing method to simultaneously test for H0 : pm = pf and ρ = 0. Therefore, in this article, we first combine these two test statistics and suggest Z0 = Z1 + Z2 to test for the equality of the frequencies of the same allele in males and females and the zero inbreeding coefficient on the X chromosome based on the collected sample, because Z1 and Z2 are independent of each other. Moreover, to improve the power of the test, we propose several LRT-type test statistics. First, we write out the likelihood functions under H0 : pm = pf and ρ = 0 and under H1 : pm ≠ pf or ρ > 0 at a single SNP locus on the X chromosome. Then, we obtain the MLEs of the male allele frequency, the female allele frequency and the inbreeding coefficient by the EM algorithms, using the RMSE and bias to assess the accuracy of these MLEs, and construct the corresponding likelihood ratio test (LRT0) statistic for the null hypothesis H0. If H0 is statistically rejected, we further develop two LRT-type test statistics, LRT1 and LRT2, for H01 : pm = pf and H02 : ρ = 0, respectively. The simulation results show that LRT0 and LRT2 are too conservative. So, we use parametric bootstrap techniques and propose the LRT0b and LRT2b test statistics. We simulate the data under different parameter settings. Simulation results show that the proposed bootstrap-based methods LRT0b and LRT2b, as well as LRT1, Z0 and the existing Z1 and Z2, control the type I error rates well under the respective null hypotheses. Power comparison demonstrates that LRT0b is more powerful than both LRT0 and Z0. Under ρ > 0, LRT2b has much more power than LRT2 and Z2, and LRT2 is a little less powerful than Z2. In addition, LRT1 and Z1 have almost the same power under pm ≠ pf.

As for the parameter estimates, the estimate of pm based on the EM algorithm is the same as that in Zheng et al. [14]. Further, the EM-based estimates of pf and of the pooled allele frequency p have RMSEs and biases similar to those of the corresponding estimates from Zheng et al. [14]. However, although the biases of the two EM-based estimates of ρ are larger than that of the estimate from Zheng et al. [14] in some cases, their RMSEs are smaller in all the simulation settings. In addition, the population stratification model indeed causes excess homozygosity, which is consistent with Overall and Nichols [24]. The mean values of these EM-based estimates are a little larger than that of the estimate proposed in Zheng et al. [14], while their standard deviations are smaller.

Note that the null and alternative hypotheses of the likelihood ratio tests LRT0 and LRT2 specify ρ = 0 and ρ > 0, respectively, so the null value of ρ lies on the boundary of the parameter space. Because of this “boundary” problem, the corresponding likelihood ratio statistic is not expected to follow a χ² distribution [31, 33]. This may be the reason why LRT0 and LRT2 are too conservative. Therefore, we use parametric bootstrap techniques to approximate the null distributions of LRT0 and LRT2 in this article.
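To make the bootstrap idea concrete, the following is a minimal sketch of a parametric bootstrap for a likelihood ratio test of ρ = 0 against ρ > 0 using female genotype counts only. It is an illustration under stated assumptions, not the XHWE implementation. It assumes the usual inbreeding parameterization of the genotype frequencies, uses the closed-form MLEs available for a single biallelic locus in females in place of the EM algorithm, and uses the (1 + count)/(B + 1) convention for the bootstrap p-value. The function names are hypothetical.

```python
import numpy as np

def genotype_probs(p, rho):
    """Female genotype probabilities (M1M1, M1M2, M2M2).
    Assumed model: allele frequency p and inbreeding coefficient rho."""
    q = 1.0 - p
    return np.array([p * p + rho * p * q, 2 * p * q * (1 - rho), q * q + rho * p * q])

def loglik(counts, p, rho):
    """Multinomial log-likelihood, skipping empty genotype classes."""
    probs = genotype_probs(p, rho)
    mask = counts > 0
    return float(np.sum(counts[mask] * np.log(probs[mask])))

def lrt_rho(counts):
    """LRT statistic for rho = 0 versus rho > 0 from female genotype counts."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = (2 * counts[0] + counts[1]) / (2 * n)   # allele frequency MLE
    q = 1.0 - p
    if p * q == 0:                              # monomorphic sample: no information on rho
        return 0.0
    # Closed-form MLE of rho, truncated at the boundary rho = 0.
    rho = max(0.0, 1.0 - counts[1] / (2 * n * p * q))
    return max(0.0, 2 * (loglik(counts, p, rho) - loglik(counts, p, 0.0)))

def bootstrap_pvalue(counts, n_boot=1000, seed=0):
    """Parametric bootstrap p-value: resample genotype counts under rho = 0."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n = int(counts.sum())
    p0 = (2 * counts[0] + counts[1]) / (2 * n)  # null estimate of the allele frequency
    observed = lrt_rho(counts)
    sims = rng.multinomial(n, genotype_probs(p0, 0.0), size=n_boot)
    boot = np.array([lrt_rho(s) for s in sims])
    pvalue = (1 + np.sum(boot >= observed)) / (n_boot + 1)
    return observed, float(pvalue)

# Hypothetical female sample with fewer heterozygotes than expected.
stat, pval = bootstrap_pvalue([60, 20, 20])
print(f"LRT = {stat:.2f}, bootstrap p-value = {pval:.4f}")
```

Because the estimate of ρ is truncated at zero, the statistic equals zero whenever the sample shows no heterozygote deficit. This point mass at zero is what makes a plain χ² reference distribution unsuitable, and it is why the reference distribution is built here by simulation.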

Because of X chromosome inactivation (XCI) and dosage compensation (DC), association analysis and excess homozygosity tests on X chromosome are more complicated than those on autosomes [34]. In the presence of XCI, only one allele from a pair of alleles in females is expressed [35]. Consequently, at a locus with two alleles M1 and M2, the effect of the M1 allele in males should be equivalent to the difference between M2M2 and M1M1 homozygous females. As such, when we conduct analyses based on allele-counting, we must either count each allele twice in males or, equivalently, count each allele in females as 0.5, reflecting a “dosage compensation” for X inactivation [34]. It should be noted that LRT2, LRT2b and Z2 for H02 : ρ = 0 are not affected by XCI and DC because they only use the female individuals in the collected sample. Similarly, Z1 for H01 : pm = pf is not influenced by XCI and DC because it estimates the allele frequencies and the corresponding variances in males and females separately. Thus, Z0 = Z1 + Z2 is still valid when XCI and DC exist. To investigate the effect of XCI and DC on LRT0, LRT0b, LRT1 and LRT1b, where LRT1b is the bootstrap version of LRT1, we carry out a simulation study under several simulation settings in the presence of XCI and DC. The simulation settings and simulation results are listed in Table 6. The table shows that the sizes of LRT2b, Z0, Z1 and Z2 stay close to the nominal 5% level and that LRT2 is still conservative. However, LRT0 and LRT1 without bootstrap cannot control the size well. Fortunately, the type I error rates of LRT0b and LRT1b with bootstrap are very close to 5%. Furthermore, LRT0b is more powerful than Z0 in almost all the cases, and LRT1b and Z1 have almost the same power. Therefore, in the presence of XCI and DC, LRT0b, Z1 and LRT2b are recommended. Finally, LRT0b and LRT2b can deal with samples of small size. However, they are based on parametric bootstrap techniques, which are more computationally intensive.
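The two allele-counting conventions mentioned above (counting each male allele twice, or counting each female allele as 0.5) can be checked to give the same pooled allele frequency. The sketch below only illustrates that equivalence and is not a reconstruction of how the simulations were coded. The function names and the example counts are hypothetical.

```python
def pooled_freq_males_doubled(male_m1, n_males, f11, f12, n_females):
    """Pooled M1 frequency with each male allele counted twice
    and each female allele counted once."""
    m1_count = 2 * male_m1 + 2 * f11 + f12
    total = 2 * n_males + 2 * n_females
    return m1_count / total

def pooled_freq_females_halved(male_m1, n_males, f11, f12, n_females):
    """Pooled M1 frequency with each male allele counted once
    and each female allele counted as 0.5."""
    m1_count = male_m1 + 0.5 * (2 * f11 + f12)
    total = n_males + 0.5 * (2 * n_females)
    return m1_count / total

# Hypothetical sample: 100 males (30 carrying M1) and 100 females
# (10 M1M1 homozygotes, 40 M1M2 heterozygotes).
a = pooled_freq_males_doubled(30, 100, 10, 40, 100)
b = pooled_freq_females_halved(30, 100, 10, 40, 100)
print(a, b)
```

The second convention is the first with both the numerator and the denominator halved, so the two frequencies agree for any sample.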

Table 6. Simulated size/powers (in %) of LRT0, LRT0b, LRT1, LRT1b, LRT2, LRT2b, Z0, Z1 and Z2 based on 10000 Monte Carlo replications and 1000 bootstrap replications under X chromosome inactivation and dosage compensation, having pm = 0.3 and the ratio Nm : Nf = 1 : 1.

https://doi.org/10.1371/journal.pone.0145032.t006

Supporting Information

S1 File. Supporting Information.

Tables A–J, root mean squared errors (RMSE) and biases of estimates of pm, pf and ρ based on the EM algorithm and Zheng et al. [14] under different simulation settings. Tables K–M, LRT0, LRT0b, Z0, LRT1, Z1, LRT2, LRT2b, and Z2 results of application to rheumatoid arthritis data, respectively. Figs A–L, simulated size/powers of LRT0, LRT0b, LRT1, LRT2, LRT2b, Z0, Z1 and Z2 against r = Nm : Nf based on 10000 replicates under different simulation settings.

https://doi.org/10.1371/journal.pone.0145032.s001

(PDF)

Author Contributions

Conceived and designed the experiments: XPY JYZ. Performed the experiments: XPY JYZ. Analyzed the data: XPY QLZ JLL. Contributed reagents/materials/analysis tools: XPY QLZ JLL JYZ. Wrote the paper: XPY JYZ.

References

  1. Horvath S, Xu X, Laird NM. The family based association test method: strategies for studying general genotype-phenotype associations. Eur J Hum Genet. 2001; 9: 301–306. pmid:11313775
  2. Lenart BA, Neviaser AS, Lyman S, Chang CC, Edobor-Osula F, Steele B, et al. Association of low-energy femoral fractures with prolonged bisphosphonate use: a case control study. Osteoporosis Int. 2009; 20: 1353–1362.
  3. Reich DE, Goldstein DB. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol. 2001; 20: 4–16. pmid:11119293
  4. Samani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M, Mayer B, et al. Genome wide association analysis of coronary artery disease. N Engl J Med. 2007; 357: 443–453. pmid:17634449
  5. Voight BF, Scott LJ, Steinthorsdottir V, Morris AP, Dina C, Welch RP, et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet. 2010; 42: 579–589. pmid:20581827
  6. Wang M, Lin S. FamLBL: detecting rare haplotype disease association based on common SNPs using case-parent triads. Bioinformatics. 2014; 30: 2611–2618. pmid:24849576
  7. Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet. 2008; 40: 575–583. pmid:18391952
  8. Zhang H, Wheeler W, Wang Z, Taylor PR, Yu K. A fast and powerful tree-based association test for detecting complex joint effects in case-control studies. Bioinformatics. 2014; 30: 2171–2178. pmid:24794927
  9. Schaid DJ, Jacobsen SJ. Biased tests of association: comparisons of allele frequencies when departing from Hardy-Weinberg proportions. Am J Epidemiol. 1999; 149: 706–711. pmid:10206619
  10. Chung RH, Morris RW, Zhang L, Li YJ, Martin RR. X-APL: an improved family-based test of association in the presence of linkage for the X chromosome. Am J Hum Genet. 2007; 80: 59–68. pmid:17160894
  11. Ding J, Lin S, Liu Y. Monte Carlo pedigree disequilibrium test for markers on the X chromosome. Am J Hum Genet. 2006; 79: 567–573. pmid:16909396
  12. Horvath S, Laird NM, Knapp M. The transmission/disequilibrium test and parental-genotype reconstruction for X-chromosomal markers. Am J Hum Genet. 2000; 66: 1161–1167. pmid:10712229
  13. Zhang L, Martin ER, Chung RH, Li YJ, Morris RW. X-LRT: a likelihood approach to estimate genetic risks and test association with X-linked markers using a case-parents design. Genet Epidemiol. 2008; 32: 370–380. pmid:18278816
  14. Zheng G, Joo J, Zhang C, Geller NL. Testing association for markers on the X chromosome. Genet Epidemiol. 2007; 31: 834–843. pmid:17549761
  15. Clayton D. Testing for association on the X chromosome. Biostat. 2008; 9: 593–600.
  16. Hickey PF, Bahlo M. X chromosome association testing in genome wide association studies. Genet Epidemiol. 2011; 35: 664–670. pmid:21818774
  17. Chen Z, Ng HKT, Li J, Liu Q, Huang H. Detecting associated single-nucleotide polymorphisms on the X chromosome in case control genome-wide association studies. Stat Methods Med Res. 2014;
  18. Li CC. First Course in Population Genetics. Boxwood Press; 1976.
  19. Gillespie JH. Population Genetics: A Concise Guide. 1st ed. Johns Hopkins University Press; 2010.
  20. König IR, Loley C, Erdmann J, Ziegler A. How to include chromosome X in your genome-wide association study. Genet Epidemiol. 2014; 38: 97–103.
  21. Chang D, Gao F, Slavney A, Ma L, Waldman YY, Sams AJ, et al. Accounting for eXentricities: analysis of the X chromosome in GWAS reveals X-linked genes implicated in autoimmune diseases. PLoS ONE. 2014; 9: e113684. pmid:25479423
  22. Wright S. Systems of mating. Genetics. 1921; 6: 111–178. pmid:17245958
  23. Nei M. Molecular evolutionary genetics. Columbia University Press, New York; 1987.
  24. Overall ADJ, Nichols RA. A method for distinguishing consanguinity and population substructure using multilocus genotype data. Mol Biol Evol. 2001; 18: 2048–2056. pmid:11606701
  25. Brookfield JFY. A simple new method for estimating null allele frequency from heterozygote deficiency. Mol Ecol. 1996; 5: 453–455. pmid:8688964
  26. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B. 1977; 39: 1–38.
  27. Mao WG, He HQ, Xu Y, Chen PY, Zhou JY. Powerful haplotype-based Hardy-Weinberg equilibrium tests for tightly linked loci. PLoS ONE. 2013; 8: e77399. pmid:24167573
  28. Emigh TH. A comparison of tests for Hardy-Weinberg equilibrium. Biometrics. 1980; 36: 627–642. pmid:25856832
  29. Kuk AYC, Zhang H, Yang Y. Computationally feasible estimation of haplotype frequencies from pooled DNA with and without Hardy-Weinberg equilibrium. Bioinformatics. 2009; 25: 379–386. pmid:19050036
  30. Gartler SM. A brief history of dosage compensation. J Genet. 2014; 93: 591–595. pmid:25189265
  31. Weir BS, Cockerham CC. Complete Characterization of Disequilibrium at Two Loci. Mathematical Evolutionary Theory, Princeton University Press; 1989.
  32. Self SG, Liang KY. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Stat Assoc. 1987; 82: 605–610.
  33. Zhang DW, Lin XH. Hypothesis testing in semiparametric additive mixed models. Biostat. 2003; 4: 57–74.
  34. Bielawski JP, Yang Z. Maximum Likelihood Methods for Detecting Adaptive Protein Evolution. Statistical Methods in Molecular Evolution. Springer; 2005.
  35. Clayton DG. Sex chromosomes and genetic association studies. Genome Med. 2009; 1: 110. pmid:19939292
  36. Amos-Landgraf JM, Cottle A, Plenge RM, Friez M, Schwartz CE, Longshore J, et al. X chromosome-inactivation patterns in 1,005 phenotypically unaffected females. Am J Hum Genet. 2006; 79: 493–499. pmid:16909387