Bayesian Linkage Analysis of Categorical Traits for Arbitrary Pedigree Designs

Abra Brisbin; Myrna M. Weissman; Abby J. Fyer; Steven P. Hamilton; James A. Knowles; Carlos D. Bustamante; Jason G. Mezey

doi:10.1371/journal.pone.0012307

Abstract

Background

Pedigree studies of complex heritable diseases often feature nominal or ordinal phenotypic measurements and missing genetic marker or phenotype data.

Methodology

We have developed a Bayesian method for Linkage analysis of Ordinal and Categorical traits (LOCate) that can analyze complex genealogical structure for family groups and incorporate missing data. LOCate uses a Gibbs sampling approach to assess linkage, incorporating a simulated tempering algorithm for fast mixing. While our treatment is Bayesian, we develop a LOD (log of odds) score estimator for assessing linkage from Gibbs sampling that is highly accurate for simulated data. LOCate is applicable to linkage analysis for ordinal or nominal traits, a versatility which we demonstrate by analyzing simulated data with a nominal trait, on which LOCate outperforms LOT, an existing method which is designed for ordinal traits. We additionally demonstrate our method's versatility by analyzing a candidate locus (D2S1788) for panic disorder in humans, in a dataset with a large amount of missing data, which LOT was unable to handle.

Conclusion

LOCate's accuracy and applicability to both ordinal and nominal traits will prove useful to researchers interested in mapping loci for categorical traits.

Citation: Brisbin A, Weissman MM, Fyer AJ, Hamilton SP, Knowles JA, Bustamante CD, et al. (2010) Bayesian Linkage Analysis of Categorical Traits for Arbitrary Pedigree Designs. PLoS ONE 5(8): e12307. https://doi.org/10.1371/journal.pone.0012307

Editor: Katharina Domschke, University of Muenster, Germany

Received: January 20, 2010; Accepted: July 27, 2010; Published: August 26, 2010

Copyright: © 2010 Brisbin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This study was supported in part by National Institutes of Health (NIH), www.nih.gov, grant R01 GM083606 and National Institute of Mental Health (NIMH), www.nimh.nih.gov, grants MH28274 (to MMW), MH37592 (to AJF), MH30906 (to DFK and AJF), and MH48858 (to SEH). Genotyping services were provided to JAK by the Center for Inherited Disease Research (CIDR). CIDR is fully funded through a federal contract from the NIH to Johns Hopkins University, contract number N01-HG-65403. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: In the past 3 years, Dr. Weissman received investigator-initiated grants from GlaxoSmithKline, Eli Lilly & Co., and the Josiah Macy Foundation. These are no longer active and ended in 2007. She currently receives research funding from the National Institute of Mental Health (NIMH), the National Institute on Drug Abuse (NIDA), the National Alliance for Research on Schizophrenia and Depression (NARSAD), the Sackler Foundation, and the Interstitial Cystitis Association, and she receives royalties from the Oxford University Press, Perseus Press, the American Psychiatric Association Press, and MultiHealth Systems. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.

Introduction

Many heritable traits, from pathogen resistance in plants [1] to panic disorder in humans [2], are described using discrete categories such as color or are quantified using discrete, ordered scales such as “mildly,” “moderately,” or “severely” affected. When performing linkage analysis of categorical traits, it is well appreciated that re-coding measurements as binary can lead to decreased power [3], [4]. Recoding measurements as continuous can lead to the same problem. The most widely applied linkage analysis software, such as Superlink [5], Merlin [6], Genehunter [7], and LOKI [8], does not employ categorical trait models and is therefore not the most appropriate choice for analyzing categorical disease traits.

Most previous work done on family-based mapping of categorical traits has been restricted to particular types of pedigrees; these include backcross [9]–[11] and F2 designs [10]–[13], 4-way experimental crosses [1], [14]–[16], and sets of independent nuclear families [17]–[19]. Recent methods by Zhang et al. [20], Dupuis et al. [21], and Diao and Lin [22] allow linkage analysis for ordinal traits on arbitrary pedigrees. To date, there is no Bayesian framework for both ordinal and nominal linkage analysis on pedigrees with inbreeding loops and missing data.

In this paper, we develop a Bayesian statistical framework for linkage analysis of a categorical trait with a user-specified penetrance function of arbitrary form. We implement this framework in the software LOCate (Linkage for Ordinal and Categorical traits). Our method can analyze an ordinal or nominal trait with any number of categories, can handle missing genotype and phenotype data, and can analyze pedigrees with inbreeding loops. We demonstrate the versatility of our method's user-specified penetrance function through analysis of simulated pedigrees with a nominal trait, and find that our method outperforms LOT [20], the method of Zhang et al., which is designed for ordinal traits. We further demonstrate the versatility of our method by reanalyzing a study of panic disorder in humans previously analyzed as a binary trait [2], in which many individuals have missing phenotypes. After we cut some of the pedigrees for memory considerations, our method was able to analyze the data and find evidence to reject a particular trichotomous penetrance model, while LOT was unable to handle the large amount of missing data in this study.

Methods

In our linkage analysis framework, we seek the probability of a pedigree conditional on $\theta$, the recombination rate between a single marker locus and the unknown disease locus:

$$P(X \mid \theta) = \sum_{Y} P(X, Y \mid \theta),$$

where the observed data X consists of individuals' phenotypes and unphased marker genotypes, and the unobserved data Y consists of all individuals' disease locus and phased marker genotypes, as well as any unobserved phenotypes and unphased marker genotypes. As the number of individuals in the family increases, the sum over all possible genotype assignments Y can grow unwieldy. Instead of considering all possible values of Y, Gibbs sampling is used to randomly explore the space of genotype configurations, emphasizing those configurations Y which have the highest values of $P(X, Y \mid \theta)$, and therefore contribute the most to the summation. Below, we describe the model, demonstrate the use of simulated tempering to improve the mixing of the Gibbs sampler, and introduce a novel estimator for the likelihood of the data from Gibbs sampling.
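As a minimal illustration of a likelihood obtained by summing over unobserved configurations, consider the textbook case of one doubly heterozygous parent of unknown phase with n scored offspring, where the only unobserved quantity is the parental phase. This toy model is not the LOCate likelihood, and the function names are hypothetical; the sketch only shows how the observed-data likelihood arises as a sum over the unobserved data, here with two terms rather than one term per genotype assignment in a pedigree.

```python
import math


def phase_known_likelihood(k, n, theta):
    """Likelihood of k recombinants among n meioses for a fixed parental phase.

    The binomial coefficient is omitted because it cancels in the LOD score.
    """
    return theta ** k * (1.0 - theta) ** (n - k)


def phase_unknown_likelihood(k, n, theta):
    """Sum over the two equally likely parental phases (the unobserved data).

    Offspring counted as recombinant under one phase are non-recombinant
    under the other, so the second term swaps k and n - k.
    """
    return (0.5 * phase_known_likelihood(k, n, theta)
            + 0.5 * phase_known_likelihood(n - k, n, theta))


def lod(k, n, theta):
    """LOD score: log10 likelihood ratio against free recombination (0.5)."""
    return math.log10(phase_unknown_likelihood(k, n, theta)
                      / phase_unknown_likelihood(k, n, 0.5))


if __name__ == "__main__":
    for theta in (0.01, 0.1, 0.2, 0.3, 0.5):
        print(theta, round(lod(1, 8, theta), 3))
```

With one unobserved binary variable the sum has two terms; each additional unobserved variable multiplies the number of terms, which is why exhaustive summation becomes impractical for larger pedigrees.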

The model

Figure 1 shows the graphical model for our Gibbs sampler. Following this model, the joint probability of the observed data (X) and unobserved data (Y), conditional on the recombination rate , is as follows:(1)where are the disease alleles individual i received from its father and mother; are the marker alleles i received from its father and mother; and are “selector” variables that tell whether i received the grandpaternal or grandmaternal allele from each parent at the disease locus and the marker, respectively; is i's observed, unphased marker genotype; is i's phenotype; and penetrance refers to the matrix of used to model the disease. HWE refers to the genotype frequencies assuming the founders are drawn from a population under Hardy-Weinberg Equilibrium.

Figure 1. The graphical model for the Gibbs sampler.

All variables shown here are involved in updating the information for individual i. Filled-in variables are typically observed, and held constant throughout the run of the sampler. marker alleles that i received from its father and mother. disease locus alleles that i received from its father and mother. , marker and disease locus alleles that individual i passed to its jth offspring. (Only one offspring is shown for illustration.) individual i's phenotype. Selector variable: tells whether i's paternal marker allele comes from its paternal grandfather or grandmother. 's unphased marker genotype. , marker genotype vectors of i's mother and father. If i is a founder, replace by a constant node describing the population allele frequencies. Penetrances = matrix of the probabilities of each phenotype, conditional on disease genotype; held constant.

https://doi.org/10.1371/journal.pone.0012307.g001

We derived a Gibbs sampler to sample genotype configurations Y in proportion to the probability in Equation 1. In our Bayesian implementation, we used a uniform prior on the marker genotypes of individuals with missing data. We also assigned prior probability 1/2 to each value of every selector variable, which assumes unbiased inheritance, e.g., no meiotic drive. With the availability of additional information, it would be straightforward to change these priors. The penetrance parameters, which describe the probability of each phenotype category conditional on each disease locus genotype, are assumed to have a point prior, that is, to be fixed. It would also be possible to implement a random exploration of penetrance parameters and recombination rate ($\theta$) values within the Gibbs sampler; however, this would greatly increase the size of the sample space. Therefore, to maximize computational efficiency, we used a grid of values for $\theta$ in the current implementation.
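The two fixed ingredients for founders, Hardy-Weinberg genotype frequencies and a penetrance matrix, can be sketched as follows. This is an illustrative sketch only: the function names and the example penetrance values are hypothetical and are not taken from the paper's tables, and a diallelic disease locus is assumed.

```python
def hwe_genotype_freqs(q):
    """Founder disease-genotype frequencies under Hardy-Weinberg equilibrium.

    q is the disease-allele frequency; the genotypes are ordered by the
    number of disease alleles carried (0, 1, 2).
    """
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def phenotype_marginals(q, penetrance):
    """Population frequency of each phenotype category.

    penetrance[g][c] is P(phenotype category c | disease genotype g);
    every row must sum to 1.
    """
    freqs = hwe_genotype_freqs(q)
    n_categories = len(penetrance[0])
    return [sum(freqs[g] * penetrance[g][c] for g in range(3))
            for c in range(n_categories)]


# Hypothetical trichotomous penetrance matrix, for illustration only.
EXAMPLE_PENETRANCE = [
    [0.90, 0.08, 0.02],  # 0 disease alleles
    [0.30, 0.50, 0.20],  # 1 disease allele
    [0.05, 0.25, 0.70],  # 2 disease alleles
]

if __name__ == "__main__":
    print(hwe_genotype_freqs(0.25))
    print(phenotype_marginals(0.25, EXAMPLE_PENETRANCE))
```

Because the penetrance matrix is held fixed, it is an input to the sampler rather than something the sampler updates.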

The Gibbs sampler updates each set of variables conditional on its Markov blanket [23]. The equations for the updates are given below and in Text S1. For example, individual i's marker alleles and selectors , , , are updated by a draw from the distribution(2)where indicates the vector of marker alleles held by i's father in the current iteration.

Here,(3)In setting if is unobserved, we assume that this individual's genotype had probability 1 of being unobserved, independent of the individual's true phased genotype. If another model for gene dropouts were available, it could be employed here.

Also,where the mutation rate depends on the current “temperature” of simulated tempering (see below). The calculation of for each of i's offspring is analogous to this. If individual i's parents are not included in the pedigree, then i is a founder, and is replaced by , where m is the number of distinct marker alleles.

Improving the Speed of the Method

Slow mixing is a chronic problem in Gibbs samplers for linkage analysis [24], [25]. This can result in inadequate exploration of the sample space and excessively long times to reach the stationary distribution. Even more of a concern is the fact that in cases with missing marker data and more than two possible marker alleles, the Markov chain may be reducible, rendering portions of the sample space inaccessible from a given starting point [25], [26].

To ameliorate this problem, we implemented simulated tempering [27], [28] in our Gibbs sampling algorithm. In simulated tempering, the Markov chain is run at several different “temperatures,” ranging from the coldest, at which the chain's stationary distribution is the desired probability distribution, to the hottest, at which the chain's distribution is very “relaxed,” or smoothed, to increase the chance of the chain traversing regions of low probability density to reach different modes of the distribution. The most common way of relaxing the probability distribution is to raise the distribution to a power; however, this method is ineffective when some states to be traversed have probability zero, because zero raised to any positive power remains zero. Geyer and Thompson [27] used an alternative approach to simulated tempering in their investigation of carrier status for cystic fibrosis in a large pedigree of Hutterites. Instead of raising the distribution to a power, they varied the disease penetrances across temperatures. We extended their approach to a more general parameter relaxation, in which each temperature features its own penetrances, recombination rate, mutation rate, and disease-allele frequency (see Text S1). This greatly improved the mixing (Figure S5) and time to stationarity (Figure S6) of our Gibbs sampler.
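One way to picture parameter relaxation is a ladder of penetrance matrices that move from the target values toward a uniform distribution as the temperature rises. The sketch below uses linear interpolation, which is purely an assumption for illustration: the actual relaxation schedule is described in Text S1 and is not reproduced here, and the function name is hypothetical.

```python
def relaxed_penetrance(penetrance, level, n_levels):
    """Penetrance matrix at one rung of a hypothetical temperature ladder.

    level 0 is the cold chain (the target penetrances); level n_levels - 1
    is the hottest chain, where every phenotype category is equally likely
    regardless of genotype. Linear interpolation is an assumed schedule.
    """
    if n_levels < 2 or not 0 <= level < n_levels:
        raise ValueError("need n_levels >= 2 and 0 <= level < n_levels")
    weight = level / (n_levels - 1)
    n_categories = len(penetrance[0])
    return [[(1.0 - weight) * p + weight / n_categories for p in row]
            for row in penetrance]


if __name__ == "__main__":
    # A complete-penetrance model has zeros that powering cannot relax.
    complete = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for level in range(4):
        print(level, relaxed_penetrance(complete, level, 4))
```

Unlike raising probabilities to a power, this kind of relaxation turns zero penetrances into positive values at the hotter levels, so configurations that are impossible under the target model can be visited there.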

Estimating the LOD Curve

While results of an analysis using our framework may be interpreted entirely from a Bayesian perspective by assuming a prior over the grid values of the recombination rate $\theta$, we wished to provide a log of odds (LOD) score for convenient linkage assessment. Likelihood-based parameter inference from Markov chain Monte Carlo is prone to sampling bias [26], [29]. To avoid this bias, we developed a linear regression-based estimator (LinReg) which takes advantage of the relation

$$P(X \mid \theta) = \frac{P(X, Y \mid \theta)}{P(Y \mid X, \theta)}.$$

The numerator can be computed exactly (Equation 1). We estimate the denominator by the proportion of iterations which visit each configuration Y. The LinReg estimator of $P(X \mid \theta)$ is the slope of the best-fit line (with intercept 0) through a plot of $P(X, Y \mid \theta)$ vs $P(Y \mid X, \theta)$, as shown in Figure 2.
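The idea behind the regression estimator can be sketched in a few lines: the joint probability of the observed and unobserved data equals the observed-data likelihood times the posterior probability of the configuration, so a zero-intercept line fitted through the (visit proportion, joint probability) points has the likelihood as its slope. The sketch assumes an ordinary least-squares fit, and the function names and toy numbers are hypothetical.

```python
def slope_through_origin(x, y):
    """Least-squares slope b of y = b * x with zero intercept."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0.0:
        raise ValueError("all x values are zero")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


def estimate_likelihood(joint_prob, visit_counts):
    """Estimate the observed-data likelihood from sampler output.

    joint_prob maps each visited configuration to its exactly computed
    joint probability with the observed data; visit_counts maps it to the
    number of sampler iterations spent in that configuration.
    """
    total = sum(visit_counts.values())
    configs = [c for c, count in visit_counts.items() if count > 0]
    visit_proportion = [visit_counts[c] / total for c in configs]
    joint = [joint_prob[c] for c in configs]
    return slope_through_origin(visit_proportion, joint)


if __name__ == "__main__":
    # Toy example: three configurations whose joint probabilities sum to 0.04.
    joint_prob = {"a": 0.02, "b": 0.01, "c": 0.01}
    visit_counts = {"a": 500, "b": 250, "c": 250}
    print(estimate_likelihood(joint_prob, visit_counts))
```

In the toy example the visit proportions match the posterior exactly, so the slope recovers the sum of the joint probabilities; with real sampler output the points scatter around the line.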

Figure 2. The Linear Regression estimator of $P(X \mid \theta)$.

X = observed data, Y = unobserved data. $P(X, Y \mid \theta)$ is calculated using Equation 1; $P(Y \mid X, \theta)$ is estimated by the proportion of iterations which visit configuration Y, given the observed genotypes X. The slope of the regression line (red) is an estimate of $P(X \mid \theta)$.

https://doi.org/10.1371/journal.pone.0012307.g002

Simulations

We assessed the performance of our method using two sets of simulated data. First, we tested the accuracy of LOD score estimation for single, small simulated pedigrees. Since any errors that occur in the analysis of one pedigree will be multiplied when multiple pedigrees are aggregated in a typical linkage analysis study, it is important that our method perform accurately when only a small amount of data is available. The simulated pedigrees included from 4 to 18 individuals; some examples are shown in Figure S1. These included pedigrees with missing genotype data and with inbreeding loops. For each pedigree, we simulated either a recessive binary trait with and , or a complete-penetrance codominant trichotomous trait (). We computed the LOD scores for these pedigrees using the slightly misspecified disease penetrances in Table 1. We compared our estimated LOD scores to the theoretical LOD scores using the true penetrances, as well as to the LOD scores obtained by treating the trichotomous trait as binary (in Superlink [5]) or continuous (in Merlin [6] and SOLAR [30]). Parameter settings used for these programs are given in Text S1.

Table 1. Penetrance models used in our small-family simulations.

https://doi.org/10.1371/journal.pone.0012307.t001

For our second set of simulations, we assessed the ability of our method to detect linkage in cases where the pedigree(s) may be reasonably broken into a large number of small family groups or where the study includes a large number of small families. For these simulations, we considered linkage studies of 100 families, each family consisting of 2 parents and 2 offspring. We simulated a trichotomous trait with penetrances as given in Table 2 (Model A). The trait locus was either tightly linked () or unlinked () to the observed marker locus. Both the disease locus and marker locus were simulated to be diallelic, with the marker allele frequencies = .5 and the disease allele frequency = .25. We required that each simulated family be informative for linkage (at least one parent heterozygous at the marker) and exhibit at least 2 levels of the phenotype among its 4 members. We simulated 100 such studies, and examined the power vs. type I error of our method and that of LOT [20]. Because LOCate requires an estimate of the penetrances as input, we tested our method with a range of penetrances (Table 2, Models A, B, C).
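The two acceptance criteria for a simulated family can be written as a small filter. This is a sketch under stated assumptions: the function names and the data representation (allele pairs as tuples, phenotype categories as integers) are hypothetical and not taken from the LOCate software.

```python
def is_heterozygous(genotype):
    """True if the two alleles of a marker genotype differ."""
    first, second = genotype
    return first != second


def family_passes_filters(parent_marker_genotypes, phenotypes):
    """Apply the two criteria required of each simulated nuclear family.

    parent_marker_genotypes holds the two parents' marker genotypes as
    (allele, allele) tuples; phenotypes holds the phenotype category of
    each of the four family members.
    """
    informative = any(is_heterozygous(g) for g in parent_marker_genotypes)
    varied = len(set(phenotypes)) >= 2
    return informative and varied


if __name__ == "__main__":
    print(family_passes_filters([(1, 2), (1, 1)], [0, 0, 1, 2]))
    print(family_passes_filters([(1, 1), (2, 2)], [0, 0, 1, 2]))
```

A simulation loop would draw families repeatedly and keep only those for which the filter returns True, until 100 families have been collected for a study.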

Table 2. Penetrance models used to analyze simulated linkage studies.

https://doi.org/10.1371/journal.pone.0012307.t002

Application to Data

Panic disorder is a common illness in humans, characterized by periods of intense anxiety. Because individuals exhibit varying degrees of symptoms of panic disorder, this psychiatric illness is a natural choice for analysis as an ordinal trait. We used LOCate to perform trichotomous linkage analysis on the panic disorder data set of Fyer et al. [2]. This dataset consists of 1591 individuals in 120 pedigrees, classified into six categories: definitely affected by panic disorder, probably affected, possibly affected, any symptoms of panic, unaffected, or unknown. The dataset has missing data among both phenotypes and microsatellite marker genotypes. Fyer et al. analyzed these data using ANALYZE [31] and MLINK [32], [33] using the binary penetrance model shown in Table 3, and found a two-point HLOD(.2) = 3.20 at marker D2S1788, with HLOD computed as

$$\mathrm{HLOD}(\theta) = \max_{\alpha} \sum_{i} \log_{10}\left[\alpha \, 10^{\mathrm{LOD}_i(\theta)} + (1 - \alpha)\right]$$

over a grid of parameter values, where $\mathrm{LOD}_i(\theta)$ is the LOD score of family i at recombination rate $\theta$ and $\alpha$ is the proportion of linked families. We reanalyzed marker D2S1788 using LOCate, under the same binary penetrance model and under four trichotomous variations of this model (Table 3 and Table S1).
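For readers unfamiliar with heterogeneity LOD scores, the sketch below evaluates the standard admixture form of the HLOD at a single recombination rate: each family contributes the log of a mixture of its linked and unlinked likelihood ratios, and the linked proportion (alpha) is maximized over a grid. Whether this matches the exact computation in ANALYZE or MLINK is an assumption; the function name and the grid size are hypothetical.

```python
import math


def hlod(family_lods, n_grid=101):
    """Heterogeneity LOD at a fixed recombination rate.

    family_lods holds one LOD score per family at that recombination rate.
    Returns (best_hlod, best_alpha), maximizing over an evenly spaced grid
    of alpha values in [0, 1]. alpha = 0 always yields an HLOD of 0.
    """
    best_hlod, best_alpha = 0.0, 0.0
    for step in range(n_grid):
        alpha = step / (n_grid - 1)
        total = sum(math.log10(alpha * 10.0 ** z + (1.0 - alpha))
                    for z in family_lods)
        if total > best_hlod:
            best_hlod, best_alpha = total, alpha
    return best_hlod, best_alpha


if __name__ == "__main__":
    # Hypothetical per-family LOD scores at one recombination rate.
    print(hlod([2.0, -1.0, 0.5, -0.3]))
```

When some families show negative LOD scores, an intermediate alpha can give a larger total than simply summing the family LOD scores, which is the purpose of the heterogeneity model.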

Table 3. Penetrance models used in our analysis of Panic Disorder data.

https://doi.org/10.1371/journal.pone.0012307.t003

In each of the variations, we used a low (.01 or .1) phenocopy rate, similar to the .01 rate used in Fyer et al. We varied from (.5,.5) in model A, matching Fyer et al., to (.05,.05) in model D, to represent a disease which is much more penetrant when individuals with “any symptoms” are included as affected.

Seven pedigrees had no observed phenotypes, due to having been collected for a different phase of the Fyer et al. study. Nine additional pedigrees had some observed genotypes, but were uninformative due to lack of variation in the observed phenotypes or genotypes. These pedigrees were dropped from our analysis, leaving 1332 individuals in 104 families. Of these, 35 families, ranging in size from 4 to 10 individuals, could be analyzed in LOCate on 1.7 GB-memory instances on the Amazon cloud [34]. The remaining 69 pedigrees, ranging in size from 9 to 34 individuals, would have required more than 1.7 GB of memory. We split these pedigrees into nuclear families for analysis, discarding subpedigrees which had no variation in observed phenotype or marker alleles or fewer than 2 individuals with observed genotypes, and discarding individuals without offspring which had neither observed phenotype nor genotype. After cutting, the dataset consisted of 167 pedigrees and subpedigrees, comprising 858 unique individuals.

Using LOCate, we first analyzed a reduced set of 96 subfamilies to compare 4 trichotomous penetrance models (Table S1), and then re-analyzed the full set of pedigrees using the best-fitting penetrance model (Table 3). Using multiple penetrance models is a form of multiple testing, so we must increase the LOD score threshold used to declare significance. A Bonferroni correction gives the adjusted threshold as , where n is the number of penetrance models; in this case, the threshold is . Other, less conservative approaches to correction would also be possible, such as Rom's correction [35] or determining empirical p-values by permuting phenotypes [36].
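One common way to express a Bonferroni-style correction on the LOD scale is sketched below. It is an assumption that this is the form intended here; the single-test threshold $Z_0$ is left symbolic because its value is not stated in the text above.

```latex
% Under the null hypothesis, a likelihood ratio R satisfies
% P(R >= c) <= 1/c, so a LOD threshold Z_0 bounds the
% single-test error rate by 10^{-Z_0}.
\[
  P\left(\mathrm{LOD} \ge Z_0\right) \le 10^{-Z_0}
\]
% Requiring the same bound after testing n penetrance models
% divides the per-test bound by n:
\[
  10^{-Z_n} = \frac{10^{-Z_0}}{n}
  \quad\Longrightarrow\quad
  Z_n = Z_0 + \log_{10} n
\]
```

Under this form the threshold grows only logarithmically with the number of models, so doubling the number of penetrance models raises it by about 0.3 LOD units.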

We also attempted to analyze the cut pedigrees using LOT, but found that LOT froze during this analysis. Test analyses with simulated phenotypes on the same pedigree structures revealed that this was due to the large proportion (32%) of individuals with unobserved phenotypes.