Using Network Methodology to Infer Population Substructure

Dmitry Prokopenko; Julian Hecker; Edwin Silverman; Markus M. Nöthen; Matthias Schmid; Christoph Lange; Heide Loehlein Fier

doi:10.1371/journal.pone.0130708

Abstract

One of the main caveats of association studies is the possible affection by bias due to population stratification. Existing methods rely on model-based approaches like structure and ADMIXTURE or on principal component analysis like EIGENSTRAT. Here we provide a novel visualization technique and describe the problem of population substructure from a graph-theoretical point of view. We group the sequenced individuals into triads, which depict the relational structure, on the basis of a predefined pairwise similarity measure. We then merge the triads into a network and apply community detection algorithms in order to identify homogeneous subgroups or communities, which can further be incorporated as covariates into logistic regression. We apply our method to populations from different continents in the 1000 Genomes Project and evaluate the type 1 error based on the empirical p-values. The application to 1000 Genomes data suggests that the network approach provides a very fine resolution of the underlying ancestral population structure. Besides we show in simulations, that in the presence of discrete population structures, our developed approach maintains the type 1 error more precisely than existing approaches.

Citation: Prokopenko D, Hecker J, Silverman E, Nöthen MM, Schmid M, Lange C, et al. (2015) Using Network Methodology to Infer Population Substructure. PLoS ONE 10(6): e0130708. https://doi.org/10.1371/journal.pone.0130708

Editor: Dana C. Crawford, Case Western Reserve University, UNITED STATES

Received: January 15, 2015; Accepted: May 22, 2015; Published: June 22, 2015

Copyright: © 2015 Prokopenko et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: All 1000 Genomes data are available from http://www.1000genomes.org/.

Funding: The project described was supported by Cure Alzheimer's fund, Award Number (R01MH081862, R01MH087590) from the National Institute of Mental Health and Award Number (U01HL089856, U01HL089897) from the National Heart, Lung, and Blood Institute. The research of C. Lange was supported by the National Research Foundation of Korea Grant, funded by the Korean Government (NRF-2014S1A2A2028559).

Competing interests: The authors have declared that no competing interests exist.

Introduction

Within the last decade, genome wide association studies (GWAS) have shown to be a powerful analytical tool in association mapping to identify common variants (i.e. variants with minor allele frequencies of more than 1%) that contribute to the heritability of complex diseases[1]. The GWAS methodology thereby utilizes the underlying, population specific linkage-disequilibrium structure of common variants and thus only a limited number of variants needs to be genotyped in order to capture the common variation within a population[2,3].

Although a large number of GWAS for various complex diseases have been conducted up to date, for most complex diseases, the proportion of the estimated variance that can be explained by common variation is rather low [4–6]. Besides alternative factors that are likely to contribute to the estimated heritability of a trait like gene-gene interactions, variants with low frequencies, environmental factors, etc., it has also been suggested that so far undetected common variants with low penetration rates could further increase the proportion of estimated genetic variance for complex diseases, although it is widely agreed that for most complex diseases, those low-penetration rate common variants that have not been detected so far are likely to show smaller effect sizes compared to the known common risk variants[3,7,8].

Park et al. [9] show on the basis of reported GWAS findings for height, Crohn’s disease, and breast, prostate and colorectal (BPC) cancers, that there are likely to be additional common variants with low penetration rates and rather small effect sizes that could still explain more than 15–20% of the estimated heritability of the traits. In order to detect those common variants with low penetration rates however, large sample sizes for GWAS are required to ensure sufficient power. Technological advances in genotyping technologies along with a rapid cost decline for genotyping services and the establishment of large research consortia have paved the way for those large-scale GWAS. Some recent examples of GWAS with large sample sizes that have identified new susceptibility loci in common variation include association studies on breast cancer risk [10], coronary artery disease [11,12], or ovarian cancer [13].

In most of the cases, those large-scale GWAS have a population-based instead of a family-based design, since the population based design has a greater power to detect causative loci[14] and the sample collection for population based designs is much easier, especially for late on set diseases [15]. One pitfall of population-based association studies in contrast to family-based association studies is however, that they bear the risk of population stratification. Population stratification refers to a non-homogeneous distribution of alleles among subjects in the sample due to ancestral differences and can bias the true association structure between the trait and the sampled genotypes[16]. It is thereby likely that the risk of population stratification for a genetic association study increases with its sample size, since the larger the sample size, the more difficult it becomes to ensure a homogenous ancestral background of the sampled individuals [17]. Even in a relatively homogenous study design the issue of population structure might affect the results [18,19].

There exist several methodological approaches on how to correct for population stratification in genetic association studies. One is the Genomic Control [20], which is based on the chi-square scores of the association tests. Genomic Control shows how inflated the calculated scores are, compared to the null distribution. It thereby assumes a constant inflation rate over the tested region, which might not always be the case, especially for large genomic regions. Next, there exist a group of model-based approaches, which estimate the population structure based on the observed genotypes. Two examples of those model-based approaches include STRUCTURE [21] and ADMIXTURE [22]. Both methods allow for fractional subpopulation memberships, but ADMIXTURE is more efficient in terms of runtime. Finally, another stream in the literature suggests to first calculate the genetic covariance matrix between studied individuals and then extract the principal components, which are further used as continuous covariates in association analysis [23–25]. There are several extensions of the PCA approach. Some of them are justified by the fact that using only continuous covariates might be not enough[26–28]. They try to extend the method by using also discrete membership covariates, obtained from clustering approaches. Other methods include mixed models with fixed and random effects to model the outcome variable[29,30]. Here the population structure is modeled as a random effect. It is important to mention that combining mixed models with principal component covariates can increase the quality of correction for population structure[31,32].

Rosenberg and Nordborg [33] showed that the occurrence of false positive associations is a serious problem in mixtures of discrete subpopulations, whereas in an admixed population it depends on the variance of admixture between individuals. When the variance is small, spurious associations are less severe.

In this manuscript we use network methodology to infer and to visualize population structure in genotype data. Normally, visual inspection of GWAS data is performed by creating a PCA or MDS plot. Here, we present a different way to visualize the ancestral population structure and possible batch artifacts. The intuition behind the method is well suited for a dataset, consisting of several discrete subpopulations, hence in this setting the method performs particularly well. The identified subpopulations can be subsequently included as discrete covariates into association analysis. We apply our method to different populations in the 1000 Genomes Project and evaluate our type 1 error compared to standard methodology in simulation studies. All calculations and simulations are implemented in R with the help of igraph package and Eigensoft[34,24].

Methods

Here we present an algorithm, how to build a social graph between individuals in a genotypic dataset. The main idea of the method is based on social network methodology. In order to identify closely related individuals in a given population, we group them into triads based on a predefined similarity measure, and then combine those triads into a graph for further analysis.

In terms of social network theory, triads can be thought of as the smallest social community, they consist of three individuals and describe the pairwise relations among themselves [35]. Following the existing theory on triads, those three individuals can be unconnected (empty triad), can be partially connected (one-edge triad, two-path triad), or all three individuals can be connected to each other (triangle triad).Since we aim to identify individuals that are closely linked to each other on the basis of their genetic profile, we utilize the triangle-structure of transitive triads to create the triplets, and thus define that for three individuals (A, B, C), not only A and B, and B and C, but also A and C have to be linked to each other based on their genetic profile.

We will further outline how we derive the social graph for a given set of individuals based on their genetic profile, and then describe how this graph can be partitioned into communities which can be utilized as covariates in association analysis.

Given n individuals and m biallelic markers we display them in a genotype matrix GT:

(1)

For the triad construction we need a similarity measure between individuals in order to identify their degree of "relatedness". This can be a simple Euclidian distance between genotype vectors [36] or a variance-covariance matrix between individuals [23,24]. For the sake of simplicity, we further outline our approach on the basis of the variance-covariance matrix, although it is important to note, that in principle any other similarity or distance matrix could be used.

Let D represent the genetic covariance matrix, calculated as described in Price et al. [24]. It is an nxn symmetric matrix and D_ij is the covariance between individuals i and j.

A triad in terms of graph theory can be defined as an undirected graph G = (V,E), where V is the vertex set (set of individuals) and E is the edge set (structure of connections):

(2)

For every individual i, we aim to find two other individuals j and k, so that the covariances between the three individuals are maximized in a stepwise manner. In order to simplify this pairwise optimization problem, we rank the other individuals based on their pairwise covariance with individual i as shown in Table 1. Those individuals that show the highest covariance with individual i are assigned the highest ranks, and those that show the smallest covariance are assigned the lowest ranks.

Download:

Table 1. Ranking procedure for individual i.

https://doi.org/10.1371/journal.pone.0130708.t001

Step 1

We first want to find the closest subject j to i. In order to do this we minimize the term (3). (3) Here rank⁽ⁱ⁾[j] represents the relative position of subject j for subject i and rank^(j)[i] represents the relative position of subject i for subject j based on the covariance matrix D. Such "two-sided" optimization allows us to account for two-sided "closeness" from i to j and from j to i. This gives us the advantage to create more robust pairs in terms of relatedness.

Step 2

We now want to find the third subject k in the triad. This subject should be closely related to both already included individuals i and j. In order to fulfill this requirement we again use the term (3), but for both dyads: {i,k} and {j,k}:

(4)

In such a way we make sure that the individual k is similar to both individuals in the dyad {i,j}.

After creating triadic groups for all n subjects, we merge the retrieved triads in a graph that will contain n vertices and 3*n edges and that can be further partitioned into communities by using community-detection algorithms.

Community detection

After a first look on the created graph, one can already determine robust groups by identifying unconnected components of the graph. Those components, as we show in our simulation studies, already present a valid partition into groups, especially with data, coming from discrete populations. In order to increase the resolution of clustering into groups within those unconnected components we further outline how communities within the network can be detected.

The community structure of a network or graph can be seen as groups of nodes which are densely connected within their community and sparsely connected between groups[37]. The simplest community detection analysis on a graph would be identifying its unconnected components. More sophisticated community-detection algorithms aim to measure the modularity, which shows the quality of a particular division of the network in communities [37]. It is defined as: (5) where e is a kxk matrix, whose entries e_ij represent the fraction of connections between communities i and j, and , k is the number of communities.

The modularity is calculated in such a way that it compares the fraction of connections within communities in a graph with the expected value of this fraction, given random connections, but the same community division. A value of 1 would represent a perfect community grouping.

We use the Louvain algorithm from Blondel et al. [38] which is computationally fast and doesn't require a predefined number of communities to be specified in advance. Initially, each node is assigned to its own community. Then, at each step a node is replaced to a neighboring community by maximizing the modularity gain with this replacement. The process stops if there is no further gain possible or there is only one community left. For typical and sparse data the computational time is linear and depends on the number of nodes. Blondel et al. applied the algorithm to a network with 118 million nodes and the computational time was 152 minutes. In case of genotype data the computational time of the community detection algorithm breaks down to milliseconds. For example for the European dataset of 378 individuals it took 0.02 seconds to calculate the communities, given a precalculated covariance matrix. The computational complexity is driven by calculation of the covariance matrix between the genotypes of individuals and is quadratic on the number of individuals.

Results

Data description and application

We applied our method to the phase 1 data of the 1000 Genomes Project [39]. The 1000 Genomes project offers a unique platform for sequence data analysis since the project provides data for more than 1000 sequenced genomes from various populations. We divided the whole dataset into 4 subsamples, according to the continental membership: Europeans, Africans, Americans, Asians. We applied minor allele frequency filtering with a threshold of 5% and a linkage disequilibrium pruning step. During this step we also excluded long-range LD regions [40]. After data cleaning we were left with 4 datasets, described in Table 2. We then applied our method to every dataset in order to detect subpopulations.

Download:

Table 2. Description of datasets, used in the analysis.

https://doi.org/10.1371/journal.pone.0130708.t002

We created scale-free plots of the constructed graphs in Figs 1–4. Every node represents an individual and the color of the node is related to the actual population label of this individual. For purposes of visualization we do not show the edges in the plots. In order to visualize the communities we constructed polygons around the nodes, which are assigned to this community. Every polygon represents 1 community. Full community assignment and the precision we described in Tables 3–6. One can clearly see an almost perfect separation of subpopulations in Africans and Americans. The separation in Asian subpopulations is slightly worse, there are some admixed communities, consisting of Han Chinese from Beijing and from the South. In the European subpopulations we see that the Finnish and Toscanian communities are very homogeneous. The 5 small heterogeneous communities mostly contain individuals from Utah and Great Britain, but this is expected,because the Utah residents from this dataset are known to have a high degree of shared ancestry with the British. For completeness we also included S1–S4 Tables which represent precision for unconnected components of the graphs. One can already determine a good separation for the subpopulations by looking only at the unconnected components.

Download:

Fig 1. 3 American subpopulations.

The polygons around the nodes represent the detected communities. The node colors represent the actual labels.

https://doi.org/10.1371/journal.pone.0130708.g001

Download:

Fig 2. 3 African subpopulations.

The polygons around the nodes represent the detected communities. The node colors represent the actual labels.

https://doi.org/10.1371/journal.pone.0130708.g002

Download:

Fig 3. 3 Asian subpopulations.

The polygons around the nodes represent the detected communities. The node colors represent the actual labels.

https://doi.org/10.1371/journal.pone.0130708.g003

Download:

Fig 4. 5 European subpopulations.

The polygons around the nodes represent the detected communities. The node colors represent the actual labels.

https://doi.org/10.1371/journal.pone.0130708.g004

Download:

Table 3. Contingency table for American subpopulations, rows correspond to detected communities, columns to actual subpopulations.

https://doi.org/10.1371/journal.pone.0130708.t003

Download:

Table 4. Contingency table for African subpopulations, rows correspond to detected communities, columns to actual subpopulations.

https://doi.org/10.1371/journal.pone.0130708.t004

Download:

Table 5. Contingency table for Asian subpopulations, rows correspond to detected communities, columns to actual subpopulations.

https://doi.org/10.1371/journal.pone.0130708.t005

Download:

Table 6. Contingency table for European subpopulations, rows correspond to detected communities, columns to actual subpopulations.

https://doi.org/10.1371/journal.pone.0130708.t006

Evaluation via simulated association studies

In order to evaluate the performance of our method for population stratification correction in association analysis, we conducted a simulation study. We compared our method to a principal component analysis approach (EIGENSTRAT [24,25]) and to a model-based approach (ADMIXTURE[22]). The simulation design followed the one described in Price et al. [24] and used in Alexander et al. [22] and is outlined below. For association analysis we used a logistic regression model with additional covariates (5), which are included to correct for population structure.

(6)

Here Y is the phenotype vector, X is the genotype and E are the additional covariates. For EIGENSTRAT, we ran the analysis with 1, 2 or 10 principal components. For ADMIXTURE, we used 1 or 2 entries of their ancestry estimate. For our method we created discrete dummy variables representing community membership and plugged them into the logistic regression. We also included discrete dummy variables, which represent unconnected sub-graphs in our network. This was done in order to evaluate how well the constructed graph already represents population structure.

We considered 4 population structure scenarios suggested by the literature [24,27]:

2 underlying discrete subpopulations with moderate differences between cases and controls;
2 underlying discrete subpopulations with extreme differences between cases and controls;
3 underlying discrete subpopulations;
admixed population.

For the discrete settings we simulated 1000 individuals (500 cases and 500 controls), coming from 2 or 3 populations with differentiation of Fst = 0.01. In scenario 1 60% of cases were from population 1 and 40% from population 2. In scenario 2 with more extreme mismatching cases had equal proportions between populations, whereas 100% of controls were from population 2. In scenario 3 we took the following proportions for cases and controls: 45% of cases and 35% of controls were from population 1, 35% and 20% were from population 2, 20% and 45% were from population 3. For the admixed setting we sampled individuals with ancestry proportions a and (1-a) from the 2 populations, where a is uniformly distributed. The ancestral risk was set to 3, and the probability of disease was set to 0.5 * log(r) * r^a / (r − 1) as described in Price et al. [24]. For every setting we generated 100.000 random training SNPs, which were used to infer population substructure. For association testing 1 million testing SNPs in each of the 3 categories were generated:

Random SNPs with no disease association (same setting as for the training SNPs generation)
Highly differentiated SNPs with no disease association: The minor allele frequency for SNPs, coming from population 1 and 2, was set as 0.8 and 0.2 respectively.
Causal SNPs: Multiplicative risk model with a relative risk of 1.5.

The results are presented below in Table 7. The values are averaged across 10 independent simulation runs.We included also a 'naive' setting, which is the false positive rate (or true positive for causal SNPs) for a logistic regression without covariates. For all three discrete settings one can see that already the inclusion of the unconnected sub-graphs as covariates in a logistic regression performs similar to logistic regression with EIGENSTRAT and ADMIXTURE covariates. A further separation of the graph into communities and the inclusion them as covariates into a logistic regression improves the results in the discrete scenarios. Especially when the underlying population structure consists of three discrete populations with moderate stratification in the dataset, it can be clearly seen that our approach achieves higher power than standard approaches, while maintaining the alpha error more precisely.

Download:

Table 7. Average proportions of significant SNPs in the simulation study.

https://doi.org/10.1371/journal.pone.0130708.t007

We expected that our method would not provide advantages over the existing methodology in the admixed scenario, but we included this scenario for completeness.As we see from Table 7 the proposed network-based approaches perform similarly to EIGENSTRAT and ADMIXTURE for random and causal SNPs from an admixed population. Only for the special admixed scenario in which the approaches are applied to differentiated SNPs, the network-based approaches are not able to maintain the significance level. However, this simulation scenario can be considered as not realistic for application.

Discussion

In this communication, we have proposed a new approach for population structure inference, which is based on network methodology. It is straight-forward to think of individuals in a population based study as of nodes in a big network, where the edges represent the strength of their relationship, based on a given similarity measure of their genotypes (i.e. covariance). Hence, population structure can be identified by detecting network communities. We present a heuristic method on how to generate a network structure between individuals based on their genotypic information and in a second step apply community detection algorithms in order to identify subpopulations. The method is computationally fast and flexible. To test its performance, we applied it to different subpopulations from the 1000 Genomes Project and the method was able to provide a very fine resolution of the population structure. We were not only able to separate the subpopulations within continents, but also we identified homogeneous subgroups within those subpopulations. Many of them represent individuals, which were sequenced on different platforms, whereas some of them might represent other artifacts or show small communities, which might include individuals with a stronger connection to each other.It is also important to mention that already the unconnected components of the constructed graphs show a valid separation of the individual subpopulations. Hence, applying community detection algorithms would provide a finer level of homogeneous subgroups within those unconnected components.

We further conducted a simulation study and compared the performance of our method to EIGENSTRAT and ADMIXTURE. For association testing, we used logistic regression with covariates. For every method we included the corresponding population-specific covariates: principal components for EIGENSTRAT, ancestry estimates for ADMIXTURE, and binary covariates representing the community memberships for our method.The results suggest that our method corrects for population structure more effectively when individuals come from several discrete subpopulations.

Given our findings, we have shown that network based approaches bear a great potential for subpopulation detection. The integration of network based methodology is intuitive and is suitable for large sample sizes. Due to its ability to group individuals into discrete communities without any prior information, we enable more flexible ways to analyze the data, e.g. conducting association analyses separately in the identified subsamples.The visualization of different levels of discrete communities (i.e. unconnected components or detected communities) is straightforward and the interpretation of the discrete groups is natural. Especially in the case, when little is known about the underlying structure of the data, one should use our assumption free network-based method. The high research dynamic in the field of network community detection algorithms will most likely further provide interesting applications for the identification of ancestry based genetic heterogeneity in populations. However, more research has to be done on how the information on the identified network communities can more efficiently be included in regression models.

Supporting Information

S1 Table. Contingency table for American subpopulations, rows correspond to unconnected components, columns to actual subpopulations.

PUR—Puerto Rican, CLM—Colombian, MXL—Mexican.

https://doi.org/10.1371/journal.pone.0130708.s001

(DOCX)

S2 Table. Contingency table for African subpopulations, rows correspond to unconnected components, columns to actual subpopulations.

YRI—Yoruba in Nigeria, LWK—Luhya in Kenia, ASW—African ancestry in Southwest US.

https://doi.org/10.1371/journal.pone.0130708.s002

(DOCX)

S3 Table. Contingency table for Asian subpopulations, rows correspond to unconnected components, columns to actual subpopulations.

CHS—Southern Han Chinese, CHB—Han Chinese in Beijing, JPT—Japanese in Tokyo.

https://doi.org/10.1371/journal.pone.0130708.s003

(DOCX)

S4 Table. Contingency table for European subpopulations, rows correspond to unconnected components, columns to actual subpopulations.

GBR—British in England and Scotland, FIN—Finnish, IBS—Iberian in Spain, CEU—Utah residents with Northern and Western European ancestry, TSI—Toscani in Italy.

https://doi.org/10.1371/journal.pone.0130708.s004

(DOCX)

Author Contributions

Conceived and designed the experiments: HLF DP CL ES MMN MS. Performed the experiments: DP JH. Analyzed the data: DP JH HLF. Wrote the paper: DP CL HLF.

References

1. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012; 90: 7–24. pmid:22243964
- View Article
- PubMed/NCBI
- Google Scholar
2. de Bakker PIW, Yelensky R, Pe’er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and power in genetic association studies. Nat Genet. 2005; 37: 1217–1223. pmid:16244653
- View Article
- PubMed/NCBI
- Google Scholar
3. Goldstein DB. Common genetic variation and human traits. N Engl J Med. 2009; 360: 1696–1698. pmid:19369660
- View Article
- PubMed/NCBI
- Google Scholar
4. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009; 461: 747–753. pmid:19812666
- View Article
- PubMed/NCBI
- Google Scholar
5. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010; 11: 446–450. pmid:20479774
- View Article
- PubMed/NCBI
- Google Scholar
6. Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet. 2011; 88: 294–305. pmid:21376301
- View Article
- PubMed/NCBI
- Google Scholar
7. Park J-H, Gail MH, Weinberg CR, Carroll RJ, Chung CC, Wang Z, et al. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci U S A. 2011; 108: 18026–18031. pmid:22003128
- View Article
- PubMed/NCBI
- Google Scholar
8. So H-C, Li M, Sham PC. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genet Epidemiol. 2011; 35: 447–456. pmid:21618601
- View Article
- PubMed/NCBI
- Google Scholar
9. Park J-H, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet. 2010; 42: 570–575. pmid:20562874
- View Article
- PubMed/NCBI
- Google Scholar
10. Michailidou K, Hall P, Gonzalez-Neira A, Ghoussaini M, Dennis J, Milne RL, et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat Genet. 2013; 45: 353–61, 361e1–2. pmid:23535729
- View Article
- PubMed/NCBI
- Google Scholar
11. CARDIoGRAMplusC4D Consortium, Deloukas P, Kanoni S, Willenborg C, Farrall M, Assimes TL, et al. Large-scale association analysis identifies new risk loci for coronary artery disease. Nat Genet. 2013; 45: 25–33. pmid:23202125
- View Article
- PubMed/NCBI
- Google Scholar
12. Schunkert H, König IR, Kathiresan S, Reilly MP, Assimes TL, Holm H, et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat Genet. 2011; 43: 333–338. pmid:21378990
- View Article
- PubMed/NCBI
- Google Scholar
13. Pharoah PDP, Tsai Y-Y, Ramus SJ, Phelan CM, Goode EL, Lawrenson K, et al. GWAS meta-analysis and replication identifies three new susceptibility loci for ovarian cancer. Nat Genet. 2013; 45: 362–70, 370e1–2. pmid:23535730
- View Article
- PubMed/NCBI
- Google Scholar
14. Laird NM, Lange C. Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet. 2006; 7: 385–394. pmid:16619052
- View Article
- PubMed/NCBI
- Google Scholar
15. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008; 9: 356–369. pmid:18398418
- View Article
- PubMed/NCBI
- Google Scholar
16. Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat Genet. 2004; 36: 512–517. pmid:15052271
- View Article
- PubMed/NCBI
- Google Scholar
17. Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, et al. Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004; 36: 388–393. pmid:15052270
- View Article
- PubMed/NCBI
- Google Scholar
18. Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, et al. Demonstrating stratification in a European American population. Nat Genet. 2005; 37: 868–872. pmid:16041375
- View Article
- PubMed/NCBI
- Google Scholar
19. Helgason A, Yngvadóttir B, Hrafnkelsson B, Gulcher J, Stefánsson K. An Icelandic example of the impact of population structure on association studies. Nat Genet. 2005; 37: 90–95. pmid:15608637
- View Article
- PubMed/NCBI
- Google Scholar
20. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999; 55: 997–1004. pmid:11315092
- View Article
- PubMed/NCBI
- Google Scholar
21. Pritchard JK, Stephens M, Donnelly P. Inference of Population Structure Using Multilocus Genotype Data. Genetics. 2000; 155: 945–959. pmid:10835412
- View Article
- PubMed/NCBI
- Google Scholar
22. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009; 19: 1655–1664. pmid:19648217
- View Article
- PubMed/NCBI
- Google Scholar
23. Zhu X, Zhang S, Zhao H, Cooper RS. Association Mapping, Using a Mixture Model for Complex Traits. Genet Epidemiol. 2002; 196: 181–196.
- View Article
- Google Scholar
24. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006; 38: 904–909. pmid:16862161
- View Article
- PubMed/NCBI
- Google Scholar
25. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006; 2: e190. pmid:17194218
- View Article
- PubMed/NCBI
- Google Scholar
26. Liu L, Zhang D, Liu H, Arendt C. Robust methods for population stratification in genome wide association studies. BMC Bioinformatics. 2013; 14: 132. pmid:23601181
- View Article
- PubMed/NCBI
- Google Scholar
27. Li Q, Yu K. Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008; 32: 215–226. pmid:18161052
- View Article
- PubMed/NCBI
- Google Scholar
28. Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM. A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet. 2007; 81: 895–905. pmid:17924333
- View Article
- PubMed/NCBI
- Google Scholar
29. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010; 42: 348–354. pmid:20208533
- View Article
- PubMed/NCBI
- Google Scholar
30. Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, Gore MA, et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 2010; 42: 355–360. pmid:20208535
- View Article
- PubMed/NCBI
- Google Scholar
31. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010; 11: 459–463. pmid:20548291
- View Article
- PubMed/NCBI
- Google Scholar
32. Sul JH, Eskin E. Mixed models can correct for population structure for genomic regions under selection. Nat Rev Genet. 2013; 14: 300. pmid:23438871
- View Article
- PubMed/NCBI
- Google Scholar
33. Rosenberg N, Nordborg M. A general population-genetic model for the production by population structure of spurious genotype-phenotype associations in discrete, admixed or spatially distributed populations. Genetics. 2006; 173: 1665–1678. pmid:16582435
- View Article
- PubMed/NCBI
- Google Scholar
34. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. Complex Systems 1695. 2006. Available: http://igraph.org.
35. Hanneman RA, Riddle M. Introduction to social network methods. Riverside, CA: University of California, Riverside. Available: http://faculty.ucr.edu/~hanneman.
36. Schneider S, Roessli D, Excoffier L. arlequin V2.0: a Software for Population Genetic Analysis. Genetics and Biometry Laboratory, University of Geneva, Geneva; 2000.
37. Newman M, Girvan M. Finding and evaluating community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2004; 69: 026113. pmid:14995526
- View Article
- PubMed/NCBI
- Google Scholar
38. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008; P10008.
39. 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491: 56–65 pmid:23128226
- View Article
- PubMed/NCBI
- Google Scholar
40. Price AL, Weale ME, Patterson N, Myers SR, Need AC, Shianna KV, et al. Long-range LD can confound genome scans in admixed populations. Am J Hum Genet. 2008; 83: 132–5; author reply 135–9. pmid:18606306
- View Article
- PubMed/NCBI
- Google Scholar

[ref1] 1. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012; 90: 7–24. pmid:22243964
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. de Bakker PIW, Yelensky R, Pe’er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and power in genetic association studies. Nat Genet. 2005; 37: 1217–1223. pmid:16244653
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Goldstein DB. Common genetic variation and human traits. N Engl J Med. 2009; 360: 1696–1698. pmid:19369660
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009; 461: 747–753. pmid:19812666
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010; 11: 446–450. pmid:20479774
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet. 2011; 88: 294–305. pmid:21376301
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Park J-H, Gail MH, Weinberg CR, Carroll RJ, Chung CC, Wang Z, et al. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci U S A. 2011; 108: 18026–18031. pmid:22003128
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. So H-C, Li M, Sham PC. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genet Epidemiol. 2011; 35: 447–456. pmid:21618601
View Article
PubMed/NCBI
Google Scholar

[30] View Article

[31] PubMed/NCBI

[32] Google Scholar

[ref9] 9. Park J-H, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet. 2010; 42: 570–575. pmid:20562874
View Article
PubMed/NCBI
Google Scholar

[34] View Article

[35] PubMed/NCBI

[36] Google Scholar

[ref10] 10. Michailidou K, Hall P, Gonzalez-Neira A, Ghoussaini M, Dennis J, Milne RL, et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat Genet. 2013; 45: 353–61, 361e1–2. pmid:23535729
View Article
PubMed/NCBI
Google Scholar

[38] View Article

[39] PubMed/NCBI

[40] Google Scholar

[ref11] 11. CARDIoGRAMplusC4D Consortium, Deloukas P, Kanoni S, Willenborg C, Farrall M, Assimes TL, et al. Large-scale association analysis identifies new risk loci for coronary artery disease. Nat Genet. 2013; 45: 25–33. pmid:23202125
View Article
PubMed/NCBI
Google Scholar

[42] View Article

[43] PubMed/NCBI

[44] Google Scholar

[ref12] 12. Schunkert H, König IR, Kathiresan S, Reilly MP, Assimes TL, Holm H, et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat Genet. 2011; 43: 333–338. pmid:21378990
View Article
PubMed/NCBI
Google Scholar

[46] View Article

[47] PubMed/NCBI

[48] Google Scholar

[ref13] 13. Pharoah PDP, Tsai Y-Y, Ramus SJ, Phelan CM, Goode EL, Lawrenson K, et al. GWAS meta-analysis and replication identifies three new susceptibility loci for ovarian cancer. Nat Genet. 2013; 45: 362–70, 370e1–2. pmid:23535730
View Article
PubMed/NCBI
Google Scholar

[50] View Article

[51] PubMed/NCBI

[52] Google Scholar

[ref14] 14. Laird NM, Lange C. Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet. 2006; 7: 385–394. pmid:16619052
View Article
PubMed/NCBI
Google Scholar

[54] View Article

[55] PubMed/NCBI

[56] Google Scholar

[ref15] 15. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008; 9: 356–369. pmid:18398418
View Article
PubMed/NCBI
Google Scholar

[58] View Article

[59] PubMed/NCBI

[60] Google Scholar

[ref16] 16. Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat Genet. 2004; 36: 512–517. pmid:15052271
View Article
PubMed/NCBI
Google Scholar

[62] View Article

[63] PubMed/NCBI

[64] Google Scholar

[ref17] 17. Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, et al. Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004; 36: 388–393. pmid:15052270
View Article
PubMed/NCBI
Google Scholar

[66] View Article

[67] PubMed/NCBI

[68] Google Scholar

[ref18] 18. Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, et al. Demonstrating stratification in a European American population. Nat Genet. 2005; 37: 868–872. pmid:16041375
View Article
PubMed/NCBI
Google Scholar

[70] View Article

[71] PubMed/NCBI

[72] Google Scholar

[ref19] 19. Helgason A, Yngvadóttir B, Hrafnkelsson B, Gulcher J, Stefánsson K. An Icelandic example of the impact of population structure on association studies. Nat Genet. 2005; 37: 90–95. pmid:15608637
View Article
PubMed/NCBI
Google Scholar

[74] View Article

[75] PubMed/NCBI

[76] Google Scholar

[ref20] 20. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999; 55: 997–1004. pmid:11315092
View Article
PubMed/NCBI
Google Scholar

[78] View Article

[79] PubMed/NCBI

[80] Google Scholar

[ref21] 21. Pritchard JK, Stephens M, Donnelly P. Inference of Population Structure Using Multilocus Genotype Data. Genetics. 2000; 155: 945–959. pmid:10835412
View Article
PubMed/NCBI
Google Scholar

[82] View Article

[83] PubMed/NCBI

[84] Google Scholar

[ref22] 22. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009; 19: 1655–1664. pmid:19648217
View Article
PubMed/NCBI
Google Scholar

[86] View Article

[87] PubMed/NCBI

[88] Google Scholar

[ref23] 23. Zhu X, Zhang S, Zhao H, Cooper RS. Association Mapping, Using a Mixture Model for Complex Traits. Genet Epidemiol. 2002; 196: 181–196.
View Article
Google Scholar

[90] View Article

[91] Google Scholar

[ref24] 24. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006; 38: 904–909. pmid:16862161
View Article
PubMed/NCBI
Google Scholar

[93] View Article

[94] PubMed/NCBI

[95] Google Scholar

[ref25] 25. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006; 2: e190. pmid:17194218
View Article
PubMed/NCBI
Google Scholar

[97] View Article

[98] PubMed/NCBI

[99] Google Scholar

[ref26] 26. Liu L, Zhang D, Liu H, Arendt C. Robust methods for population stratification in genome wide association studies. BMC Bioinformatics. 2013; 14: 132. pmid:23601181
View Article
PubMed/NCBI
Google Scholar

[101] View Article

[102] PubMed/NCBI

[103] Google Scholar

[ref27] 27. Li Q, Yu K. Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008; 32: 215–226. pmid:18161052
View Article
PubMed/NCBI
Google Scholar

[105] View Article

[106] PubMed/NCBI

[107] Google Scholar

[ref28] 28. Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM. A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet. 2007; 81: 895–905. pmid:17924333
View Article
PubMed/NCBI
Google Scholar

[109] View Article

[110] PubMed/NCBI

[111] Google Scholar

[ref29] 29. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010; 42: 348–354. pmid:20208533
View Article
PubMed/NCBI
Google Scholar

[113] View Article

[114] PubMed/NCBI

[115] Google Scholar

[ref30] 30. Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, Gore MA, et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 2010; 42: 355–360. pmid:20208535
View Article
PubMed/NCBI
Google Scholar

[117] View Article

[118] PubMed/NCBI

[119] Google Scholar

[ref31] 31. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010; 11: 459–463. pmid:20548291
View Article
PubMed/NCBI
Google Scholar

[121] View Article

[122] PubMed/NCBI

[123] Google Scholar

[ref32] 32. Sul JH, Eskin E. Mixed models can correct for population structure for genomic regions under selection. Nat Rev Genet. 2013; 14: 300. pmid:23438871
View Article
PubMed/NCBI
Google Scholar

[125] View Article

[126] PubMed/NCBI

[127] Google Scholar

[ref33] 33. Rosenberg N, Nordborg M. A general population-genetic model for the production by population structure of spurious genotype-phenotype associations in discrete, admixed or spatially distributed populations. Genetics. 2006; 173: 1665–1678. pmid:16582435
View Article
PubMed/NCBI
Google Scholar

[129] View Article

[130] PubMed/NCBI

[131] Google Scholar

[ref34] 34. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. Complex Systems 1695. 2006. Available: http://igraph.org.

[ref35] 35. Hanneman RA, Riddle M. Introduction to social network methods. Riverside, CA: University of California, Riverside. Available: http://faculty.ucr.edu/~hanneman.

[ref36] 36. Schneider S, Roessli D, Excoffier L. arlequin V2.0: a Software for Population Genetic Analysis. Genetics and Biometry Laboratory, University of Geneva, Geneva; 2000.

[ref37] 37. Newman M, Girvan M. Finding and evaluating community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2004; 69: 026113. pmid:14995526
View Article
PubMed/NCBI
Google Scholar

[136] View Article

[137] PubMed/NCBI

[138] Google Scholar

[ref38] 38. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008; P10008.

[ref39] 39. 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491: 56–65 pmid:23128226
View Article
PubMed/NCBI
Google Scholar

[141] View Article

[142] PubMed/NCBI

[143] Google Scholar

[ref40] 40. Price AL, Weale ME, Patterson N, Myers SR, Need AC, Shianna KV, et al. Long-range LD can confound genome scans in admixed populations. Am J Hum Genet. 2008; 83: 132–5; author reply 135–9. pmid:18606306
View Article
PubMed/NCBI
Google Scholar

[145] View Article

[146] PubMed/NCBI

[147] Google Scholar

Figures

Abstract

Introduction

Methods

Step 1

Step 2

Community detection

Results

Data description and application

Evaluation via simulated association studies

Discussion

Supporting Information

S1 Table. Contingency table for American subpopulations, rows correspond to unconnected components, columns to actual subpopulations.

S2 Table. Contingency table for African subpopulations, rows correspond to unconnected components, columns to actual subpopulations.

S3 Table. Contingency table for Asian subpopulations, rows correspond to unconnected components, columns to actual subpopulations.

S4 Table. Contingency table for European subpopulations, rows correspond to unconnected components, columns to actual subpopulations.

Author Contributions

References