Strategies for Enriching Variant Coverage in Candidate Disease Loci on a Multiethnic Genotyping Array

  • Stephanie A. Bien ,

    ccarlson@fredhutch.org (CSC); sbien@fredhutch.org (SAB)

    Affiliation Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

  • Genevieve L. Wojcik,

    Affiliation Department of Genetics, Stanford University, Stanford, California, United States of America

  • Niha Zubair,

    Affiliation Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

  • Christopher R. Gignoux,

    Affiliation Department of Genetics, Stanford University, Stanford, California, United States of America

  • Alicia R. Martin,

    Affiliation Department of Genetics, Stanford University, Stanford, California, United States of America

  • Jonathan M. Kocarnik,

    Affiliation Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

  • Lisa W. Martin,

    Affiliation Division of Cardiology, George Washington University School of Medicine and Health Sciences, Washington, DC, United States of America

  • Steven Buyske,

    Affiliation Department of Genetics, School of Arts and Sciences, Rutgers University, Piscataway, New Jersey, United States of America

  • Jeffrey Haessler,

    Affiliation Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

  • Ryan W. Walker,

    Affiliations The Charles Bronfman Institute for Personalized Medicine, The Icahn School of Medicine at Mount Sinai, New York, New York, United States of America, The Department of Preventive Medicine, The Icahn School of Medicine at Mount Sinai, New York, New York, United States of America

  • Iona Cheng,

    Affiliation Cancer Prevention Institute of California, Fremont, California, United States of America

  • Mariaelisa Graff,

    Affiliation Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, North Carolina, United States of America

  • Lucy Xia,

    Affiliation Department of Preventive Medicine, Keck School of Medicine, University of Southern California/Norris Comprehensive Cancer Center, Los Angeles, California, United States of America

  • Nora Franceschini,

    Affiliation Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, North Carolina, United States of America

  • Tara Matise,

    Affiliation Department of Genetics, School of Arts and Sciences, Rutgers University, Piscataway, New Jersey, United States of America

  • Regina James,

    Affiliation Division of Intramural Research, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America

  • Lucia Hindorff,

    Affiliation Division of Genomic Medicine, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America

  • Loic Le Marchand,

    Affiliation Department of Epidemiology Program, University of Hawai’i Cancer Center, Honolulu, Hawai’i, United States of America

  • Kari E. North,

    Affiliation Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, North Carolina, United States of America

  • Christopher A. Haiman,

    Affiliation Department of Preventive Medicine, Keck School of Medicine, University of Southern California/Norris Comprehensive Cancer Center, Los Angeles, California, United States of America

  • Ulrike Peters,

    Affiliations Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America, Department of Epidemiology, University of Washington, Seattle, Washington, United States of America

  • Ruth J. F. Loos,

    Affiliations The Charles Bronfman Institute for Personalized Medicine, The Icahn School of Medicine at Mount Sinai, New York, New York, United States of America, The Department of Preventive Medicine, The Icahn School of Medicine at Mount Sinai, New York, New York, United States of America

  • Charles L. Kooperberg,

    Affiliation Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

  • Carlos D. Bustamante,

    Affiliation Department of Genetics, Stanford University, Stanford, California, United States of America

  • Eimear E. Kenny,

    Affiliations The Charles Bronfman Institute for Personalized Medicine, The Icahn School of Medicine at Mount Sinai, New York, New York, United States of America, The Department of Preventive Medicine, The Icahn School of Medicine at Mount Sinai, New York, New York, United States of America

  • Christopher S. Carlson ,

    ccarlson@fredhutch.org (CSC); sbien@fredhutch.org (SAB)

    Affiliations Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America, Department of Epidemiology, University of Washington, Seattle, Washington, United States of America

  • on behalf of PAGE Study

    Membership of the PAGE Consortia is provided in the Acknowledgments.


Abstract

Investigating genetic architecture of complex traits in ancestrally diverse populations is imperative to understand the etiology of disease. However, the current paucity of genetic research in people of African and Latin American ancestry, Hispanic and indigenous peoples in the United States is likely to exacerbate existing health disparities for many common diseases. The Population Architecture using Genomics and Epidemiology, Phase II (PAGE II), Study was initiated in 2013 by the National Human Genome Research Institute to expand our understanding of complex trait loci in ethnically diverse and well characterized study populations. To meet this goal, the Multi-Ethnic Genotyping Array (MEGA) was designed to substantially improve fine-mapping and functional discovery by increasing variant coverage across multiple ethnicities at known loci for metabolic, cardiovascular, renal, inflammatory, anthropometric, and a variety of lifestyle traits. Studying the frequency distribution of clinically relevant mutations, putative risk alleles, and known functional variants across multiple populations will provide important insight into the genetic architecture of complex diseases and facilitate the discovery of novel, sometimes population-specific, disease associations. DNA samples from 51,650 self-identified African ancestry (17,328), Hispanic/Latino (22,379), Asian/Pacific Islander (8,640), and American Indian (653) and an additional 2,650 participants of either South Asian or European ancestry, and other reference panels have been genotyped on MEGA by PAGE II. MEGA was designed as a new resource for studying ancestrally diverse populations. Here, we describe the methodology for selecting trait-specific content for use in multi-ethnic populations and how enriching MEGA for this content may contribute to deeper biological understanding of the genetic etiology of complex disease.

Introduction

Over the last decade, genetic research has made marked advances in cataloging variants associated with complex traits and human diseases, which in turn has shed new light on etiological processes. Initial genome-wide association studies (GWAS) prioritized genetically homogeneous populations to help prevent spurious findings that could result from population stratification. As such, initial genotyping platforms were intended to capture common genetic variation in populations of European descent, and were later designed to enable efficient imputation of variant reference panels derived from very large sets of European-descent controls [1]. The sparsity of genetic research in American minority populations has led to insufficient sample sizes and inadequate genomic tools for diverse populations. Limited understanding of the genetic heterogeneity in disease loci across populations could greatly limit variant discovery efforts and exacerbate existing health disparities in many complex diseases [2]. For example, genetic risk models based on European-descent genetic architecture may, in some cases, have reduced predictive accuracy in populations with greater genetic diversity and less linkage disequilibrium (LD) among variants, such as African-descent populations [3]. Furthermore, as genotyping costs fall, arrays can be enriched for functional variation and for additional variants in known GWAS regions, making them a more powerful research tool. By leveraging differences in genetic architecture across ancestral populations, transethnic studies can be used to home in on likely causal variants in known disease or trait loci. This insight can enable rich inferences about the underlying biology of complex diseases and may improve risk modeling across diverse populations.

The Population Architecture using Genomics and Epidemiology (PAGE) consortium (http://www.pagestudy.org) was initiated in 2008 by the National Human Genome Research Institute (NHGRI) to investigate the epidemiologic architecture of well-replicated genetic variants associated with complex diseases in several large, ethnically diverse population-based studies [4]. The PAGE II consortium, co-funded by the NHGRI and the National Institute on Minority Health and Health Disparities, consists of four large, ongoing population-based studies/consortia. Three of these were members of the initial PAGE collaboration: the Women's Health Initiative (WHI, http://www.whi.org) [5]; the Multiethnic Cohort (MEC; http://www.uhcancercenter.org/research/the-multiethnic-cohort-study-mec) [6]; and a subset of the studies comprising the consortium Causal Variants Across the Life Course (CALiCo), namely Atherosclerosis Risk in Communities (ARIC, http://www.cscc.unc.edu.offcampus.lib.washington.edu/aric/) [7], Coronary Artery Risk Development in Young Adults (CARDIA; http://www.cardia.dopm.uab.edu/) [8], and the Hispanic Community Health Study/Study of Latinos (HCHS/SOL, http://www.cscc.unc.edu.offcampus.lib.washington.edu/hchs/) [9]. The fourth PAGE II study is the Charles Bronfman Institute for Personalized Medicine at the Icahn School of Medicine at Mount Sinai (ISMMS), which curates a biorepository linked to electronic health records (EHR) in a medical care setting (BioMe™, http://icahn.mssm.edu/research/institutes/ipm/programs/biome-biobank). In total, 51,650 participants of self-reported African American/African, Hispanic/Latino, Asian/Pacific Islander, and American Indian ancestry have been genotyped on the Multi-Ethnic Global Array (MEGA). In comparison to previous efforts, the addition of customized content on MEGA was designed to enable deeper functional exploration of known disease-associated regions, particularly for less frequent (1–5%) and rare (<1%) genetic variation.

MEGA was designed predominantly through the collaborative efforts of Illumina, the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), and PAGE II to empower GWAS in diverse ancestry populations. Customized content was included on MEGA to improve our understanding of genetic loci associated with complex human diseases or traits and to evaluate potential genetic heterogeneity across racial/ethnic groups. Specifically, we selected custom content to: 1) explore the generalization or replication of previously reported genotype-phenotype associations at known loci; 2) identify independent or population-specific variant associations within known associated regions; 3) leverage differences in haplotype structure (LD) across populations to home in on likely causal variants in known trait-associated regions (‘fine-mapping’); and 4) perform candidate functional SNP analysis using variants known to be (a) clinically relevant, (b) implicated in candidate pathways, or (c) validated as regulatory through laboratory analyses such as allelic reporter assays. Given that functional variants for the majority of GWAS signals have not been identified, the design of MEGA will enable researchers to provide new understanding of the biology underlying known disease associations and improve the generalizability of risk models across populations. Here, we describe the custom-designed MEGA content and discuss valuable insights gained through design efforts that may be of interest to the greater research community.

Methods

MEGA content allocation

MEGA content was partitioned into two major categories: 1) ‘backbone content’ used for agnostic GWAS and exome analyses in ancestrally diverse populations, and 2) ‘PAGE hand-curated custom content’ for targeted analyses aimed at discovering causal variants. The ‘backbone’ of the array has been described elsewhere [10,11] (https://www.pagestudy.org/index.php/multi-ethnic-genotyping-array) and is summarized in Table 1. Briefly, the backbone contains highly informative SNPs for GWAS analyses in European and East Asian descent populations for backwards compatibility with other genotyping arrays. These variants, often referred to as tag SNPs, are positioned in regions of the genome with high LD and typically represent common haplotypes in populations of either European or East Asian descent. However, the vast majority of the MEGA backbone content (83%) empowers GWAS and exome analyses in African ancestry and Hispanic/Latino populations. The backbone content also includes the tag SNPs for phenotype associations from the NHGRI GWAS catalog, SNPs that are mentioned in four or more publications (‘SNPs in Publications’ database from the UCSC Table Browser) [12], and clinically relevant variants (Table 1).

Overview of custom content selection

The custom content available on MEGA was taken from four primary sources: large-scale GWAS, thorough literature review, publicly available variant databases (e.g. 1000 Genomes Phase 3 Project, ClinVar, OMIM, UCSC Table Browser ‘SNPs in Publications’ track), and recommendations from experts on traits of interest to PAGE II. Along with variant annotation tools (e.g. Variant Effect Predictor and ANNOVAR), these resources were used to prioritize known trait loci and select content that could be used to: 1) replicate or generalize index GWAS associations, 2) augment GWAS tagging SNPs in priority regions, 3) enhance exome content in priority regions, 4) fine-map GWAS loci, 5) identify functional regulatory variants, 6) explore penetrance/frequency of clinically reported variants in a population-based cohort, and 7) identify novel variant loci in suspected candidate pathways.

Prioritizing known GWAS trait loci

Summary statistics for the largest available GWAS were mined to nominate both known and candidate loci tagging effects for traits related to type 2 diabetes (T2D), inflammation, lipids, coronary heart disease (CHD), blood pressure, kidney disease, anthropometric measurements, reproductive traits, and various lifestyle traits such as alcohol consumption and smoking. These traits were prioritized based on the availability of phenotypic data in the samples selected for genotyping in PAGE II. For GWAS-based datasets, we queried summary-level data for the aforementioned traits, including both published and unpublished datasets. Using the NHGRI GWAS catalog [13] (available at: http://www.ebi.ac.uk/gwas; date of access: March 19, 2014), we first identified traits directly or indirectly related to prioritized PAGE II traits. To maximize power, associations directly related to our traits were rank ordered, and studies with smaller discovery populations were prioritized to give greater weight to variants with larger effects. The top 500 ranked associations were selected for locus refinement. For GWAS augmentation (described below), loci were defined as 100 kb upstream and downstream (200 kb total) of the index SNP position in the GRCh37/hg19 build.
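The locus definition and ranking described above can be sketched as follows. This is a minimal illustration, not the pipeline used by PAGE II: the function names, the tuple layout, and the reading of "prioritized" as a simple ascending sort on discovery sample size are all assumptions, and the rsIDs in any usage are placeholders.

```python
from typing import List, NamedTuple, Tuple

FLANK_BP = 100_000  # 100 kb on each side of the index SNP, as stated in the text


class Locus(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (GRCh37/hg19 coordinates assumed)
    end: int    # 1-based, inclusive


def locus_window(chrom: str, index_pos: int, flank: int = FLANK_BP) -> Locus:
    """Window of +/- `flank` bp around an index SNP, clipped at position 1."""
    return Locus(chrom, max(1, index_pos - flank), index_pos + flank)


def top_ranked(associations: List[Tuple[str, int]], n: int = 500) -> List[str]:
    """Rank (snp_id, discovery_sample_size) pairs and keep the top n.

    Assumption: smaller discovery samples rank first, as a stand-in for
    'greater weight to variants with larger effects'.
    """
    ranked = sorted(associations, key=lambda assoc: assoc[1])
    return [snp_id for snp_id, _ in ranked[:n]]
```

With the default flank, each window spans 200 kb plus the index position itself, unless it is clipped at the start of the chromosome.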

Variant annotation

All variants selected for inclusion were annotated using the ‘Snp 142 common’ and ‘All SNPs (142)’ datasets in the UCSC Table Browser. We also used the bedtools utility ‘intersect’ to identify overlap between variant lists, such as the list of functional SNPs, and PAGE II trait loci of interest (GWAS loci and prioritized trait group genes). Coding variants were annotated using multiple annotation methods (ANNOVAR, Variant Effect Predictor, Variant Annotation Integrator, and Variant Annotator-GATK), and each variant was assigned the most deleterious annotation predicted by any of these methods.
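Choosing one annotation when several tools disagree can be sketched as below. The severity ordering and the consequence labels are hypothetical and chosen only for illustration. The text does not state which ordering was used, and each annotation tool has its own vocabulary that would need to be mapped onto a common set of terms first.

```python
from typing import Iterable, Optional

# Hypothetical ordering, most to least deleterious. Not taken from the paper.
SEVERITY = [
    "stop_gained",
    "frameshift",
    "stop_lost",
    "splicing",
    "inframe",
    "nonsynonymous",
    "5_prime_UTR",
    "3_prime_UTR",
    "synonymous",
]
RANK = {term: i for i, term in enumerate(SEVERITY)}


def most_deleterious(annotations: Iterable[str]) -> Optional[str]:
    """Return the most severe recognised annotation, or None if none is recognised."""
    known = [a for a in annotations if a in RANK]
    if not known:
        return None
    return min(known, key=RANK.__getitem__)
```

Unrecognised labels are ignored rather than raising an error, so a variant annotated only with terms outside the list returns `None` and can be flagged for manual review.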

Augmentation of GWAS tagging SNPs in priority loci

MEGA contains a GWAS scaffold designed for enhanced imputation accuracy across multiple populations [11]. In addition to providing an essential tool that can be used for more comprehensive genetic studies in diverse populations (for both less frequent and common variants), we aimed to enhance genome-wide coverage in regions previously implicated for PAGE II traits of interest. Our methodology to develop the improved multi-ethnic tag SNP selection for MEGA [14] was extended to the custom content regions. Assigning the rest of MEGA as fixed content, tags were selected to enhance coverage in the regions of interest across all 6 continental populations found in 1000 Genomes Project Phase 3 data: Admixed African-Descent (AAC), African (AFR), Americas (AMR), East Asian (ASN), European (EUR), and South Asian (SAS). Tags were selected using a minimum MAF of 1% and a minimum LD r2 of 0.2. GWAS loci were defined as 100kb upstream and downstream (200kb total) of the index SNP position in the GRCh37/hg19 build.
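To illustrate the idea of adding tags on top of fixed content, the sketch below uses a generic greedy set-cover heuristic. It is not the multi-ethnic tag selection method of reference [14], which is not described here. It handles a single population, assumes the MAF filter has already been applied, and expects pairwise r2 values in a dictionary keyed by unordered SNP pairs (a hypothetical layout).

```python
from typing import Dict, FrozenSet, Iterable, List


def r2_lookup(r2: Dict[FrozenSet[str], float], a: str, b: str) -> float:
    """Pairwise r2; a SNP is perfectly correlated with itself, unknown pairs are 0."""
    return 1.0 if a == b else r2.get(frozenset((a, b)), 0.0)


def greedy_tags(
    r2: Dict[FrozenSet[str], float],
    candidates: Iterable[str],
    fixed: Iterable[str] = (),
    min_r2: float = 0.2,  # threshold quoted in the text
) -> List[str]:
    """Pick additional tags until every candidate SNP is captured at min_r2."""
    candidates = list(candidates)
    fixed = list(fixed)
    # SNPs already captured by fixed content need no additional tag.
    uncovered = {
        s for s in candidates
        if not any(r2_lookup(r2, s, f) >= min_r2 for f in fixed)
    }
    chosen: List[str] = []
    while uncovered:
        # Choose the candidate that captures the most still-uncovered SNPs.
        best = max(
            candidates,
            key=lambda c: sum(r2_lookup(r2, c, s) >= min_r2 for s in uncovered),
        )
        chosen.append(best)
        uncovered -= {s for s in uncovered if r2_lookup(r2, best, s) >= min_r2}
    return chosen
```

The loop always terminates because every uncovered SNP is itself a candidate and tags itself. Extending this to several populations would require a rule for combining per-population coverage, which is a design choice the sketch leaves open.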

Augmentation of exome SNPs in priority loci

In the prioritized genes of interest, we lowered the MAF threshold compared to the rest of the MEGA backbone to allow inclusion of doubleton or singleton observations. Synonymous variants were also included in these genes because evidence suggests these can also be functional, whether as miRNA targets or as exonic splice enhancers/silencers [15–23].

Fine-mapping SNPs in priority loci

For our list of GWAS catalog prioritized trait associations (see details above), we defined the variants at a locus as the index variant and all correlated SNPs with an r2 ≥0.6 within 200 Kb. LD calculations were derived from 1000 Genomes Phase 3 Project SNP coverage (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr10.phase3_shapeit2_mvncall_integrated_v2.20130502.genotypes.vcf.gz). Using vcftools v0.1.12a, pairwise LD was calculated in the relevant 1000 Genomes Project super populations: EUR was used when the index discovery sample included European-descent populations, and ASN was used for discovery populations of East or South Asian-descent [24]. Tagged SNPs were excluded if they were triallelic or they had a MAF < 1%.
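The locus definition above (index variant plus all SNPs with r2 ≥ 0.6 within 200 kb) can be sketched as follows. The analysis itself used vcftools. This stand-alone version computes r2 as the squared Pearson correlation of 0/1/2 genotype codes, and the function names and dictionary layout are assumptions made for the example. The triallelic and MAF exclusions are not shown.

```python
from typing import Dict, List, Sequence


def genotype_r2(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic site carries no LD information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def locus_variants(
    index_id: str,
    positions: Dict[str, int],
    genotypes: Dict[str, Sequence[float]],
    min_r2: float = 0.6,
    max_dist: int = 200_000,
) -> List[str]:
    """Index variant plus SNPs within max_dist bp that reach min_r2 with it."""
    index_pos = positions[index_id]
    members = [index_id]
    for snp, pos in positions.items():
        if snp == index_id or abs(pos - index_pos) > max_dist:
            continue
        if genotype_r2(genotypes[index_id], genotypes[snp]) >= min_r2:
            members.append(snp)
    return sorted(members, key=positions.__getitem__)
```

In practice the genotype vectors would be restricted to samples from the super population matching the discovery study (EUR or ASN, as described above) before calling `locus_variants`.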

Identification of functional regulatory variants

To perform functionally informed analyses, we included variants that regulate gene expression as determined through laboratory analyses such as luciferase reporter assays and electrophoretic mobility shift assays (EMSA). To achieve this, we mined both PubMed and Google Scholar to extract terms in the abstract and text. Our search included the following terms: “reporter”, “EMSA”, “luciferase”, “catalase”, “GFP”, or “Chromatin Immunoprecipitation.”

One of the largest challenges in this endeavor was identifying which variant was being reported. There are many different ways to reference a polymorphism, and prior to GWAS, variants were almost never referred to by the now-standardized dbSNP rsID number. For instance, pharmacogenetic variants were often reported using a ‘gene asterisk variant’ notation (e.g. CYP2A6*9), and many other studies reported variants relative to Transcription Start Sites (TSS; e.g. -1031T/C). Furthermore, upstream and downstream references were arbitrarily defined as ‘+’ or ‘-’. When a sequence was provided, we used the “BLAT” tool from the UCSC Genome Browser (http://genome.ucsc.edu/) to identify the correct variant. If no sequence was available but primers were provided, we used the UCSC ‘In-Silico PCR’ tool. Occasionally we were able to find a link between the rsID number and the alias in a different article using the ‘SNPs in Pubs’ track. Articles were further reviewed in an effort to verify that reports of ‘differential allelic expression’ were statistically significant after Bonferroni correction when multiple variants were analyzed.
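The Bonferroni check mentioned above amounts to comparing each p-value against the significance level divided by the number of variants tested in the article. A minimal sketch follows. The default alpha of 0.05 is an assumption, since the text does not state the level used, and the function name is illustrative.

```python
from typing import List, Sequence


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag which tests remain significant at alpha after Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # per-test threshold for m tests
    return [p <= threshold for p in p_values]
```

For an article reporting three variants with p-values 0.01, 0.02 and 0.2, the per-test threshold is 0.05 / 3 (about 0.0167), so only the first variant would be retained.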

Results

Augmented GWAS loci and trait related genes

After hand curation of the GWAS catalog, tag SNPs directly related to the following traits were selected: T2D (n = 95), inflammation (n = 527), lipids (n = 379), CHD (n = 280), blood pressure (n = 121), kidney disease (n = 92), anthropometric (n = 644), and lifestyle or menstrual related traits (n = 107). In total, we rank ordered the list of 4,453 unique locus tag SNPs. The top 500 ranked associations were selected for locus refinement.

In addition to prioritizing these known and well-powered GWAS loci, 166 genes (Table 2) were prioritized across these eight trait groups based on thorough literature review, clinical relevance, and recommendation by experts with proficient genetic knowledge and clinical knowledge on the priority PAGE II traits. Gene coordinates were defined using the UCSC Table Browser UCSC Genes (GRCh37/hg19) and taking the union of the reported transcripts. These 166 gene coordinates were used to augment exome content and select GWAS tag SNPs.
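Taking the union of a gene's reported transcripts is an interval-merging step, sketched below under the assumption that all transcripts lie on the same chromosome and are given as (start, end) pairs. For overlapping transcripts the result is a single interval from the smallest start to the largest end.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]


def union_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching (start, end) intervals into their union."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if this one reaches further.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

If a gene has transcripts that do not overlap, the function returns more than one interval, and whether to bridge the gap into a single span is a choice the text does not specify.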

Custom content on MEGA

After applying Illumina design filters to select SNPs that are less likely to fail and removing overlap between categories, a total of 48,091 SNPs were included as custom design for MEGA (Table 3). In comparison to the GWAS backbone, we observed that the customized content was enriched for rarer variation (Fig 1).

Fig 1. Enrichment of rarer variation in custom content.

Comparison of minor allele frequency distribution between the MEGA GWAS backbone and the custom content stratified by race. Allele frequencies were calculated in PAGE II study populations.

https://doi.org/10.1371/journal.pone.0167758.g001

Table 3. Custom Content Variants Before and After Design.

https://doi.org/10.1371/journal.pone.0167758.t003

Replication or generalization of index associations

All published genome-wide significant variants related to PAGE II traits of interest were specifically chosen for inclusion on MEGA. In addition, we included 302 unpublished variants reaching genome-wide significance in PAGE I studies or in the large consortia GIANT (https://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium) and MAGIC (http://www.magicinvestigators.org/). These loci were reported in European, Hispanic, Asian, or African American populations. We also performed a literature search in PubMed for targeted genotyping arrays (‘MetaboChip’, ‘ExomeChip’, ‘OncoArray’, and ‘ImmunoChip’) to identify additional associations that may not have been included in the NHGRI GWAS catalog. This search identified an additional 102 variants significantly associated with one or more PAGE II prioritized traits.

Augmented GWAS Tagging SNPs in Priority Regions

A total of 28,465 tag SNPs were selected for augmenting coverage over 84.7 Mb containing 157 priority genes (+/- 50 kb flanking regions) and 459 highly ranked GWAS loci (+/- 100 kb flanking regions). Of those tag SNPs, 25,388 were included on MEGA after QC (S1 Table). In addition to the background GWAS scaffold, there were on average 41 tag SNPs per prioritized gene and GWAS top hit (S7 and S8 Tables).

Imputation accuracy was assessed through a leave-one-out internal validation approach, using the 1000 Genomes Project phase 3 data. Pearson’s correlation was estimated between the imputed dosages and the genotypes found in the original data. A scaffold of all MEGA sites was compared to MEGA excluding all custom content, which includes the enhanced exome content and the SNPs added to fine-map previously implicated GWAS loci. An increase in performance was seen across all populations (Fig 2). The largest increase in imputation accuracy was found in low frequency loci, with a MAF below 5%. Imputation accuracy improvement was most notable for AA populations, with an accuracy increase of 3.02 percentage points in AFR for loci with MAF between 0.5% and 1% (Table 4).
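The accuracy metric described above, Pearson's correlation between imputed dosages and true genotypes summarised by allele frequency, can be sketched as follows. The bin boundaries, the tuple layout for sites, and the use of a simple mean within each bin are assumptions made for the example rather than details taken from the analysis.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence, Tuple

Site = Tuple[float, Sequence[float], Sequence[float]]  # (MAF, imputed dosages, true genotypes)
Bin = Tuple[float, float]                              # [low, high) MAF range


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def accuracy_by_maf_bin(sites: List[Site], bins: List[Bin]) -> Dict[Bin, Optional[float]]:
    """Mean dosage-genotype correlation per MAF bin; None for bins with no sites."""
    result: Dict[Bin, Optional[float]] = {}
    for low, high in bins:
        rs = [pearson(d, g) for maf, d, g in sites if low <= maf < high]
        result[(low, high)] = sum(rs) / len(rs) if rs else None
    return result
```

Running this once with dosages imputed from the full scaffold and once from the scaffold without custom content, then differencing the per-bin values, gives the kind of comparison summarised in Table 4.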

Fig 2. Improved imputation accuracy found with custom content sites in regions of interest.

Solid lines denote the imputation accuracy of MEGA including the custom content, while dashed lines indicate the performance of MEGA without the custom content. Admixed populations are on the left, with continental populations found on the right.

https://doi.org/10.1371/journal.pone.0167758.g002

Table 4. Enhanced Imputation Accuracy with Custom Content Addition.

https://doi.org/10.1371/journal.pone.0167758.t004

Enhanced exome in priority regions

In the 166 genes of interest, we selected 14,000 additional variants for inclusion on MEGA. In total, 5,733 additional exome content SNPs were included on MEGA after exclusion of those with poor quality, mapping issues, or other synthesis issues. Using multiple annotation methods (ANNOVAR, Variant Effect Predictor, Variant Annotation Integrator, and Variant Annotator-GATK) and taking the most deleterious annotation assigned to each variant, we obtained: 36 stop gained, 2 stop lost, 97 frameshift, 49 in-frame variants, 12 splicing variants, 1,144 3’ UTR, 810 5’ UTR, 1,725 non-synonymous, and 3,144 synonymous variants. The median allele frequency of these variants across populations was 0.05% (interquartile range = 0.04%-0.1%).

Fine-mapping GWAS loci

For our list of GWAS catalog prioritized trait associations, we defined the variants at a locus as the index variant and all correlated SNPs with an r2 ≥0.6 within 200 Kb. This resulted in 15,049 variants representing 459 prioritized loci directly related to our traits of interest (S3 Table). We included up to 338 variants per fine-mapped locus, with an average of 33 (SD = 38) variants. After applying the Illumina design score filter of >0.5, 12,199 SNPs in 451 independent loci remained. The average number of variants per locus was similar with a mean of 27 (SD = 32) variants. The average minor allele frequency was 39% across the 12,199 variants.

Identification of Functional Regulatory Variants

In total, our literature search identified 2,610 variants showing significant differential allelic gene expression. Of the variants that were tested in an allelic assay and described as functional, only 260 (10%) were strongly tagged (r2>0.6) by a GWAS index SNP. Although these variants could be the functional variants underlying the associations, this is not necessarily the case, and in some circumstances there may be multiple or ethnicity-specific causal variants in a known locus. Nevertheless, this finding supports the assertion that for the vast majority of associated loci the underlying causal variants have not been fine-mapped. Furthermore, while many of the GWAS loci identified to date are positioned in intergenic regions, until recently most of the regulatory variant follow-up has been conducted in promoters where the target gene is known (Fig 3).

Fig 3. The functional hypothesis tested (‘Hypothesized Function’) by year for 2,610 variants reported in a functional allelic assay found through literature review.

https://doi.org/10.1371/journal.pone.0167758.g003

Selection of medically important variants

Putative risk variants identified by exome sequencing of familial and population based samples, as well as those derived from literature review for highly penetrant diseases related to our more common traits of interest, were also included on the array. To accomplish this, we performed a systematic literature and database search for all mutations known to cause medical traits like hyper/hypolipidemia, hypercholesterolemia, dyslipidemia, lipodystrophy, arteriosclerotic heart disease, chronic kidney disease, extreme obesity, maturity onset diabetes of the young (MODY), long QT syndrome, Brugada syndrome, alcohol and nicotine dependence or sensitivity, systemic lupus erythematosus, and airway hyper-reactivity. Using ClinVar [25], OMIM [26] and NextBio (http://www.illumina.com/informatics/research/biological-data-interpretation/nextbio.html) we obtained 2,655 candidate variants.

Identification of variants in candidate pathways

We selected 1,546 variants for candidate pathway analyses based on their involvement with pharmacokinetics or pharmacology (for example: absorption, distribution, metabolism and excretion-ADME; or drug-metabolizing enzymes and transporters-DMET) from publicly available resources (www.pharmaadme.org, http://www.snpedia.com/index.php/Pharma_DMET, http://www.drugbank.ca/genobrowse/snp-adr).

Discussion

The design of MEGA enables the most comprehensive examination of genetic architecture across ancestrally diverse populations to date, and thus provides a key tool for discovering important insights for many complex diseases and traits. In this manuscript, we have outlined the custom content that was hand-curated by PAGE II in order to improve our understanding of genetic loci associated with complex human diseases or traits and to evaluate potential genetic heterogeneity across racial/ethnic groups. This high-value content includes: (1) variants relevant to common, complex phenotypes of interest to PAGE II; (2) candidate functional variants in non-coding regions curated from the literature; (3) fine-mapping content selected to refine established GWAS signals reported in the GWAS catalog; and (4) augmented coverage in candidate regions containing either genes of interest or relevant GWAS associations.

The extensive effort taken to utilize existing knowledge on the genetic etiology of complex traits has enabled PAGE II to enrich the genotyping content in those regions most likely to influence our traits of interest across all populations. Ultimately, the usage of MEGA in this large multi-ethnic study will provide necessary insight into the genetics of complex diseases and help ensure that the benefits gained from genetic research are equitably distributed across diverse populations. For instance, development of genetic risk models based on European genetic architecture alone may in some cases reduce predictive accuracy in other ethnicities [27–30]. Additionally, by leveraging differences in genetic architecture across ancestral populations, transethnic studies can be used to home in on likely functional variants in known disease or trait loci. Genetic risk models built on tagging variation that is informative across populations will help ensure that all ethnic/racial groups benefit from the knowledge gained from the public investment in genetic research. Furthermore, by enabling functional insight into genetic risk loci, inferences on the underlying biology of complex diseases can better inform the development of therapies.

The customized content selection sought to identify a set of truly functional variants from thorough literature review. MEGA includes variants that mark associations with gene expression (6,689 genomic loci regulating mRNA expression, i.e. eQTLs) and those predicted in silico to have regulatory potential (RegulomeDB); these variants represent a gold standard for establishing regulatory function. The inclusion of variants that have been shown to influence gene expression in the laboratory will assist in fine-mapping associated loci and enable candidate variant associations. In our hand-curation process, we found that although more than a thousand allelic functional assays have been published, many predate GWAS and most were not conducted as a follow-up to a GWAS. Furthermore, most of the associated loci to date did not overlap with a variant shown to be functional. Similarly, the vast majority of variants shown to be functional through laboratory assays have been positioned in promoters, although most associations to date have been positioned in enhancer regions. As such, for the vast majority of associated loci there remains a significant amount of work to be done in identifying the underlying causal variant(s) and target genes. As efforts were taken to include variants predicted to have regulatory function or a deleterious effect in the coding region, we believe the inclusion of tagging variants that are most informative across populations will enable better prioritization of likely causal variant associations and thus streamline laboratory follow-up.

To facilitate rapid dissemination of results and methods, as well as promote new collaborations with other studies, PAGE II investigators have created a link within the study website (http://pagestudy.org/) to report usage of MEGA.

Supporting Information

S1 Table. Variants selected for enriching GWAS coverage.

https://doi.org/10.1371/journal.pone.0167758.s001

(TXT)

S2 Table. Variants selected for enriching exome coverage.

https://doi.org/10.1371/journal.pone.0167758.s002

(TXT)

S3 Table. Variants selected for enriching locus fine-mapping content.

https://doi.org/10.1371/journal.pone.0167758.s003

(TXT)

S4 Table. Variants included from regulatory literature review.

https://doi.org/10.1371/journal.pone.0167758.s004

(TXT)

S5 Table. Variant prioritized by trait groups.

https://doi.org/10.1371/journal.pone.0167758.s005

(TXT)

S7 Table. Number of GWAS tagging SNPs per gene.

https://doi.org/10.1371/journal.pone.0167758.s007

(TXT)

S8 Table. Number of GWAS tagging SNPs per SNP.

https://doi.org/10.1371/journal.pone.0167758.s008

(TXT)

Acknowledgments

The PAGE II investigators thank Alex Reiner, Laura Bierut, Maggie C. Ng, Matthew Sampson, Jeffrey Kopp, Eli Stahl, Girish Nadkarni, and Myriam Fornage for sharing their expertise in variant and genomic loci selection for the custom PAGE II content on MEGA.

We thank the participants in the BioMe Biobank for their invaluable contribution to biomedical research.

Author Contributions

  1. Conceptualization: CSC SAB EEK CDB CLK RJFL UP CAH KEN LLM LH RJ TM NF.
  2. Data curation: SAB GLW NZ LWM ARM RWW IC MG LX NF EEK CSC JH.
  3. Funding acquisition: CSC EEK CDB CLK RJFL UP CAH KEN LLM LH RJ TM NF.
  4. Methodology: SAB GLW CRG EEK CSC.
  5. Project administration: TM RJ LH.
  6. Resources: CDB.
  7. Supervision: CSC EEK CDB CLK RJFL UP CAH KEN LLM LH RJ TM NF.
  8. Visualization: SAB GLW.
  9. Writing – original draft: SAB GLW JMK SB RWW MG LH.
  10. Writing – review & editing: SAB GLW ARM JMK LWM SB RWW MG LH.

References

  1. Rosenberg NA, Huang L, Jewett EM, Szpiech ZA, Jankovic I, Boehnke M (2010) Genome-wide association studies in diverse populations. Nat Rev Genet 11: 356–366. nrg2760 [pii]. pmid:20395969
  2. Oh SS, Galanter J, Thakur N, Pino-Yanes M, Barcelo NE, White MJ et al. (2015) Diversity in Clinical and Biomedical Research: A Promise Yet to Be Fulfilled. PLoS Med 12: e1001918. PMEDICINE-D-15-01863 [pii]. pmid:26671224
  3. Campbell MC, Tishkoff SA (2008) African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu Rev Genomics Hum Genet 9: 403–433. pmid:18593304
  4. Matise TC, Ambite JL, Buyske S, Carlson CS, Cole SA, Crawford DC et al. (2011) The Next PAGE in understanding complex traits: design for the analysis of Population Architecture Using Genetics and Epidemiology (PAGE) Study. Am J Epidemiol 174: 849–859. kwr160 [pii]. pmid:21836165
  5. The Women's Health Initiative Study Group. (1998) Design of the Women's Health Initiative clinical trial and observational study. Control Clin Trials 19: 61–109. S0197245697000780 [pii]. pmid:9492970
  6. Kolonel LN, Henderson BE, Hankin JH, Nomura AM, Wilkens LR, Pike MC et al. (2000) A multiethnic cohort in Hawaii and Los Angeles: baseline characteristics. Am J Epidemiol 151: 346–357. pmid:10695593
  7. The ARIC investigators. (1989) The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. Am J Epidemiol 129: 687–702. pmid:2646917
  8. Hughes GH, Cutter G, Donahue R, Friedman GD, Hulley S, Hunkeler E et al. (1987) Recruitment in the Coronary Artery Disease Risk Development in Young Adults (Cardia) Study. Control Clin Trials 8: 68S–73S. pmid:3440391
  9. Sorlie PD, Aviles-Santa LM, Wassertheil-Smoller S, Kaplan RC, Daviglus ML, Giachello AL et al. (2010) Design and implementation of the Hispanic Community Health Study/Study of Latinos. Ann Epidemiol 20: 629–641. S1047-2797(10)00072-4 [pii]. pmid:20609343
  10. Johnston HR, Rafaels N, Hu D, Torgerson D, Chavan S, Gao J et al. (2015) Utilizing an African specific genotyping array for a large-scale GWAS for asthma in African Americans (Abstract/Program 20). Presented at the 65th Annual Meeting of The American Society of Human Genetics, Baltimore, MD.
  11. Gignoux CR, Wojcik GL, Johnston HR, Fuchsberger C, Shringarpure S, Martin AR et al. (2016) A Multi-Ethnic Genotyping Array for the Next Generation of Association Studies (Abstract/Program 1885). Presented at the 65th Annual Meeting of The American Society of Human Genetics, Baltimore, MD.
  12. Haeussler M, Gerner M, Bergman CM (2011) Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics 27: 980–986. btr043 [pii]. pmid:21325301
  13. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42: D1001–D1006. gkt1229 [pii]. pmid:24316577
  14. Wojcik GL, Gignoux CR, Fuchsberger C, Taliun D, Welch R, Martin AR et al. (2015) Tag SNP selection for low frequency variant imputation in populations of diverse ancestry (Abstract/Program 1291).
  15. Bruun GH, Doktor TK, Andresen BS (2013) A synonymous polymorphic variation in ACADM exon 11 affects splicing efficiency and may affect fatty acid oxidation. Mol Genet Metab 110: 122–128. S1096-7192(13)00207-2 [pii]. pmid:23810226
  16. Griseri P, Bourcier C, Hieblot C, Essafi-Benkhadir K, Chamorey E, Touriol C et al. (2011) A synonymous polymorphism of the Tristetraprolin (TTP) gene, an AU-rich mRNA-binding protein, affects translation efficiency and response to Herceptin treatment in breast cancer patients. Hum Mol Genet 20: 4556–4568. ddr390 [pii]. pmid:21875902
  17. Kimchi-Sarfaty C, Oh JM, Kim IW, Sauna ZE, Calcagno AM, Ambudkar SV et al. (2007) A "silent" polymorphism in the MDR1 gene changes substrate specificity. Science 315: 525–528. 1135308 [pii]. pmid:17185560
  18. Nielsen KB, Sorensen S, Cartegni L, Corydon TJ, Doktor TK, Schroeder LD et al. (2007) Seemingly neutral polymorphic variants may confer immunity to splicing-inactivating mutations: a synonymous SNP in exon 5 of MCAD protects from deleterious mutations in a flanking exonic splicing enhancer. Am J Hum Genet 80: 416–432. S0002-9297(07)60091-3 [pii]. pmid:17273963
  19. Ohtsuki T, Koga M, Ishiguro H, Horiuchi Y, Arai M, Niizato K et al. (2008) A polymorphism of the metabotropic glutamate receptor mGluR7 (GRM7) gene is associated with schizophrenia. Schizophr Res 101: 9–16. S0920-9964(08)00091-1 [pii]. pmid:18329248
  20. Papp AC, Pinsonneault JK, Wang D, Newman LC, Gong Y, Johnson JA et al. (2012) Cholesteryl Ester Transfer Protein (CETP) polymorphisms affect mRNA splicing, HDL levels, and sex-dependent cardiovascular risk. PLoS One 7: e31930. PONE-D-11-16896 [pii]. pmid:22403620
  21. Suhy A, Hartmann K, Newman L, Papp A, Toneff T, Hook V et al. (2014) Genetic variants affecting alternative splicing of human cholesteryl ester transfer protein. Biochem Biophys Res Commun 443: 1270–1274. S0006-291X(13)02198-0 [pii]. pmid:24393849
  22. Wang D, Johnson AD, Papp AC, Kroetz DL, Sadee W (2005) Multidrug resistance polypeptide 1 (MDR1, ABCB1) variant 3435C>T affects mRNA stability. Pharmacogenet Genomics 15: 693–704. 01213011-200510000-00003 [pii]. pmid:16141795
  23. Zhu H, Tucker HM, Grear KE, Simpson JF, Manning AK, Cupples LA et al. (2007) A common polymorphism decreases low-density lipoprotein receptor exon 12 splicing efficiency and associates with increased cholesterol. Hum Mol Genet 16: 1765–1772. ddm124 [pii]. pmid:17517690
  24. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA et al. (2011) The variant call format and VCFtools. Bioinformatics 27: 2156–2158. btr330 [pii]. pmid:21653522
  25. Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S et al. (2015) ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. gkv1222 [pii].
  26. McKusick VA (2007) Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 80: 588–604. S0002-9297(07)61121-5 [pii]. pmid:17357067
  27. Bustamante CD, Burchard EG, De la Vega FM (2011) Genomics for the world. Nature 475: 163–165. 475163a [pii]. pmid:21753830
  28. Carlson CS, Matise TC, North KE, Haiman CA, Fesinmeyer MD, Buyske S et al. (2013) Generalization and dilution of association results from European GWAS in populations of non-European ancestry: the PAGE study. PLoS Biol 11: e1001661. PBIOLOGY-D-13-00491 [pii]. pmid:24068893
  29. Huo D, Olopade OI (2007) Genetic testing in diverse populations: are researchers doing enough to get out the correct message? JAMA 298: 2910–2911. 298/24/2910 [pii]. pmid:18159061
  30. McClellan J, King MC (2010) Genetic heterogeneity in human disease. Cell 141: 210–217. S0092-8674(10)00320-X [pii]. pmid:20403315