Exome sequencing in large, multiplex bipolar disorder families from Cuba

Anna Maaser; Andreas J. Forstner; Jana Strohmaier; Julian Hecker; Kerstin U. Ludwig; Sugirthan Sivalingam; Fabian Streit; Franziska Degenhardt; Stephanie H. Witt; Céline S. Reinbold; Anna C. Koller; Ruth Raff; Stefanie Heilmann-Heimbach; Sascha B. Fischer; Bipolar Disorder Working Group of the Psychiatric Genomics Consortium; Stefan Herms; Per Hoffmann; Holger Thiele; Peter Nürnberg; Heide Löhlein Fier; Guillermo Orozco-Díaz; Deinys Carmenate-Naranjo; Niurka Proenza-Barzaga; Georg W. J. Auburger; Till F. M. Andlauer; Sven Cichon; Beatriz Marcheco-Teruel; Ole Mors; Marcella Rietschel; Markus M. Nöthen

doi:10.1371/journal.pone.0205895

Abstract

Bipolar disorder (BD) is a major psychiatric illness affecting around 1% of the global population. BD is characterized by recurrent manic and depressive episodes, and has an estimated heritability of around 70%. Research has identified the first BD susceptibility genes. However, the underlying pathways and regulatory networks remain largely unknown. Research suggests that the cumulative impact of common alleles with small effects explains only around 25–38% of the phenotypic variance for BD. A plausible hypothesis therefore is that rare, high penetrance variants may contribute to BD risk. The present study investigated the role of rare, nonsynonymous, and potentially functional variants via whole exome sequencing in 15 BD cases from two large, multiply affected families from Cuba. The high prevalence of BD in these pedigrees renders them promising in terms of the identification of genetic risk variants with large effect sizes. In addition, SNP array data were used to calculate polygenic risk scores for affected and unaffected family members. After correction for multiple testing, no significant increase in polygenic risk scores for common, BD-associated genetic variants was found in BD cases compared to healthy relatives. Exome sequencing identified a total of 17 rare and potentially damaging variants in 17 genes. The identified variants were shared by all investigated BD cases in the respective pedigree. The most promising variant was located in the gene SERPING1 (p.L349F), which has been reported previously as a genome-wide significant risk gene for schizophrenia. The present data suggest novel candidate genes for BD susceptibility, and may facilitate the discovery of disease-relevant pathways and regulatory networks.

Citation: Maaser A, Forstner AJ, Strohmaier J, Hecker J, Ludwig KU, Sivalingam S, et al. (2018) Exome sequencing in large, multiplex bipolar disorder families from Cuba. PLoS ONE 13(10): e0205895. https://doi.org/10.1371/journal.pone.0205895

Editor: Klaus Brusgaard, Odense University Hospital, DENMARK

Received: December 13, 2017; Accepted: October 3, 2018; Published: October 31, 2018

Copyright: © 2018 Maaser et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files. The generated exome sequencing data are available at the European Genome-phenome Archive (EGA; https://ega-archive.org) under accession number EGAS00001003085.

Funding: The study was supported by the German Federal Ministry of Education and Research (BMBF) through the Integrated Network IntegraMent (Integrated Understanding of Causes and Mechanisms in Mental Disorders), under the auspices of the e:Med Programme (grant 01ZX1314A/01ZX1614A to M.M.N. and S.C., grant 01ZX1314G/01ZX1614G to M.R.). A.J.F. received support from the BONFOR program of the Medical Faculty of the University of Bonn. M.M.N. is a member of the DFG-funded Excellence-Cluster ImmunoSensation. The study was also supported by the German Research Foundation (DFG; grant FOR2107; RI908/11-1 to M.R.; NO246/10-1 to M.M.N.). Instruments at the UKB NGS core facility were supported in part by grant DFG-INST 217/674-1 FUGG. The funders had no role in the study design, data collection and analysis, the decision to publish, or the preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Bipolar disorder (BD) is a severe neuropsychiatric disorder with an estimated heritability of around 70% [1]. BD is characterized by recurrent episodes of depression and mania or hypomania. The lifetime prevalence of BD in the general population is around 1% [2], and the World Health Organization ranks BD among the top 10 leading causes of disease burden for the age group 15-to-44 years (The global burden of disease: 2004 update) [3, 4].

Molecular genetic research indicates that BD is a multifactorial disorder, in which both genetic and environmental factors impact disease susceptibility [5]. Recent genome-wide association studies (GWAS) have identified the first BD susceptibility genes [3, 6–10]. However, the underlying pathways and regulatory networks remain largely unknown [11]. Research has shown that the cumulative impact of common alleles with small effects explains only around 25–38% of the phenotypic variance for BD [12, 13]. A plausible hypothesis therefore is that rare, high penetrance variants may contribute to BD susceptibility. One approach to the evaluation of this hypothesis is the investigation of large, multigenerational pedigrees with a high prevalence of BD. In such pedigrees, the existence of a genetic variant of strong effect—as inherited from a common ancestor—may be more likely [14].

Results from initial whole exome and genome sequencing studies of BD suggest that the application of next generation sequencing is a promising method for the identification of rare variants in BD candidate genes and disease-related pathways. Recent investigations have comprised pedigree and trio analyses, as well as case-control studies [15–17]. Preliminary results suggest an enrichment of rare variants in gene-sets previously implicated in BD, such as mitogen-activated protein kinase (MAPK) signaling; axon guidance; calcium signaling; cyclic adenosine monophosphate response element binding protein (CREB) signaling; and potassium channels [18–24]. These results also implicate gene-sets identified in de novo studies of autism and schizophrenia in BD etiology [17]. A recent trio-based exome sequencing study suggested that de novo mutations may play a role in BD etiology, particularly in patients with early disease onset [15]. However, the overlap in implicated candidate genes between studies was limited, suggesting that the analysis of further samples and pedigrees is warranted before more definitive conclusions can be drawn.

The aim of the present study was to investigate the role of rare, nonsynonymous, and potentially functional variants in BD susceptibility by performing whole exome sequencing in two large, multiply affected BD families from Cuba. The high lifetime prevalence of BD in these pedigrees renders them promising in terms of the identification of high penetrance genetic risk variants. In addition, polygenic risk score (PRS) based analysis, that uses cumulated information contained in many small association signals throughout the genome, was performed to elucidate the underlying genetic architecture.

Materials and methods

Phenotypic assessment and DNA collection

The present study was performed within the context of the Cuban Danish project on BD and schizophrenia, and was approved by the ethics committee of the National Center of Medical Genetics in Havana, Cuba. Written informed consent was obtained from all participants prior to inclusion.

All participants were recruited in the southeastern region of Cuba and underwent initial psychiatric assessment in 1997 and follow-up assessment in 2016. Each participant was examined at home by a research psychiatrist in cooperation with a local psychiatrist.

Best estimate diagnoses were assigned on the basis of: the Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) (SCID); structured information obtained from other family members; and information from the responsible psychiatrist. DNA was obtained from whole blood by salting-out with saturated sodium chloride solution [25].

Sample description

The first family comprised six generations and 32 individuals (41% male), including 17 individuals with a known psychiatric disorder. DNA was available for a total of 11 individuals (Fig 1A) with the following diagnoses: BD type I (BD I, n = 3); BD type II (BD II, n = 1); BD not otherwise specified (BD NOS, n = 1); alcohol abuse (n = 1); and no current or lifetime psychiatric disorder (healthy subjects, n = 5). The second family (Fig 1B) was bilineal. This family comprised four generations and 71 individuals (54% male), including 32 individuals with a known DSM-IV psychiatric disorder. DNA was available for 20 individuals with the following diagnoses: BD I (n = 4); BD II (n = 9); major depressive disorder (MDD, n = 1); alcohol abuse (n = 2); and healthy subjects (n = 4). To facilitate the analyses, family 2 was divided into subpedigrees 2, 3, and 4. For the 15 individuals selected for whole exome sequencing, additional clinical information is provided in S1 Table.

Download:

Fig 1. Pedigrees of the two investigated Cuban families.

DNA was available for all numbered individuals depicted by a diamond with a central dot. Exome sequencing was performed on subjects framed with a red circle. Orange indicates a diagnosis of BD I, BD II, or BD NOS. Blue indicates other psychiatric phenotypes, comprising recurrent MDD, MDD single episode, and alcohol abuse. Unaffected individuals are indicated by white diamonds. To preserve the anonymity and confidentiality of the families, no information is shown concerning sex or mortality. (A) Pedigree of family 1, including four cases selected for exome sequencing. (B) Family 2 was divided in pedigrees 2, 3, and 4. A total of 11 cases from family 2 were selected for exome sequencing.

https://doi.org/10.1371/journal.pone.0205895.g001

Single nucleotide polymorphism (SNP) genotype data

To confirm relatedness in the pedigrees, reported relationships were compared with SNP array genotype data for all samples with available DNA (n = 31). All individuals were genotyped on the Infinium PsychArray BeadChip (PsychChip, Illumina, San Diego, CA, USA), in accordance with the manufacturer’s recommendations. Genotype data were analyzed using the integrated Illumina data analysis software platform GenomeStudiov2011.1. Quality control (QC) was performed using PLINK (http://zzz.bwh.harvard.edu/plink/download.shtml) [26], in accordance with the standard QC procedures of the Psychiatric Genomics Consortium (PGC) (https://sites.google.com/a/broadinstitute.org/psych-chip-resources/qc-methods#TOC-Standard-PGC-GWAS-QC). Imputation was conducted with IMPUTE2 [27, 28], using the 1000 Genomes Project phase 3 reference and standard parameters [29, 30].

Polygenic risk scores (PRS)

PRS were calculated for all members of the two Cuban families with available SNP array genotype data. The calculations were performed using the summary statistics of the largest GWAS of BD to date. This GWAS was performed by the PGC, and comprised 20,352 cases and 31,358 controls [31]. After clumping of summary statistics based on best-guess genotype data, PRS were calculated using imputed dosage data in R v3.3 [32]. In all PRS computations, the allele with the positive effect direction was considered. The PRS therefore represent weighted, cumulative, additive risk. Cumulative PRS (sum of risk load) were calculated using 10 different p-value thresholds: <5×10⁻⁸, <1×10⁻⁷, <1×10⁻⁶, <1×10^‑5, <1×10⁻⁴, <0.001, <0.01, <0.05, <0.1, and <0.2. PRS were scaled to represent relative risk load (minimum possible cumulative risk load = 0, maximum = 1). PRS analyses were conducted in R v3.3, using the function polygenic of the R package GenABEL, which implements a linear mixed model that takes family structure into account [33]. Gender was used as a covariate in the analysis. In a first step, residuals were calculated in the polygenic model using the formula phenotype ~ covariates, plus the genetic relationship matrix as a random effect. Residuals from this model were then used in a second linear model, with the formula residuals ~ PRS. In the second step, association test statistics with 95% confidence intervals (CI) were calculated using bootstrapping (R package boot, non-parametric bootstrapping using ordinary resampling with 2,000 replications) [34, 35]. In accordance with the hypothesis that family members with BD have increased PRS for BD compared to healthy relatives, one-sided p-values were calculated for all PRS-based analyses. Since the residuals were not normally distributed, p-values were confirmed using permutation analyses (10,000 permutations) (S2 Table).

Exome sequencing and data analysis procedure

The exome sequencing step included a total of 15 BD cases (8 females, 7 males) from the two families. The three remaining BD cases were excluded. Two of these three individuals (individual 11, family 1; and individual 19, family 2) were excluded as they had been diagnosed with MDD in the initial 1997 assessment, and assigned a diagnosis of BD II in the reassessment in 2016, when exome sequencing in individuals with a BD diagnosis had already been performed. Individual 17 of family 2 was excluded on the grounds of being related by marriage only. The initial diagnoses of the BD cases, as well as the respective subtypes (BD I, II, or NOS), remained stable, and were confirmed during the second assessment. The SureSelect^XT Human All Exon V5 capture library from Agilent Technologies (Santa Clara, California, USA) was used for target enrichment. Exome sequencing was performed on an Illumina HiSeq2500 v4 system (San Diego, California, USA) using 2 x 125 bp paired-end sequencing. A brief technical report is provided in Table 1. The generated exome sequencing data are available at the European Genome-phenome Archive (EGA; https://ega-archive.org) under accession number EGAS00001003085.

Download:

Table 1. Exome sequencing: Technical report summary.

https://doi.org/10.1371/journal.pone.0205895.t001

For the analysis of pedigree data, the Varbank pipeline of the Cologne Center of Genomics was used (https://varbank.ccg.uni-koeln.de). This pipeline provides a user-friendly interface and incorporates a set of standard filters and annotations. The first filter was set to detect heterozygous variants with an allele read frequency of 25–75% relative to the reference. Variants with a minimum coverage of 10x and a quality score of 10 were included. The Varbank workflow includes the Genome Analysis Toolkit (GATK), and the GATK Best Practices workflow was applied. To decrease the number of specific pipeline artifacts, the default settings of the Varbank in-house database filter (comprising: epilepsy data, n = 511; and mixed case and control structural data, n = 611) were applied. The analysis of exome sequencing data focused on single-nucleotide variants (SNVs), as well as small insertions and deletions (InDels). The analysis was restricted to nonsynonymous variants with an effect on primary protein structure or a strong splice site effect. Focus was placed on variants that were shared by all BD cases within the respective pedigree.

To identify rare alleles, filtering for variants with a minor allele frequency (MAF) of < 0.1% was performed using the data of the Exome Aggregation Consortium (ExAC; http://exac.broadinstitute.org, version 0.3, 2015). For each variant, information concerning in silico predicted effects on protein structure was obtained from the database for nonsynonymous SNPs’ functional predictions (dbNSFP). This database provides functional prediction scores from various algorithms, and annotations of nonsynonymous SNVs [36, 37]. In accordance with Purcell et al., the analyses involved the commonly applied predictions of SIFT; Polyphen-2 (HumDiv- and HumVar-trained models); LRT; and MutationTaster [38]. Only variants that were predicted to have a potentially damaging (damaging/probably damaging/possibly damaging/deleterious/disease-causing) impact on protein structure by at least four of the five algorithms were included in subsequent analyses. Predictions for InDels were obtained from the tools MutationTaster and SIFT/Provean [39, 40]. InDels predicted to have a possible deleterious effect by at least one of the three tools were included in the subsequent analyses.

To verify the exome sequencing results, Sanger sequencing was performed using standard polymerase chain reaction conditions and a 3130xl Genetic Analyzer (Applied Biosystems, Foster City, California, USA). Primer sequences are available upon request.

To estimate possible population effects, Sanger sequencing of the identified variants was performed in all individuals with available DNA from the two unrelated Cuban families, including individuals related by marriage only. These sequencing data were also used to follow up the segregation of variants in each family.

Brain-expression

To determine whether the identified candidate genes are expressed in the human brain, the Genotype-Tissue Expression (GTEx) Project portal (http://www.gtexportal.org/, on 05/11/2017) was accessed. Candidate genes with a Reads Per Kilobase per Million mapped reads (RPKM) of ≥ 0.1 in hippocampal tissue were classified as showing brain-expression.

Non-parametric linkage analysis

The SNP array data were used to perform non-parametric linkage (NPL) analyses. NPL analysis is the method of choice for pedigree studies of complex genetic traits, since it is robust to uncertainty concerning the mode of inheritance [41]. NPL investigates whether affected individuals share alleles identity-by-descent more often than would be expected under a random segregation [41]. First, a QC of the SNP array data was performed using PLINK 1.9 (https://www.cog-genomics.org/plink2) [42]. SNPs were filtered for a call rate of at least 95% and an MAF of ≥ 5%. After linkage disequilibrium pruning with a window size of 500 kb and an r² threshold of 0.01, a set of 9690 SNPs remained. These were used for subsequent NPL analyses of pedigree 1 and subpedigrees 2 and 3. In addition, we also performed NPL analyses for the 17 rare candidate variants identified in the exome sequencing step. The two NPL Z-score statistics NPL_all and NPL_pairs were computed using the MERLIN software version 1.1.2 [43, 44]. Both the standard linear model and the exponential likelihood model [45] implemented in MERLIN were applied to generate model-free LOD scores. Here, a strict phenotype definition was applied, whereby subjects with a diagnosis of BD I, BD II or BD NOS were classified as “affected”; healthy individuals were classified as “unaffected” and individuals with other psychiatric phenotypes were classified as “unknown”.

Population substructure analysis

For the population substructure analysis, post-QC pre-imputation SNP array data were used. The analysis was carried out using PLINK 1.9 [42] and R. First, genetic data from the 1000 Genomes Project Phase 3 reference panel [29, 30] were converted to the PLINK binary format and merged with the genotype data of the Cuban pedigrees. Second, the combined dataset was subjected to the following three variant filtering steps: (i) removal of variants with an MAF of < 0.05 or a HWE p-value of < 10^-3; (ii) removal of variants mapping to the extended MHC region (chromosome 6, 25–35 Mb) or to a typical inversion site on chromosome 8 (7–13 Mb); and (iii) LD pruning (command—indep-pairwise 200 100 0.2). Next, the pairwise identity-by-state matrix of all individuals was calculated using the command—genome on the filtered genotype data. A multidimensional scaling (MDS) analysis was then performed on the identity-by-state matrix using the eigendecomposition-based algorithm in PLINK.

Populations from 1000 Genomes were selected for the analysis using backward elimination, starting with all populations and retaining only the ones to which Cuban family members showed close clustering. The final population selection was: African ancestry in Barbados and the USA (ACB, ASW); Puerto Rican (PUR); Spanish (IBS); and Tuscan Italian (TSI). In an MDS analysis, high relatedness between family members leads to artifacts. To avoid such artifacts, only one representative subject of each highly related cluster was included in the analysis. Therefore, data on only 19 of the 31 family members are displayed (S1 Fig).

Results

Relatedness and polygenic risk scores based on SNP genotype data

SNP array genotyping data confirmed the pedigree relationships depicted in Fig 1.

After correction for multiple testing, PRS based on BD-associated genetic variants were not significantly higher in BD cases than in family members with no lifetime history of psychiatric illness (S2 Fig). When PRS calculation was restricted to BD variants with genome-wide significance (p-value threshold 5×10⁻⁸), the PRS of BD cases from the two Cuban families showed a nominally significant increase compared to the PRS of family members with no psychiatric disorder or a non-BD psychiatric phenotype (p = 0.027, S2B Fig). Since affected family members showed no significant increase in the burden of common variants, an analysis of rare variants in these patients was considered a promising approach.

Pedigree information and technical summary

To identify rare variants with a potential contribution to BD susceptibility, a total of 15 individuals with BD I, BD II, or BD NOS were selected from the two Cuban families for whole exome sequencing. These comprised four subjects from family 1 (Fig 1A); and 11 subjects from family 2 (Fig 1B). Mean coverage ranged from 75x to 116x. On average, 91.5% of the target regions had a minimum coverage of 30x (Table 1). Family 2 was divided into subpedigrees 2, 3, and 4. Filtering and rare variant identification was performed separately for each subpedigree.

Analysis of exome sequencing data and rare variant identification

Application of the Varbank pipeline of the Cologne Center for Genomics (https://varbank.ccg.uni-koeln.de) led to the identification of an average of 1,019 variants with an MAF of < 1% per exome. A total of 20, 30, and 126 variants were shared by BD cases in pedigrees 1, 2, and 3, respectively. To identify variants of potentially higher penetrance, the analysis was restricted to very rare variants with an MAF of < 0.1%. This reduced the number of shared variants in the aforementioned pedigrees to 9, 14, and 59, respectively. Pedigrees 2 and 3 are unrelated, and gave rise to bilineal pedigree 4. Thus, none of the variants were shared by all BD cases from pedigree 4.

To prioritize the identified variants, filtering was performed for alleles that were potentially functional and predicted to be potentially damaging by at least four of five (SNVs) or one of three (InDels) of the applied prediction tools. This reduced the number of rare variants to 2, 3, and 12 variants in the respective pedigrees. These 17 candidate variants comprised 14 missense variants; and 3 small InDels, which all lead to a frameshift (Table 2).

Download:

Table 2. Rare exome variants identified in affected individuals from the investigated Cuban pedigrees.

https://doi.org/10.1371/journal.pone.0205895.t002

A broader variant list is provided in the S3 Table. This was generated using relaxed filter criteria (MAF ≤ 5% according to ExAC and predicted to be potentially damaging by at least one of the five prediction tools).

Variants identified in pedigree 1 and subpedigrees 2 and 3. Bold formatting indicates alleles with a predicted potentially damaging effect on protein structure or function. Variants are sorted according to chromosomal position. Abbreviations: Chr, chromosome; Position, chromosomal position of respective variant according to hg19/GRCh37; Ref, reference allele; Alt, alternative allele; ExAC MAF, minor allele frequency according to the data of the Exome Aggregation Consortium; AA change, amino acid change; PPh-2, PolyPhen-2; D, damaging/probably damaging/deleterious/disease-causing; P, possibly damaging; B, benign; T, tolerated; N, neutral; NA, not available.

Segregation analysis

In the segregation analysis, testing for all 17 identified rare variants with potentially damaging effects on protein function was performed in all family members with available DNA from both families (Tables 3–5).

Download:

Table 3. Results of the segregation analysis of sequence variants in pedigree 1 as well as brain expression of the implicated genes and data from the NPL analyses.

https://doi.org/10.1371/journal.pone.0205895.t003

Download:

Table 4. Results of the segregation analysis of sequence variants in subpedigrees 2 and 4 as well as brain expression of the implicated genes and data from the NPL analyses.

https://doi.org/10.1371/journal.pone.0205895.t004

Download:

Table 5. Results of the segregation analysis of sequence variants in subpedigrees 3 and 4 as well as brain expression of the implicated genes and data from the NPL analyses.

https://doi.org/10.1371/journal.pone.0205895.t005

In pedigree 1 (Table 3), one rare variant located in the gene TMEM220 segregated with all five BD cases across four generations (Fig 1A; individuals 4, 5, 7, 9, and 11). This variant was also identified in two healthy relatives (Fig 1A; individuals 1 and 3), and one relative with alcohol abuse who had two children with a history of MDD (Fig 1A, individual 10).

The second candidate variant (rs141075266) was located in the SERPING1 gene. This variant segregated with BD in the four BD cases who had undergone exome sequencing, but was not present in the BD-affected offspring in the fifth generation (Fig 1A, individual 11). The variant was also identified in a healthy parent in the third generation, who was an obligate carrier (Fig 1A, individual 3).

In pedigree 2 (Table 4), three rare variants were identified in all BD cases: RCCD1; DNAH7 (rs201185180); and ABCA4 (rs61748549). Further evaluation of segregation in pedigree 2 was precluded by the absence of additional individuals with available DNA.

In pedigree 3 (Table 5), none of the candidate variants identified in the exome-sequenced BD cases showed complete penetrance, since they were also detected in at least one of three healthy relatives (Fig 1B; individuals 7, 8, and 20). Interestingly, the variants identified in EPS8L3 (rs148185176), MYH7, CRX (rs139340178), and SELENOO, were also present in two relatives with alcohol abuse (Fig 1B, individuals 6 and 18).

For candidate variants identified in pedigrees 2 and 3, segregation was also investigated in four affected siblings from bilineal pedigree 4. In three of these four siblings, two missense variants in the genes COL3A1 and THYN1 were detected (Fig 1B; individuals 11, 12, 14, and 15). Individual 14 had transmitted the COL3A1 variant to their BD II-affected offspring (Fig 1B; individual 19). On average, the remaining variants were present in 50% of the siblings and their offspring individual 19, in accordance with Mendelian expectation. One exception was the variant located in the gene SELENOO, which was not transmitted to any of the siblings. Tables 3–5 provide both a summary of these analyses, and details of the prioritized variants in the two multigenerational Cuban families.

All candidate variants identified in one pedigree were then sequenced in the other pedigrees and in relatives-by-marriage. None of the variants were found in the other family, or in subjects related by marriage only.