Lessons Learned from Whole Exome Sequencing in Multiplex Families Affected by a Complex Genetic Disorder, Intracranial Aneurysm

Janice L. Farlow; Hai Lin; Laura Sauerbeck; Dongbing Lai; Daniel L. Koller; Elizabeth Pugh; Kurt Hetrick; Hua Ling; Rachel Kleinloog; Pieter van der Vlies; Patrick Deelen; Morris A. Swertz; Bon H. Verweij; Luca Regli; Gabriel J. E. Rinkel; Ynte M. Ruigrok; Kimberly Doheny; Yunlong Liu; Joseph Broderick; Tatiana Foroud; FIA Study Investigators

doi:10.1371/journal.pone.0121104

Abstract

Genetic risk factors for intracranial aneurysm (IA) are not yet fully understood. Genomewide association studies have been successful at identifying common variants; however, the role of rare variation in IA susceptibility has not been fully explored. In this study, we report the use of whole exome sequencing (WES) in seven densely-affected families (45 individuals) recruited as part of the Familial Intracranial Aneurysm study. WES variants were prioritized by functional prediction, frequency, predicted pathogenicity, and segregation within families. Using these criteria, 68 variants in 68 genes were prioritized across the seven families. Of the genes that were expressed in IA tissue, one gene (TMEM132B) was differentially expressed in aneurysmal samples (n=44) as compared to control samples (n=16) (false discovery rate adjusted p-value=0.023). We demonstrate that sequencing of densely affected families permits exploration of the role of rare variants in a relatively common disease such as IA, although there are important study design considerations for applying sequencing to complex disorders. In this study, we explore methods of WES variant prioritization, including the incorporation of unaffected individuals, multipoint linkage analysis, biological pathway information, and transcriptome profiling. Further studies are needed to validate and characterize the set of variants and genes identified in this study.

Figures

Citation: Farlow JL, Lin H, Sauerbeck L, Lai D, Koller DL, Pugh E, et al. (2015) Lessons Learned from Whole Exome Sequencing in Multiplex Families Affected by a Complex Genetic Disorder, Intracranial Aneurysm. PLoS ONE 10(3): e0121104. https://doi.org/10.1371/journal.pone.0121104

Academic Editor: Yun Li, University of North Carolina, UNITED STATES

Received: August 6, 2014; Accepted: February 10, 2015; Published: March 24, 2015

Copyright: © 2015 Farlow et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: The dataset, consisting of called variants, subject phenotypes, and pedigree information for the multiplex IA families can be requested directly from the National Center for Biotechnology Information (NCBI) Database of Genotypes and Phenotypes (dbGaP) (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000636.v1.p1) (accession phs000636). Mapped reads are available on the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) (accession SRX329208-SRX329252).

Funding: National Institutes of Health (www.nih.gov) R01NS39512 (JB); National Institutes of Health R03NS083468 (TF); National Institutes of Health Contract Numbers HHSN268201100011I, HHSN268200782096C, and HHSN268201200008I (KD); National Institutes of Health TL1 TR000162 (JLF); National Institutes of Health T32 GM077229 (JLF); George J. and Elizabeth Wile Neuroscience Research Endowment Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Subarachnoid hemorrhage (SAH) is the most devastating subtype of stroke. Fatality from SAH between 21 days to one month of the hemorrhage ranges from 25–35% in high-income countries to almost 50% in low- to middle-income countries [1]. Up to 80–90% of SAH cases are caused by rupture of intracranial aneurysms (IA), which are present in approximately 3% of the population [2]. Smoking and hypertension are important risk factors, increasing the risk of IA rupture by 3.1 and 2.6 times respectively [3]. The risk of an IA and for IA rupture is also increased among individuals having a first-degree relative with a history of an IA [2, 4, 5]. The location and number of IAs in a given individual also appears to be influenced by a family history [6]. Thus, several lines of evidence suggest that IA is due to both genetic and environmental risk factors. Unfortunately, until more is understood about these risk factors, the severe morbidity and mortality associated with this disease will continue to be a large public health burden.

Several approaches have been employed to identify genes contributing to IA. Initial studies utilized pedigrees having multiple affected members. Analyses in these initial studies detected linkage to several chromosomal regions (1p34.3–36.13 [7, 8], 4q32.2 [9], 6p23 [10],7q11 [7], 7q36.3 [9], 8q12.1 [9], 11q24–25 [10–12], 12q21.33 [9], and 14q23–31 [12]); however, the causative gene was not identified in any of these regions. More recently, genomewide association studies (GWAS) have focused on the role of common variants that might individually have a small effect on disease risk. Analyses have consistently detected association to single nucleotide polymorphisms in CDKN2BAS, also known as ANRIL, on chromosome 9p21.3 [13–15], as well as SOX17 on chromosome 8q12.1 [13–15]. Association has also been reported to EDNRA on chromosome 4q31 [16], CNNM2 on chromosome 10q24 [14], KL/STARD13 on chromosome 13q13 [14], and RBBP8 on chromosome 18q11 [14]. Together, these genes only explain a fraction of the population attributable risk for IA.

Advances in technology, especially in the development of high-throughput sequencing, now make it possible to efficiently search for rare variants having a large effect on disease risk. These rare variants may point to novel genes and pathways that are critical to improve the molecular understanding of IA and methods of predicting those at greatest risk. In the present work, whole exome sequencing (WES) was applied to a unique set of families densely affected with IA to investigate the role of rare genetic variation in disease susceptibility and to demonstrate important study design considerations for WES studies in complex disease.

Materials and Methods

Families Selected for Whole Exome Sequencing

Individuals were recruited as part of the Familial Intracranial Aneurysm (FIA) Study [17]. Study approval was granted by institutional review boards at Indiana University, University of Cincinnati, Mayo Clinic, University of Medicine and Dentistry of New Jersey, Cleveland Clinic, Columbia University, University of Texas Houston, Indianapolis Neurosurgical Group, Goodman Campbell Brain and Spine, University of Western Ontario, University of Maryland, McGill University, University of Montreal, Notre Dame Hospital, Ackland UniServices, The George Institute, Royal Perth Hospital, Sir Charles Gairdner Hospital, Royal Adlaide Hospital, Royal Melbourne Hospital, Alfred Hospital, Royal North Shore Hospital, Westmead Hospital, Royal Prince Alfred Hospital, Northwestern University, University of Ottawa, University of Pittsburgh, Stanford University, University of California San Francisco, University of Southern California, University of Virginia, University of Washington, University of Manitoba, University of Alabama Birmingham, Allegheny General Hospital, Brigham and Women’s Hospital, Massachusetts General Hospital, University of Florida, Johns Hopkins University, University of Michigan, and Washington University in St. Louis. Written consent was obtained from all study participants.

Families were recruited to ensure that DNA could be obtained from at least two living affected relatives and that the family would be informative for linkage analysis. Exclusion criteria included (i) a fusiform-shaped unruptured IA of a major intracranial trunk artery; (ii) an IA that is part of an arteriovenous malformation; (iii) a family or personal history of polycystic kidney disease, Ehlers Danlos syndrome, Marfan’s syndrome, fibromuscular dysplasia, or Moya-Moya disease; or (iv) failure to obtain informed consent from the patient or family members. To identify unruptured IA, magnetic resonance angiography (MRA) was offered to first degree relatives of affected family members who had a higher risk of IA as defined by: 1) 30 years of age or older and 2) either a 10 pack year history of smoking or an average blood pressure of ≥140 mmHg systolic or ≥90 mmHg diastolic.

Only individuals having an IA based on an intra-arterial angiogram, operative report, autopsy, or size ≥7 mm on non-invasive imaging (MRA) were considered “definite” cases (Table 1). Two neurologists independently reviewed each record to determine if a subject met all inclusion and exclusion criteria. In case of disagreement, a third neurologist reviewed the data.

Download:

Table 1. Disease phenotypes.

https://doi.org/10.1371/journal.pone.0121104.t001

Seven families of European American descent with the highest density of affected individuals who also had DNA available were selected for WES [18] (Fig. 1). All affected individuals for which sufficient DNA was available were selected for sequencing. Unaffected individuals were selected only if there was an MRA conducted that confirmed the absence of an IA at 45 years or older and if there was sufficient DNA available. One clinically unaffected individual in family E was assumed to be an obligate carrier and was sequenced with her offspring to allow confirmation of allele transmission. Within the seven families, 45 individuals were chosen for WES.

Download:

Fig 1. Simplified pedigrees for the 7 whole exome sequencing families.

Only sequenced individuals and those needed to preserve generational structure are shown to protect the anonymity of the pedigree. IA = intracranial aneurysm. All affected individuals are definite IA unless noted as a probable IA, possible IA, or aortic abdominal aneurysm (AAA). Criteria for defining definite, probable, and possible IA statuses are outlined in Table 1. All unaffected individuals, with the exception of individual E-9, had an MRA performed that did not show evidence of an IA. Grey indicates an unknown phenotype. An ‘S’ above an individual denotes that the individual was selected for sequencing.

https://doi.org/10.1371/journal.pone.0121104.g001

Whole Exome Sequencing

WES was performed at the Center for Inherited Disease Research (CIDR, Johns Hopkins University). Exonic sequences were captured using the Agilent SureSelect Human All Exon 50Mb kit, and paired-end sequencing was performed on the Illumina HiSeq 2000 system, using Flowcell version 3 and TruSeq Cluster Kit version 3. All samples were genotyped using the Illumina HumanOmniExpress-12v1_C platform for quality assurance. Two HapMap samples and two study duplicates were used to ensure library preparation batch quality.

Whole Exome Sequencing Bioinformatics

Primary analysis was done using HiSeq Controls Software and Runtime Analysis Software. The CIDRSeqSuite pipeline was used for secondary bioinformatics analysis, which consists mainly of alignment using Burrows Wheeler Aligner (BWA version 0.5.9) [19] to the human genome reference sequence (build hg19) and applying the Genome Analysis Toolkit (GATK version 1.0.4705) [20] to perform local realignment and base quality score recalibration. Duplicate molecules were flagged and mate-pair information synchronized using Picard (version 1.52, http://picard.sourceforge.net/). The GATK Unified Genotyper (GATK version 1.2–29) was used for multi-sample variant calling. The dataset, consisting of called variants, subject phenotypes, and pedigree information for the multiplex IA families can be requested directly from the National Center for Biotechnology Information (NCBI) Database of Genotypes and Phenotypes (dbGaP) (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000636.v1.p1) (accession phs000636). Mapped reads are available on the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) (accession SRX329208-SRX329252).

GATK Variant Quality Score Recalibration (VQSR, GATK version 1.2–38) [21] created a high-quality call set for SNVs by using an adaptive error model to estimate the likelihood of true genotype calls based on aggregating information across multiple quality metrics. As recommended by GATK, HapMap 3.3 and the Illumina Omni 2.5M chip sites, available from the GATK bundle 1.2, were used as training sets and the annotations of Quality by Depth, Haplotype Score, Mapping Quality Rank Sum, Read Position Rank Sum, Fisher Strand Bias Test, and Mapping Quality were used as quality metrics for the recalibration. SNVs were filtered until 99% of the overlapping HapMap 3.3 sites were retained after application of VQSR. Insertion/deletions were removed if they had a quality by depth < 2.0, ReadPosRankSum < -20.0 (Z-score from Wilcoxon rank sum test of alternative versus reference read position bias), Fisher’s Strand Bias > 200.0 (phred-scaled p-value using Fisher’s exact test to detect strand bias), and/or a homopolymer run > 5.

ANNOVAR [22] was used to annotate variants for location, predicted effect on the protein across three gene databases (RefSeq, UCSC, and Ensembl), and corresponding gene and transcript length. Allele frequencies within European American populations in 1000 Genomes (February 2012 release, http://www.1000genomes.org) [23] and the Exome Sequencing Project (ESP) (ESP6500 release with insertion/deletions and chromosome X and Y calls, http://evs.gs.washington.edu/EVS/) [24] were recorded using custom scripts. The scripts mapped variants to 1000 Genomes and ESP based on chromosomal position and reference and alternate alleles to determine allele frequencies. If a variant was not found in 1000 Genomes or ESP, the alternate allele frequency was set to 0. If a variant was found in both 1000 Genomes and ESP, the smaller alternate allele frequency was taken as the consensus frequency.

Variants were annotated for binned minor allele frequencies from 290 samples without a known cardiovascular phenotype that were exome sequenced at CIDR using identical capturing and sequencing technology, although SAMtools [25] was used for variant calling instead of GATK Unified Genotyper. Variants that were monomorphic across all samples were also flagged.

Variants were also annotated using custom scripts for Gene Ontology (GO) (http://www.geneontology.org) [26] terms that were hypothesized to play a role in IA pathophysiology. GO terms used included GO:0001944 (vasculature development), GO:0001570 (vasculogenesis), GO:0003018 (vascular process in circulatory system), GO:0005581 (collagen), GO:0005604 (basement membrane), and GO:0051541 (elastin metabolic process).

Two programs were used to predict the pathogenicity of SNVs: SIFT [27] and PolyPhen-2 [28]. Scores of damaging for SIFT, or possibly or probably damaging for PolyPhen-2, were accepted as evidence for pathogenicity. Two additional programs were used to analyze the effect of insertions and deletions: SIFT-INDEL [29] for those that cause a frameshift, and DDIG-in for those that do not cause a frameshift [30]. Variants were also annotated for C-scores from the Combined Annotation Dependent Depletion (CADD) webserver (http://cadd.gs.washington.edu) [31]. C-scores of 10 or greater, corresponding to the 10% most deleterious substitutions in the human genome according to CADD, were considered damaging predictions.

Biological filtering retained loci if they: 1) were autosomal variants; 2) were predicted to be nonsynonymous SNVs or insertion/deletions in an exonic and/or splicing region (within 2 bp of a splicing junction, as annotated by ANNOVAR) based on RefSeq, UCSC, and Ensembl annotations; 3) had an allele frequency in European American populations <1% (1000 Genomes, ESP); 4) had an allele frequency less than 1% in CIDR binned minor allele frequencies and were not monomorphic across all samples; 5) were predicted most likely to be damaging by CADD and by at least one other protein prediction program; and 6) segregated with all individuals with a definite IA and obligate carriers in at least one family. All alignments for variants passing these biological filters were visually inspected using the Integrated Genomics Viewer [32] to confirm presence of a variant. Visual inspection for each variant included reviewing read pair orientations, mappability, and soft-clipping; variants that were called nearby; overall depth of sequencing and genomic features that might have inhibited coverage at that locus; and repetitive sequence that might have influenced the position or variant allele called for the locus. In addition to the filters described above, insertion/deletions were also compared against a different dataset consisting of approximately 500 samples without a known cardiovascular phenotype. This comparison dataset used GATK Unified Genotyper (version 2.3–9) for variant calling and the Agilent SureSelect Human All Exon 51Mb capture kit. If the allele designations and/or positions did not match between the two datasets but were within 10 bp, manual review of both the IA and comparison BAM files with IGV was done to reconcile differences in allele designations and position assignments between the two datasets.

Loci were also annotated if they: A) segregated with all aneurysms (including probable and possible IA and the one abdominal aortic aneurysm case in family G) and B) were not found in any sequenced unaffected individuals, excluding assumed obligate carriers.

Linkage

The 7 families were included as part of a larger linkage study of 2,317 individuals from 394 families using the 6K Illumina array [9]. Multipoint parametric linkage analysis (autosomal dominant inheritance, 1% disease allele frequency was performed using Merlin [33]. Only genotypic data from family members with definite IA and obligate carriers were included in the linkage analysis. WES variants were annotated for the highest LOD score obtained by linkage markers within a 10Mb window centered on the sequencing variant. A maximum possible LOD score for each family was calculated by simulating a hypothetical fully informative marker using the aforementioned model parameters and the pedigrees for each family.

Tissue Collection for RNA Expression

Aneurysm biopsies from the aneurysm fundus distal to the clip were collected from patients undergoing neurosurgical clipping of an IA at the Department of Neurology and Neurosurgery in the University Medical Center Utrecht in the Netherlands. These patients were completely independent of the families included for WES. Patients undergoing surgery because of intractable epilepsy were included as controls, and part of a superficial cortical artery in the resected part of the brain was excised as control vessel tissue. Samples were collected from 44 aneurysm biopsies (22 ruptured, 21 unruptured, 1 with unknown rupture status) and 16 control biopsies. All samples were immediately snap frozen in liquid nitrogen less than 1 minute after excision and stored at -80°C until further use.

RNA Isolation, Sample Preparation, and Sequencing

RNA isolation, sample preparation, and sequencing was conducted at the University Medical Center Groningen in Groningen, the Netherlands. Each sample was homogenized with zirconia/silica beads in the BeadBeater machine (BioSpec products, Inc.). After homogenization, total RNA was extracted and purified using an RNeasy microkit (Qiagen, Valencia, CA, USA) according to the manufacturer’s instructions. An initial quality check of the samples by capillary electrophoresis and RNA quantification for each sample was performed using the LabChip GX (PerkinElmer, Waltham, Massachusetts, USA). Samples with a minimum amount of 7 ng non-degraded RNA were selected for subsequent sequencing analysis. Sequence libraries were generated using the TruSeq RNA sample preparation kit from Illumina (San Diego, USA) using the Sciclone NGS Liquid Handler (Perkin Elmer). To remove contamination of adapter-duplexes, an extra purification of the libraries was performed with the automated agarose gel separation system Labchip XT (Perkin Elmer). The obtained cDNA fragment libraries were sequenced on an Illumina HiSeq2000 using default parameters (single read 1x100bp) in pools of 10 or 11 samples. Processing of the raw data, including a demultiplexing step, was performed using Casava software (Illumina) with standard settings.

Differential Expression Analysis

Sequencing reads with quality score under Phred Score <30 were discarded. The quality filtered trimmed fastQ files were then aligned to the human reference genome (hg19) using the STAR aligner [34], allowing for 2 mismatches. SAMtools version 0.1.18 [25] was used to sort the aligned reads. Gene level quantification was performed by HTSeq-0.5.4 [35] using parameters—mode = union—stranded = no and Ensembl version 71 as the gene annotation database.

R version 3.1.0 was used for differential expression analysis. The counts per gene for each sample obtained after alignment were used as input for the analysis. Low count genes (genes with less than 1 read per million in n of the samples, where n is the size of the smallest group of replicates, i.e. n = 16) were filtered out since there is little power to detect significant evidence of differential expression in these genes [36].

The Bioconductor (version 2.14) packages edgeR (version 3.6.2) and limma (version 3.20.2) were used for subsequent steps. To correct for technical influences, edgeR adjusts for varying sequencing depths between samples and normalizes for the RNA composition of the sample. A generalized linear model was used to test for differential expression between aneurysmal and control tissue. Other factors included in the model were age and sex of patients, as well as rupture status. Common and tagwise dispersion estimates were calculated with the Cox-Reid profile adjusted likelihood method to be able to correct for the technical and biological variation when fitting the multivariate negative binomial model. In estimating the tagwise dispersion, the program default for degrees of freedom (df = 10) was used. A negative binomial generalized log-linear model, using the tagwise dispersion estimates, was fitted to the read counts for each gene, and a gene-wise statistical test was performed. Then, a likelihood ratio test was performed. Benjamini Hochberg false discovery rates (FDR) for a transcriptome-wide experiment were calculated to correct for multiple testing. All genes with an FDR adjusted p-value <0.05 were considered individual genes of interest.

Results

Sequencing Data Quality

The average study duplicate reproducibility of SNV and insertion/deletion calls were 99.13% and 94.42%, respectively, and genotypes for non-reference calls per sample from the WES data achieved an average 99.57% concordance with genotype calls from the Illumina HumanOmniExpress-12v1_C array. The average sensitivity to heterozygote calls on the array was 98.13%. After application of GATK quality filters, 98,351 SNVs and 5,851 insertion/deletions were retained. The transition-transversion ratio for exonic variants and percent of SNVs in dbSNP 137, both measures of the quality of the data, were 3.3 and 94.79% respectively.

Biological Filtering

The number of variants retained after each biological filter employed in the Methods is shown in Tables 2–3 for SNVs and insertion/deletions, respectively. The list of SNVs and insertion/deletions satisfying biological filters 1–6 is shown in Table 4. The final candidate variants passing biological filters 1–6 and manual inspection include 67 SNVs and 1 deletion. The sets of variants that A) segregate with all aneurysmal phenotypes and B) are not carried in unaffected individuals are included in Table 4 as subsets of the 68 final variants. The limitations of only considering these sets of variants are described in the Discussion.

Download:

Table 2. Single nucleotide variant filtering pipeline.

https://doi.org/10.1371/journal.pone.0121104.t002

Download:

Table 3. Insertion deletion filtering pipeline.

https://doi.org/10.1371/journal.pone.0121104.t003

Download:

Table 4. Candidate variants identified through whole exome sequencing in 7 multiplex families.

https://doi.org/10.1371/journal.pone.0121104.t004

Of the 68 retained variants, five variants (found in the genes GSTCD, DUSP16, LMBR1L, HAL, and TSC2) were found in definite IAs in two families; in all of these cases, the variant segregated fully with definite IA in only one family. Two other variants (found in the genes COL17A1 and FOXM1) were the only variants of the 68 retained variants that were labeled with vascular-related Gene Ontology annotations (i.e. GO:0005604 basement membrane and GO:0005581 collagen; and GO:0001570 vasculature development and GO:0001570 vasculogenesis; respectively).

Linkage

The distribution of genome-wide LOD scores for each family is depicted in Figs. S1, S2, S3, S4, S5, S6, and S7, with the WES variants satisfying biological filters 1–6 superimposed. The maximum possible LOD score for each family given the model parameters and the specific pedigree structure is also reportedThe highest LOD score obtained by linkage markers within a 10Mb window centered on each sequencing variant is recorded in Table 4. Of the 68 WES variants satisfying biological filters 1–6 and manual inspection, 23 variants had a LOD score for a linkage marker within 10Mb of the sequencing variant that fell within 0.01 of the highest possible LOD score for that family.

The 23 variants within a possible linkage peak were distributed among all families except family F, where the highest LOD score for a linkage marker within 10Mb of a filtered sequencing variant was 0.83 but the highest possible LOD score for the family was 1.12. Family B had the most retained variants within possible linkage peaks (n = 9); followed by family D (n = 4); families A, E, and G (n = 3); and family C (n = 1). Of the 23 variants, only 8 also met the optional prioritization criteria of segregating with all aneurysmal phenotypes and not being carried by an unaffected individual (KLF11 variant in family A, variants in ABCC3 and TANC2 in family B, variants in ALMS1 and ARHGEF17 in family D, and variants in SMEK2, HTRA2, and NDST1 in family E).

While none of the 68 variants coincided with well-established GWAS association signals, 6 of the variants were found within IA linkage peaks identified in previously published family studies, independent of the families in this report. Four variants (found in the genes C1orf38, PTAFR, ZNF362, and MAP7D1) fell within the linkage peak 1p34.3–36.13 [7, 8], while 2 variants (found in the genes ROBO3 and FOXRED1) were located in the linkage peak 11q24–25 [10–12]. None of these 6 genes were suggested as candidate genes by the authors of the published linkage studies. The linkage regions each cover hundreds of genes, as they span approximately 24 and 14 Mb respectively.

RNA Expression

Expression data was obtained in 51 of the 68 candidate genes in an independent set of IA cases and controls. Log fold changes and FDR-adjusted p-values for each gene is displayed in Table 4. Only 1 gene (TMEM132B) of the 51 genes showed differential expression (overexpressed with log fold change = 2.63, FDR-adjusted p-value = 0.023).

Discussion

TMEM132B

Exome sequencing presents an opportunity to explore the contribution of rare variation to complex disorders like IA. We have used this approach to identify 68 rare variants in 68 genes that segregate within 7 densely affected families. Of the 51 genes that were expressed in IA tissue, one gene (TMEM132B) was found to be significantly overexpressed in IA tissue in comparison to control vascular tissue.

TMEM132B, or transmembrane protein 132B, is a relatively uncharacterized protein of unknown function. The variant segregating in the family is rare (0.024% frequency in European American samples in the Exome Sequencing Project and not found in 1000 Genomes) and predicted damaging by SIFT, PolyPhen-2, and CADD due to a change from the polar amino acid serine to the nonpolar amino acid leucine at a highly conserved position. Each of the individuals with a definite IA in family D was heterozygous for the variant. Mutations inherited in a dominant manner often lead to a disease phenotype through a gain of function mechanism, which would be supported by the overexpression of TMEM132B in IA tissue as compared to control vessels. It is also possible, however, that dominantly-inherited mutations exert their effect via haploinsufficiency or dominant negative mechanisms. Further studies are required to confirm the role of TMEM132B in IA and through what mechanism the variant identified in this study may act.

The TMEM132B variant was not inherited by individual 11 in family D. Individual 11 was diagnosed as a possible IA due to the presence of a small aneurysm identified through non-invasive imaging (i.e. 1–2mm, verified by 3 independent neurologists). This is in contrast to the definite IAs clearly identified in this individual’s sibling and cousins; thus, individual 11 is most likely actually unaffected.

Prioritization of Variants within Families

Expression information was only available for 51 of the 68 candidate genes; thus, RNA expression cannot confirm or rule out the role of the remaining 17 genes in IA pathophysiology. Additionally, a subset of the other 50 variants with expression data may also contribute to IA in ways not captured by the RNA expression experiment and should be explored. In order to further study the cause of IA in each of the remaining families, candidate variants in each family must be prioritized.

In families C and E, segregation analysis reduced the number of prioritized variants to only 2 and 4 variants, respectively. For family C, the two variants have a CADD score >20. The variant in SEC16B is not found within a potential linkage peak; however, in support of its potential significance in disease susceptibility, it is not carried by any tested unaffected family member. The variant in C11orf65, on the other hand, is found within a potential linkage peak but is also inherited by an unaffected family member. For family E, three variants segregate in the family (and a fourth variant in GSTCD is found in only one individual in family E but segregates fully in family G). The three variants that segregate in the family (in genes SMEK2, HTRA2, and NDST1) all are found within potential linkage peaks. Data are not available from any unaffected family members. Therefore, further prioritization among these three variants could incorporate CADD scores, which range from 11.98 for the HTRA2 variant to 23.9 for the SMEK2 variant.

Considerations for Pedigree and Phenotypic Data

It is possible that genetic heterogeneity, phenocopies, or gene-environment interactions could explain one or more IAs in the families chosen for this study. In this case, the criterion requiring all affected individuals to share a variant would miss important disease-contributing variants. Similar family-based sequencing studies in the future could relax this segregation criterion with the recognition that a much larger number of variants will be retained. Family-based aggregative association tests that incorporate different penetrance models could also be employed with a larger number of samples.

The availability and quality of clinical data is also critical to consider in complex disease WES studies. In this study, several families also had individuals with probable and possible IAs (see Table 1 for phenotype definitions), and one family also had an occurrence of an abdominal aortic aneurysm (Fig. 1). Given the high density of definite IAs in these families, it is likely that some or all of the probable and possible IAs have disease due to the same disease-contributing variant. Additionally, given the possible genetic link between different forms of aneurysms [37], the abdominal aortic aneurysm may also share the same genetic etiology within that family. We thus flagged variants that segregated fully among all individuals with an aneurysm (definite, probable, or possible IA, or an abdominal aortic aneurysm) (Tables 2–3). This represents a possible method for prioritizing variants for further study, with the caveat that including non-definite IAs increases the likelihood of genetic heterogeneity, phenocopies, and gene-environment interactions.

Another approach to prioritize variants for further study is to utilize genotypic data from unaffected individuals. The ability of this approach to rapidly narrow down the number of variants under consideration is readily apparent from this study (Tables 2–3), but there are major concerns about inflating false negative rates by using unaffected individuals. Given the traditionally late age of onset for intracranial aneurysms, only individuals who had an MRA confirming the absence of IA at age 45 or older were sequenced as unaffected samples. Despite these precautions, the unaffected individuals in this study were still relatively close in age to the age at diagnosis of their relatives who had an IA, and it is possible that the unaffected individuals will actually develop an IA later in life due to a genetic predisposition.

The difficulty in defining an unaffected also surfaces when considering the putative obligate carriers in these families. In Family A, individual A-7 also had an MRA done at age 64 that excluded the presence of an IA, yet we would posit that this individual likely passed a causative genetic variant to her daughter (A-10), whose IA is more likely to have a genetic basis due to her young age of onset. Without the daughter’s data, individual A-7 would have likely been chosen as an unaffected individual for sequencing, especially given that she had major environmental risk factors (a history of smoking and hypertension). In Family E, the sequenced individual E-9 is also an obligate carrier under our model. Unlike individual A-7, an MRA could not be obtained on individual E-9, and she did not have a history of smoking or hypertension. Since all affected individuals in family E had at least one environmental risk factor and individual E-9 did not, it is possible that the causative genetic variant in family E requires an additional environmental insult to lead to IA development. The importance of strong environmental risk factors such as smoking to the development of aneurysms, even in the context of rare causal genetic variants, cannot be underestimated. Alternative methods of prioritization of variants that incorporate this possibility should be explored. Thus, unaffected status in this study was used as a mechanism for possible prioritization but not for automatic exclusion of variants. The ability to use unaffected individuals will vary in studies of different diseases and will likely be more fruitful in those diseases that appear to have a smaller environmental/lifestyle contribution.

For future family-based sequencing studies in complex disease, it may not be feasible to sequence as many individuals per pedigree as was done for this study. Thus, it is critical to carefully select samples based on the quality of phenotyping and the pedigree structure. Recently developed tools offer statistical methods to select related subjects for sequencing based on genetic distance [38], samples that span multiple generations (Exome Picks, http://genome.sph.umich.edu/wiki/ExomePicks), and a combination of both of these methods [39]. As evident from Tables 2–3, selecting families with more closely related individuals, such as families with full siblings as in Families F and G, will yield a smaller number of initially called variants across the family. Yet, the power to narrow down the number of variants segregating with disease is diminished in such families due to the naturally larger percentage of alleles shared, as compared to families with individuals in multiple generations such as in Family C. Thus, where possible, selection of more distantly related family members for sequencing studies will have greater power to generate a narrowed list of prioritized variants.

For some families, it may be possible to combine linkage and sequencing data to find causative variants. The families sequenced in this study were included as a part of a larger linkage study reported previously [9]. The same model parameters used for WES variant filtering was applied for multipoint linkage analysis. Since any given marker may have been uninformative for a family, a maximum LOD score was reported within a 10Mb window of the sequence variant’s chromosomal position. Although only modest evidence of linkage was obtained, several sequencing variants lay within the linkage regions in these families (S1, S2, S3, S4, S5, S6, and S7 Figs.). Many variants, however, did not overlap with any evidence of linkage, suggesting that these families were either not fully informative for robust linkage analysis near these loci, or the sequencing variants identified are not causative genetic variants in these families.

Considerations for Exonic Variation

In recent years, WES has emerged as a practical method for systemically exploring rare coding variation. Since the majority of known genetic causes of Mendelian disorders affect protein coding regions [40], the exome is a logical starting place to identify potentially causative variants in diseases that exhibit Mendelian inheritance. The densely-affected families sequenced in this study appear to display autosomal dominant inheritance; therefore, we hypothesized that coding variants may explain most or even all of these cases.

Due to imperfect capture and alignment, WES generates some off-target, non-exonic variant calls. While it is possible that important variation exists in these off-target regions, a higher percentage of calls in these regions are of poorer quality. Thus, only those variants within exonic or splicing regions were retained in this experiment. Since different databases contain different numbers of and boundaries for genes and exons [41], a consensus prediction of gene and exon boundaries was made to determine those variants that fell within exonic or splicing regions. In order to minimize the type I error rate by using functional predictions of the highest confidence, the intersection of functional predictions from three different databases (RefSeq, UCSC, and Ensembl) was used for this study. Thus, variants were only retained if they were predicted by all three databases to be within exonic or splicing regions. Other WES studies may choose to generate a larger set of variants by prioritizing all variants in the union rather than the intersection of functional predictions from multiple databases; however, appropriate methods for validating variants with functional predictions that differ by database should be employed.

It is possible that non-coding variants and/or epistatic interactions are important in IA development in these families and in other complex diseases, in which case alternate study designs should be utilized. At the time of this study, whole genome sequencing could have only been employed at the expense of sequencing fewer individuals, and annotations and bioinformatics tools available for non-coding sequence were less robust. Given that whole genome sequencing generates about 3 million SNVs per genome [42], annotations and bioinformatics tools are even more critical for practical prioritization of candidate variants. In the future, techniques like whole genome sequencing, as well as targeted resequencing, transcriptome sequencing, and other high throughput study designs, can be applied to fully catalogue the role of genetic variation in IA development.

Considerations for Allele Frequency

The average individual has around 15,000 exonic single nucleotide variants (SNVs) differing from the reference human genome sequence [24]. In order to narrow down the number of variants identified by a WES study, initial studies [43, 44] focused on rare diseases and limited analysis to novel variants. This strategy is too restrictive for more common diseases such as IA. In the particular subset of families used for this study, there is a uniquely high incidence of IA, which enriches for the possibility of identifying rare, highly penetrant variants of larger effect sizes. Rare variants and less common variants are typically defined as less than 1% and 1–5% minor allele frequency, respectively [45, 46]. Given the rarity of families that are as densely affected as the ones in this study, a 1% minor allele frequency threshold was set. It is possible, however, that a variant of higher minor allele frequency causes IA in one or more of these families. Future studies with a much larger sample size could employ aggregative association tests [47] with relaxation of the allele frequency threshold.

In this study, allele frequencies specifically from European American populations were available from public databases. Given that rare variants can be population-specific [48], the selection of appropriate allele frequency databases is critical. In lieu of publicly available allele frequencies, future studies may consider sequencing a large number of internal controls and possibly requesting commonly available controls to sequence as well. While not feasible for the current study, such a design would help control for platform- and pipeline-specific artifacts in sequencing while ensuring phenotyping quality for controls.

While it is standard for WES studies to utilize public databases to filter variants, it is also valuable to use internal frequency databases that are specific to the sequencing and variant calling pipeline. Because variant calling can be lab-specific due to the technology used, in this study variants were annotated for binned minor allele frequencies from 290 unrelated samples without a known cardiovascular phenotype that were exome sequenced at CIDR. Thus variants that would have otherwise been considered rare or novel when compared against public databases, but that were actually a recurring artifact of the sequencing, were captured as having a high CIDR binned minor allele frequency. Given that the bioinformatics pipeline used in this study differed slightly from that of the internal database, the internal database filter may have missed some artifacts specific to the variant calling method. Variants that were monomorphic (i.e. all heterozygous or homozygous for the alternate allele) across all samples were also removed since it is highly unlikely that the identical rare disease-causing allele would be shared by both affected and unaffected individuals in multiple families.

Insertion/deletion allele frequencies in both internal and external databases are inherently less accurate than frequencies for SNVs, due to the increased difficulty and variation in calling structural variants. Also, differences in how position coordinates are assigned as well as reference and alternate allele designations further makes comparison challenging. The 26 insertion/deletions that passed biological filters 1–6 (described in the Methods) in all cases except for one were shared in almost all or all of the 7 families sequenced in this study. Just as variants that were monomorphic across all datasets were removed as probable sequencing or pipeline artifacts, it is very unlikely that any given rare disease-causing insertion/deletions would also be shared across all or almost all families in a complex disease. It is possible that multiple families may carry different disease-causing insertion/deletions in the same gene, but this pattern was not seen. Thus, a second internal frequency comparison set of 500 samples that had a more similar bioinformatics pipeline to the IA samples sequenced in this study (i.e. use of GATK Unified Genotyper for variant calling) was used for manual review in combination with IGV visual inspection for the 26 insertion/deletions remaining after application of biological filters. Manual review as described in the Methods excluded all but one of the 26 insertion/deletions, demonstrating that manual inspection and use of an internal dataset generated by a similar bioinformatics pipeline are critical for reviewing insertion/deletions in sequencing experiments. Future studies may also consider utilizing newer local re-assembly-based methods for variant calling, such as FreeBayes [49] or GATK’s HaplotypeCaller, which may improve the accuracy of insertion/deletion calls.

Consideration for Functional Predictions of Exonic Variation

More severe amino acid substitutions are more likely to present clinically [40], so most WES studies to date have focused on non-synonymous SNVs and insertion/deletions. In this study, we also opted to focus on these variants, as predicted by the intersection of the three gene databases (RefSeq, UCSC, and Ensembl). Future studies focused on exonic variation could also study the effect of synonymous variation, which has been shown to also play an important role in human disease [50]. At the time of this study, fewer validated tools existed to examine the role of synonymous variation in sequencing data.

In this study, several programs were used to measure the level of conservation of a locus and the predicted pathogenicity of a variant. The programs have varying degrees of sensitivity and specificity for different kinds of variants, particularly due to the use of different but not completely independent data sources when generating predictions [51]. The bioinformatics community is working to develop tools that will be able to better integrate information to provide a more informed pathogenicity prediction. One such tool, the CADD program [31], was recently introduced but has not been applied to a large number of datasets. Since there are few published studies implementing CADD, we have conservatively removed only variants with a C-score <10, thus retaining variants that are predicted by CADD to be among the 10% most deleterious substitutions in the human genome.

Considerations for Biological Processes and Pathways

The filtering schema did not employ any assumptions about biological processes or pathways. Variants were annotated for GO terms chosen for possible relation to IA formation; however, only two variants in the final candidate variant list (variants found in the genes COL17A1 and FOXM1) had one or more of these GO annotations. While using GO annotations as a filter is a powerful method for narrowing a list of variants, such an approach would depend on the comprehensiveness of GO annotations, as well as the reliability of investigator-chosen GO terms. To avoid subjectivity in selecting biological processes or pathways, future studies with larger sample sizes should consider employing formal gene set enrichment analysis, which eliminates the need to choose pathways a priori. Even for smaller datasets, use of GO annotations may help determine which gene variants to pursue first in additional experiments to explore possibly causal associations between the variant and disease of interest.

Summary and Future Directions

This is one of the few studies published to date that apply WES in a cohort of well-characterized families densely affected with a common complex disease without an a priori focus on a particular pathway or genomic region. We have laid out many considerations for future WES studies in complex disease, including the use of pedigree and phenotypic data, defining gene and exon boundaries, sources for allele frequency estimates, proper interpretation of in silico functional predictions, the role of environmental factors in the determination of potentially causal rare variants, and the possible utility of combining pathway information with sequencing data.

In this study, 68 rare exonic variants in 68 genes were identified. Of these genes, one gene (TMEM132B) was significantly differentially expressed in IA versus control tissue. Further studies are needed to confirm and explore the TMEM132B variant, as well as the possible contribution of the other 67 variants. Replication and/or meta-analysis with similar sequencing studies using larger sample sizes could be used to gather further evidence for specific genes on this list. Additionally, a subset of these variants, which can be prioritized through any of the methods discussed in this study, could be explored through functional studies in models where vascular phenotypes can be easily observed, such as zebrafish. Targeted gene editing, such as through the CRISPR-Cas system, could help test whether a given variant disrupts the normal functioning of the relevant gene and whether such a disruption leads to a phenotype of interest. Ultimately, such a model should also enable investigation of whether the disrupted phenotype can be rescued by reintroduction of the wild type allele or interference with the variant allele. For comprehensive exploration of the variants identified in this study, multiple methods of experimental validation may be necessary. This study represents a necessary first step in the evaluation of role of rare variants in a common complex disease. Further evaluation in other familial and sporadic samples, as well as multi-ethnic samples, will be essential.