
Comparison of sequencing methods and data processing pipelines for whole genome sequencing and minority single nucleotide variant (mSNV) analysis during an influenza A/H5N8 outbreak

  • Marjolein J. Poen,

    Roles Data curation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing

    Affiliation Erasmus MC, Department of Viroscience, Rotterdam, the Netherlands

  • Anne Pohlmann,

    Roles Data curation, Formal analysis, Writing – original draft, Writing – review & editing

    Affiliation Institute of Diagnostic Virology, Friedrich-Loeffler-Institute, Insel Riems, Germany

  • Clara Amid,

    Roles Resources, Software

    Affiliation European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom

  • Theo M. Bestebroer,

    Roles Methodology

    Affiliation Erasmus MC, Department of Viroscience, Rotterdam, the Netherlands

  • Sharon M. Brookes,

    Roles Writing – review & editing

    Affiliation Animal and Plant Health Agency (APHA) - Weybridge, Addlestone, Surrey, United Kingdom

  • Ian H. Brown,

    Roles Writing – review & editing

    Affiliation Animal and Plant Health Agency (APHA) - Weybridge, Addlestone, Surrey, United Kingdom

  • Helen Everett,

    Roles Writing – review & editing

    Affiliation Animal and Plant Health Agency (APHA) - Weybridge, Addlestone, Surrey, United Kingdom

  • Claudia M. E. Schapendonk,

    Roles Methodology

    Affiliation Erasmus MC, Department of Viroscience, Rotterdam, the Netherlands

  • Rachel D. Scheuer,

    Roles Methodology

    Affiliation Erasmus MC, Department of Viroscience, Rotterdam, the Netherlands

  • Saskia L. Smits,

    Roles Data curation, Methodology

    Affiliation Erasmus MC, Department of Viroscience, Rotterdam, the Netherlands

  • Martin Beer,

    Roles Supervision, Writing – review & editing

    Affiliation Institute of Diagnostic Virology, Friedrich-Loeffler-Institute, Insel Riems, Germany

  • Ron A. M. Fouchier,

    Roles Funding acquisition, Supervision, Writing – review & editing

    Affiliation Erasmus MC, Department of Viroscience, Rotterdam, the Netherlands

  • Richard J. Ellis

    Roles Data curation, Formal analysis, Supervision, Writing – original draft, Writing – review & editing

    richard.ellis@apha.gov.uk

    Affiliation Animal and Plant Health Agency (APHA) - Weybridge, Addlestone, Surrey, United Kingdom

Abstract

As high-throughput sequencing technologies are becoming more widely adopted for analysing pathogens in disease outbreaks, there needs to be assurance that the different sequencing technologies and approaches to data analysis will yield reliable and comparable results. Conversely, understanding where agreement cannot be achieved provides insight into the limitations of these approaches and allows efforts to be focused on areas of the process that need improvement. This manuscript describes the next-generation sequencing of three closely related viruses, each analysed using different sequencing strategies, sequencing instruments and data processing pipelines. To determine the comparability of consensus sequences and minority (sub-consensus) single nucleotide variant (mSNV) identification, the biological samples, the sequence data from three sequencing platforms and the quality-trimmed *.bam alignment files of three influenza A/H5N8 viruses were shared. This analysis demonstrated that variation in the final result could be attributed to all stages in the process, but the most critical were the well-known homopolymer errors introduced by 454 sequencing and the alignment processes in the different data processing pipelines, which affected the consistency of mSNV detection. Homopolymer errors aside, however, there was generally good agreement between the consensus sequences obtained for all combinations of sequencing platforms and data processing pipelines. Minority variant analysis, by contrast, will need careful standardization and an awareness of the possible limitations, as shown in this study.

Introduction

Over the past decade, high-throughput sequencing technologies have evolved, providing faster, cheaper, and less laborious alternatives to traditional Sanger sequencing for obtaining (whole genome) DNA and RNA sequences [1, 2]. The use of next-generation sequencing (NGS) technologies is continuously expanding and has revolutionized the fields of genomics and molecular biology.

In many fields of infectious disease research, nucleotide changes in DNA or RNA sequences are used to monitor genetic adaptations indicative of evolution, the emergence of drug resistance or immune evasion, or as a tool in epidemiological tracing [3]. In clinical settings, sequencing information is used to improve diagnostics and prognosis. NGS technologies play an increasingly important role in these processes, as clinically or epidemiologically important nucleotide changes can be present in only a minority of the DNA or RNA sequences, which might be missed with more traditional (consensus) sequencing methods that determine the most abundant sequence variants in a population. Nucleotide variants that are present in only a minority of the sequenced virus population are referred to as minority single nucleotide variants (mSNVs). These variants, initially arising through replication errors, can become fixed in the population when they confer some evolutionary advantage, for instance mutations related to drug resistance. Furthermore, mSNVs can also be used for high-resolution molecular epidemiology, which is becoming increasingly important for outbreak assessment [4, 5]. Traditional Sanger sequencing, for instance, has been reported to detect minority variants only when they are present in at least 10% of the analysed DNA or RNA strands within a sample [6, 7]. Hence, the use of traditional sequencing methods is usually restricted to obtaining consensus sequences or to determining heterozygosity in diploid organisms. In contrast, NGS technologies are able to detect low-frequency mSNVs in sequence fragments or even whole genomes. Typically, NGS sensitivity for minority sequence variant identification is restricted to a level of variation of 0.1–1%, mainly due to sequencing-related background errors [8–10], but sensitivity can be increased using sophisticated approaches like circle sequencing [11] or improved bioinformatic analysis workflows [10]. The reliability of mSNV analysis using NGS methods is influenced by many factors, such as the quantity and quality of the input sample, the laboratory procedures, the type of sequencing platform, and the software and settings used to analyse the raw sequence data.

Due to these technical improvements, NGS technologies have become more important as diagnostic tools to characterize pathogens in outbreak situations. However, the increasing use of these technologies to address new and important (outbreak-related) research and surveillance questions emphasizes the need to determine the reproducibility of, and the important technical considerations affecting, outcomes obtained by different laboratories following different protocols. Given this, comparative studies focusing on different platforms and data analysis methods are essential to cross-validate different methodologies and determine the reliability of newly obtained data. In addition, there is a growing need (as exemplified by the recent Ebola and Zika virus outbreaks) to also share comprehensive sequencing data as quickly as possible to help with source attribution and the development of control strategies. However, the underlying technologies and methods used for NGS are still diverse, and there is a strong demand for harmonization of laboratory procedures and approaches for a reliable and optimized analysis of the data.

This study is part of the European Union's HORIZON 2020 project "COMPARE" (http://www.compare-europe.eu/), which aims to improve the analytical tools for emerging zoonotic pathogens and the research underpinning them. Here, the comparability of NGS output data obtained from different sequencing approaches was evaluated, and suitable sharing strategies for comprehensive NGS data sets were demonstrated. In November 2014, a newly emerging strain of highly pathogenic avian influenza (HPAI) virus was detected in several European countries [12, 13]. In the United Kingdom [14], Germany [15], and the Netherlands [16–18] this subtype was detected in commercial poultry farms within a few days of one another. In each of those countries, NGS was used to generate whole-genome sequences rapidly after detection, but as the laboratories in each country were working independently, different approaches were used for both sequencing and data analysis, and the data were shared as part of a wider study to determine the likely source of the outbreak [19]. It is important to determine whether the different analytical approaches have any impact on the outcome. Therefore, the aim of this study was to determine how comparable consensus and minority variant results were between laboratories performing their standard analyses, and whether discrepancies could be attributed to the sequencing platform (SP), the data processing pipeline (DPP) or a combination of both. In the absence of a ground truth/gold standard, all datasets obtained were compared with each other. The hypothesis tested in this study is that outputs from NGS analysis of viruses will be comparable irrespective of laboratory, sequencing platform and data analysis platform.

Therefore, virus isolates obtained in each of the three countries (United Kingdom, Germany and the Netherlands) were shared between the three partners and subsequently sequenced and analysed in each of the three laboratories according to local procedures. In addition, the use of a specially designed data sharing platform, a COMPARE "Data Hub" at EMBL-EBI, Hinxton UK, was evaluated. This study presents genome coverage data, consensus sequences, and an analysis of the comparability of mSNV identification across the different SPs and DPPs.

Our hypothesis was confirmed at the consensus sequence level, since consensus sequences could be reproduced independently of the combination of SP and DPP used. However, the identification of minority variants appeared to be poorly reproducible, primarily due to the well-known errors in 454 sequencing and to differences induced by the alignment processes in the different DPPs. The interpretation of minority variant analysis thus requires careful standardization and an awareness of the possible limitations, as shown in this study.

Materials and methods

Experimental design

Three avian influenza A virus isolates that were obtained from three different avian species during the 2014/15 outbreak of HPAI H5N8 virus in Europe were shared among three institutions in the United Kingdom (Animal and Plant Health Agency [APHA]), Germany (Friedrich-Loeffler-Institut [FLI]) and the Netherlands (Erasmus Medical Center [EMC]), hereafter referred to as anonymized institutions I, II and III (Fig 1). All three institutions sequenced all three virus isolates according to their own standard procedures. Adaptors used in the sequencing processes were trimmed off before the raw sequence data files were shared. The sequence data files (*.fastq files), alignment files (*.bam files), sample metadata and experimental metadata were shared between the three laboratories and analysed in their own DPPs, yielding sequence datasets for each virus (Table 1). This approach made it possible to separate the biological features of the viruses from variation introduced by technical methodology. Data sharing was facilitated via a "Data Hub" provided by the EMBL-EBI's European Nucleotide Archive (ENA) in the framework of the COMPARE collaborative project; all data were stored and subsequently published in ENA [20] (https://www.ebi.ac.uk/ena; for the accession numbers, see Table 1). ENA is an open repository for sequence and related data and a member of the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) [21]. A full description of the COMPARE Data Hub system is provided in a preprint version of Amid et al. [22]. First, consensus sequences derived from a preliminary analysis were compared and one overarching consensus sequence was determined for each gene segment for each virus. This custom-made consensus was used by all three institutions as the reference genome for undertaking mSNV analysis. The resulting nine mSNV reports (originating from three whole-genome raw data sets times three DPPs) were combined for all three viruses into one spreadsheet file per virus to check the reproducibility of mSNV identification when using different combinations of SP and DPP (a sketch of this consolidation step is shown below). The experimental design is summarized in Fig 1.
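As a concrete illustration of this consolidation step, the short pandas sketch below combines nine hypothetical per-combination reports into one comparison table keyed by genome position. It is not the authors' actual workflow: the file names and the column names (segment, position, variant_frequency) are assumptions for illustration only.

```python
# Sketch: combine the nine mSNV reports (3 SPs x 3 DPPs) for one virus
# into a single table for cross-checking reproducibility.
# File and column names below are hypothetical.
import pandas as pd

reports = []
for sp in (1, 2, 3):
    for dpp in (1, 2, 3):
        df = pd.read_csv(f"virus_NLCH_SP{sp}_DPP{dpp}.tsv", sep="\t")
        df["SP"], df["DPP"] = sp, dpp  # label each report with its origin
        reports.append(df)

combined = pd.concat(reports, ignore_index=True)
# One row per (segment, position), one column per SP/DPP combination:
matrix = combined.pivot_table(index=["segment", "position"],
                              columns=["SP", "DPP"],
                              values="variant_frequency")
matrix.to_excel("NLCH_mSNV_comparison.xlsx")  # one spreadsheet per virus
```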

Fig 1. Flowchart of the experimental design.

SP: sequence platform; DPP: data processing pipeline.

https://doi.org/10.1371/journal.pone.0229326.g001

Samples

All samples were obtained from outbreaks in commercial poultry holdings. Isolate A/duck/England/36254/2014 was obtained from pooled intestinal material from index case ducks (Anas platyrhynchos domesticus). Tissue homogenate material was inoculated into embryonated chicken eggs and allantoic fluid was harvested at 1 day post-inoculation [14]. The Dutch isolate (A/chicken/Netherlands/EMC-3/2014) was obtained by passaging lung material of a dead commercial layer hen (Gallus gallus domesticus) in MDCK cells twice and harvesting the supernatant approximately 40 hours post-inoculation [23]. The German isolate (A/turkey/Germany/AR2485/2014) originated from lung tissue of a commercially kept turkey (Meleagris gallopavo) and was passaged in embryonated chicken eggs [15] (Table 1).

Sequencing

Institution I: SP1.

RNA was extracted using a Qiagen QIAamp viral RNA mini kit (Qiagen, Germany) according to the manufacturer's instructions, except that carrier RNA was omitted from the AVL lysis buffer and the sample was eluted in 50 μl RNase-free water. RNA was then processed to double-stranded cDNA (cDNA Synthesis System, Roche) using random hexamers and purified using magnetic beads (AmpureXP, Beckman Coulter, USA). The double-stranded cDNA was diluted to 0.2 ng/μl and used to produce a sequencing library using the NexteraXT kit (Illumina, USA). Libraries were then sequenced in paired-end mode on an Illumina MiSeq (Illumina, USA), with run lengths varying from 2 x 75 bases (UKDD virus) to 2 x 150 bases (NLCH and DETU viruses), depending on whether time constraints were imposed to provide a rapid response to an outbreak. Demultiplexing and removal of sequencing adapters were done by the MiSeq RTA software to generate raw fastq files. The SP1 process included a limited 12-cycle PCR enrichment of the library. Post-hoc analysis showed that duplication levels were less than 0.02% of the total reads, which was considered to have negligible impact on the results.

Institution II: SP2.

RNA was extracted using a combined approach with TRIzol (Thermo Fisher Scientific, USA) and an RNeasy Kit (Qiagen, Germany). Further concentration and cleaning were done with Agencourt RNAClean XP magnetic beads (Beckman Coulter, USA). RNA was quantified using a Nanodrop UV spectrometer ND-1000 (Peqlab, Germany) and used as template for cDNA synthesis with a cDNA Synthesis System (Roche, Germany) with random hexamers. Fragmentation of the cDNA to a target size of 300 bp was done with a Covaris M220 ultrasonicator. The sonicated cDNA was used for library preparation with Illumina indices (Illumina, USA) on a SPRI-TE library system (Beckman Coulter, USA) using a SPRIworks Fragment Library Cartridge II (for Roche FLX DNA sequencer; Beckman Coulter, USA) without automatic size selection. Subsequently, upper and lower size exclusion of the library was done with Ampure XP magnetic beads (Beckman Coulter, USA). The libraries were quality checked using High Sensitivity DNA Chips and reagents on a Bioanalyzer 2100 (Agilent Technologies, Germany) and quantified via qPCR with a Kapa Library Quantification Kit (Kapa Biosystems, USA) on a Bio-Rad CFX96 Real-Time System (Bio-Rad Laboratories, USA). In SP2, neither the sample nor the library was amplified. Sequencing was done on an Illumina MiSeq using MiSeq reagent kit v3 (Illumina, USA), resulting in paired-end reads of 300 bases. Demultiplexed and adapter-trimmed reads were used to generate raw fastq files.

Institution III: SP3.

RNA was extracted using the High Pure RNA isolation kit (Roche Diagnostics, Germany) according to the manufacturer's instructions. RNA was converted to cDNA using the SuperScript III Reverse Transcriptase kit (Invitrogen, Thermo Fisher, USA) as described previously [24], and amplified by PCR using primers covering the full viral genome (S1 Table). All 32 PCR fragments, approximately 400–600 nucleotides in length, were sequenced using the 454/Roche GS-FLX sequencing platform. The PCR fragments were pooled in equimolar ratio and purified using the MinElute PCR Purification kit (Qiagen, Germany) according to the manufacturer's instructions. Rapid Library preparation, emulsion PCR and next-generation 454 sequencing were performed according to the instructions of the manufacturer (Roche Diagnostics, Germany). Protocols are described in the following manuals: Rapid Library Preparation Method Manual (Roche; May 2010), emPCR Amplification Method Manual–Lib-L (Roche; May 2010) and Sequencing Method Manual (Roche; May 2010). All three samples were sequenced in one run. Samples were pooled using MID adaptors to determine which sequences came from which sample; each sample was assigned two different MIDs. Demultiplexing and basic trimming were done with CLC-bio software to generate raw fastq files (S1 File).

Data processing

Institution I: DPP1.

In the FluSeqID script (https://github.com/ellisrichardj/FluSeqID) the following steps are run automatically: mapping of raw sequence data to the host genome (BWA v0.7.12-r1039 [25]), extracting reads that do not map to the host (Samtools v1.2 [26]), assembling non-host reads (Velvet v1.2.10 [27]), identification of the closest match for each genome segment (BLAST v2.2.28 [28], using the custom databases generated from the Influenza Research Database as indicated in the GitHub repository), mapping original data to the top reference segments (BWA), calling new consensus sequences (vcf2consensus.pl), performing further iterations of the last two steps to improve the new consensus (IterMap), and finally outputting the genome consensus sequence. The data processing pipeline has built-in defaults for the k-mer and coverage cut-offs for de novo assembly and the e-value cut-off for BLAST, which can be changed via command line options (see https://github.com/ellisrichardj/FluSeqID). Since the aligner used (BWA-MEM) performs soft-clipping and ignores low quality data, quality trimming is unnecessary. For mSNV analysis, the reads were mapped to the unified consensus using BWA. Samtools was used to generate a pileup file, which was then analysed using custom python and R scripts to determine the depth of coverage and basecalls at each position (available at https://github.com/ellisrichardj/MinorVar). The combination of BWA-MEM and samtools has been shown to be accurate for SNV identification [29]. To be included in the final output, a basecall required a minimum quality of 20 and a minimum mapping quality of 50.
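To make this step concrete, the following is a minimal sketch of pileup-based basecall counting with the quality thresholds described above. It is not the authors' MinorVar scripts; it assumes a sorted, indexed BAM of reads mapped to the unified consensus and uses the pysam library.

```python
# Minimal sketch of per-position basecall counting with quality filters,
# in the spirit of DPP1's mSNV step (not the authors' MinorVar scripts).
from collections import Counter

import pysam

MIN_BASE_QUAL = 20  # minimum basecall quality, as in DPP1
MIN_MAP_QUAL = 50   # minimum mapping quality, as in DPP1

def basecalls_per_position(bam_path, ref_name):
    """Yield (position, depth, base counts) for one reference segment."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for column in bam.pileup(ref_name):
            counts = Counter()
            for read in column.pileups:
                if read.is_del or read.is_refskip:
                    continue  # only count actual basecalls
                aln = read.alignment
                if aln.mapping_quality < MIN_MAP_QUAL:
                    continue
                qpos = read.query_position
                if aln.query_qualities[qpos] < MIN_BASE_QUAL:
                    continue
                counts[aln.query_sequence[qpos]] += 1
            depth = sum(counts.values())
            if depth:
                yield column.reference_pos, depth, counts
```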

Institution II: DPP2.

Raw sequence data were analysed and mapped using the Genome Sequencer software suite (v. 3.0; Roche, Mannheim, Germany) and the Geneious software suite (v. 9.0.5; Biomatters, Auckland, New Zealand). Raw reads were trimmed, and subsets of each trimmed dataset were assembled de novo to generate reference sequences for each data set (Newbler Assembler of Genome Sequencer software suite v. 3.0; Roche, Mannheim, Germany). The trimmed raw influenza virus reads were mapped to the reference sequences (Newbler Mapper of Genome Sequencer software suite v. 3.0; Roche, Mannheim, Germany). The output assemblies were imported into the Geneious software suite (v. 9.0.5; Biomatters, Auckland, New Zealand) for further analysis and processing. Regions of low and high coverage (threshold of 2 x standard deviations from the mean) and regions of low quality (minimum quality/phred score 20) were evaluated and, if necessary, excluded from further analyses. Consensus sequences were generated and annotated using annotated reference sequences. Sequences were compared, and annotations that matched with >90% similarity were transferred. This was done on nucleotide sequences and also for translations in all six reading frames. Annotations were manually inspected and curated. Trimmed raw reads of the datasets, or subsets thereof, were mapped to the consensus, the mapping was fine-tuned, and mSNVs were determined using the generic SNP finder of the Geneious software suite, applying a maximum p-value of 10^-5 and a filter for strand bias. The threshold for SNP identification was set at 1%, and variants were checked manually for accuracy.

Institution III: DPP3.

Raw sequence data were analysed and mapped using the CLC Genomics software package, workbench 8 (CLC Bio). Reads obtained by 454 sequencing were sorted by MID adaptor, quality-trimmed, and analysed using the parameters shown in S1 File. In short, after sorting by MID, the sequence reads were trimmed by 30 nucleotides from the 3′ and 5′ ends to remove all primer sequences. Data from the shared Illumina sequence files had already been trimmed and were imported into CLC Bio without additional processing steps (S1 File). Reads were initially aligned to their own reference sequences that were uploaded during the H5N8 outbreak (GISAID accession numbers EPI-ISL-169282 (NLCH), EPI-ISL-167904 (UKDD) and EPI-ISL-169273 (DETU)). Consensus sequences were automatically generated by CLC after alignment to the reference; for detailed settings see S1 File. For the mSNV analysis, the raw data were mapped to the new custom-made consensus sequences per gene segment per sample. Fastq files of these alignments were shared with the other institutions. The threshold for mSNV identification was set at 1%, and registered minority variants were checked manually for accuracy (minimum quality/phred score 20).

Determining the influence of the DPP alignment steps versus the DPPs' mSNV identification methods.

Data processing pipelines process raw data in several steps, roughly divided into trimming, aligning data to a reference sequence, and variant calling (the mSNV identification procedure). To determine to what extent the trimming and subsequent alignment processes contributed to the observed differences, the nucleotide coverage results obtained by the three DPPs when aligning the same SP raw datasets were compared. To study the influence of the mSNV identification process, the quality-trimmed alignment files that had been generated by each DPP and shared as *.bam files were subjected to the mSNV identification process used in DPP3, to determine the differences in mSNV detection output when only the alignment processes differed. DPP3 was randomly picked for this analysis; mSNV detection parameters were set to the institution's default settings for mSNV identification using CLC-bio software and can be found in S1 File.

Data sharing

To test the applicability of real-time sequence data sharing within the COMPARE network, all raw sequence data used in this study were uploaded to and shared via a "Data Hub" in the environment of the European Nucleotide Archive (ENA). Each institution received its own study accession, in which all raw sequence data files and metadata files were assigned individual experimental accession numbers (Table 1). In addition to the sequence data, all trimmed alignment files (*.bam) were uploaded to the ENA. Using these hubs, sharing between institutions was facilitated and immediate access to the data prior to public release was possible, enabling joint evaluation and comparison. All data files have been made publicly available via the ENA (https://www.ebi.ac.uk/ena).

Designing the custom-made consensus sequences

Each institution produced a consensus sequence for the 8 influenza gene segments (PB2, PB1, PA, HA, NP, NA, MP, NS) for each of the three viruses. The obtained consensus sequences were aligned using the BioEdit sequence alignment editor (version 7.2.0) [30]. Raw sequence data from each SP were initially aligned to their own reference sequences that were uploaded during the H5N8 outbreak (GISAID accession numbers EPI-ISL-169282 (NLCH), EPI-ISL-167904 (UKDD) and EPI-ISL-169273 (DETU)).

mSNV analysis comparison

For the mSNV analyses, the custom-made consensus for each virus isolate was used as a reference for mapping, thereby standardizing positions within the genome to make comparison between institutions easier. To avoid unnecessary increases in analytical time and memory, datasets were down-sampled to 100,000 reads per sample when needed. Each DPP produced a report on the identified mSNVs in a tabulated format. The analysis output files were filtered for mSNVs only, thereby ignoring detected nucleotide insertions and deletions (InDels). There is currently a lack of data and evidence-based approaches for calculating the sequence depth (i.e. coverage) required for mSNV analyses. In this study, a minimum coverage threshold for the identification of mSNVs was applied. This minimum nucleotide coverage (i.e. number of reads per nucleotide after trimming) was determined using a basic sample size calculation, n = log β / log p′ [31]. Here β represents the accepted chance of missing the event (e.g. for a 95% chance of detection, β = 0.05), and p′ is 1 minus the proportion of events to be detected. For 95% certainty of detecting a variant present at 1%, a minimum coverage of 298 reads per position is needed. For variants that occur in ≥5% of reads, the number of reads required is >58, and for variants that occur in ≥10% of reads the minimum coverage is >28. The literature commonly uses mSNV cut-off frequencies of ≥10%, ≥5% and ≥1%; however, it should be noted that these cut-off values are arbitrary. Therefore, where depth of coverage was sufficient, this study reports mSNVs detected at a frequency of ≥1%, but initial comparisons started with positions showing mSNVs at frequencies of ≥10% in at least one of the SP/DPP combinations, followed by those with mSNVs of ≥5%–<10%, and lastly those of ≥1%–<5%. For all positions thus identified, the number of reads and the number of variant nucleotides in all other SP/DPP combinations at that position were recorded regardless of frequency.
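The coverage thresholds follow directly from the formula above; a minimal worked example:

```python
# Worked example of the minimum-coverage calculation n = log(beta) / log(p'),
# where beta = accepted chance of missing the variant and
# p' = 1 - variant frequency to be detected (Dell et al. [31]).
import math

def min_coverage(frequency, power=0.95):
    """Minimum reads per position to detect a variant present at
    `frequency` with probability `power`."""
    beta = 1.0 - power
    p_prime = 1.0 - frequency
    return math.ceil(math.log(beta) / math.log(p_prime))

print(min_coverage(0.01))  # 299 -> the study's ">298" threshold for 1% variants
print(min_coverage(0.05))  # 59  -> ">58" for 5% variants
print(min_coverage(0.10))  # 29  -> ">28" for 10% variants
```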

Results

In order to determine the comparability of consensus sequences and mSNV identification, the biological samples, the sequence data from 3 SPs and the quality-trimmed *.bam alignment files of 3 influenza A/H5N8 viruses were shared. All data sets were subsequently analysed in 3 different DPPs. The resulting 9 mSNV reports per virus (3 SP data sets, each analysed in 3 DPPs) were evaluated for comparability of mSNV identification using different combinations of SP and DPP.

Data sharing

Data sharing using the COMPARE "Data Hub" provided by ENA proved to be easy, quick and successful. The "Data Hub" enables File Transfer Protocol (FTP)-protected upload and download of large data files and facilitates sharing between collaborators, with the possibility to evaluate and compare all data prior to their public release by generating and specifically sharing accession numbers using standard ENA procedures. The Data Hub used an influenza virus sample checklist. In addition, data sets are ultimately made publicly available and, through the INSDC network, globally accessible in real-time as required, without further upload to a different repository. Full details of the COMPARE Data Hub system are available in a submitted manuscript [22]. In summary, this process was suitable for quick data sharing in an outbreak scenario.

Designing the custom-made consensus sequences

For each of the 8 gene segments of the 3 viruses separately, 9 initial consensus sequences (3 SPs x 3 DPPs) were generated, resulting in 72 consensus sequences per virus. The custom-made consensus sequence per virus and per gene segment was (1) trimmed to the length represented by all 9 initial consensus sequences, and (2) required each nucleotide to be identical in at least 6 of the 9 consensus sequences to be included. Although some sequences contained insertions or deletions, these could always be corrected using the other SP sequences following the criteria mentioned previously. This resulted in a unique custom-made consensus for each gene segment of all three viruses.
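As an illustration of the 6-of-9 rule, a minimal sketch is given below. It assumes the nine initial consensus sequences have already been aligned and trimmed to a common length; how positions failing the rule were resolved is not specified in the text, so the sketch marks them with "N" as a placeholder assumption.

```python
# Sketch of the custom consensus rule: for each position, keep a
# nucleotide only if it is identical in at least 6 of the 9 initial
# consensus sequences (pre-aligned, equal length).
from collections import Counter

def custom_consensus(aligned_seqs, min_agreement=6):
    assert len(aligned_seqs) == 9, "expects one sequence per SP/DPP combination"
    consensus = []
    for bases in zip(*aligned_seqs):  # iterate column-wise over the alignment
        base, count = Counter(bases).most_common(1)[0]
        # "N" for disagreeing positions is an assumption, not the paper's rule:
        consensus.append(base if count >= min_agreement else "N")
    return "".join(consensus)
```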

Consensus sequences

When insertions and deletions in the homopolymer regions of the 454 data were ignored, the identified consensus sequences for most gene segments were identical regardless of the SP and DPP combination used, with the exception of the differences listed in Table 2. However, the number of insertions and deletions in homopolymer regions of the SP3 sequences was considerable in all 3 viruses. There was no clear difference between the DPPs in the number of insertions and deletions related to homopolymer regions (20, 17 and 18 for DPP 1, 2 and 3, respectively). Nucleotide differences that were not related to homopolymer regions were only observed for sequences obtained from SP3 and SP2 data when processed in DPP1.

Table 2. The differences in consensus sequences obtained from each SP/DPP combination, sorted per virus and per gene segment.

https://doi.org/10.1371/journal.pone.0229326.t002

In summary, the homopolymer errors inherent in the 454 dataset caused problems for all DPPs, as expected. Consensus sequences generated by DPP1 from SP3 (454) data showed some unexpected differences, but DPP1 performed well with the SP1 data formats it was designed for and reasonably well with SP2 data. Overall, the consensus sequences could be reproduced by all DPPs using Illumina data, but the analysis of the 454 data from SP3 was more problematic, as it required editing of the sequences at homopolymer regions. Consensus sequences from this study can be found in S2 Table.

The mSNV analysis comparison

Nucleotide coverage and the influence of DPP-dependent alignment.

The observed number of reads per nucleotide (referred to as nucleotide coverage) differed depending on the SP/DPP combination. All DPPs handled both 454 and Illumina data formats, although some modifications (settings for the bwa mapper to handle single-end 454 data) were required for DPP1, which was specifically designed for Illumina paired-end reads. The observed nucleotide coverages showed near-identical profiles for all three viruses. The coverage results obtained from the three different SPs and DPPs were plotted for the NLCH virus (Fig 2) and for the other two viruses (S1 Fig). In general, lower nucleotide coverage was observed at the termini of each gene segment. The SP3 data showed more variation in nucleotide coverage within gene segments compared to SP1 and SP2 data, due to the sequencing of 32 PCR amplicons. The non-normalised numbers of raw sequence reads and influenza virus reads per virus per SP can be found in S3 Table.

Fig 2. Nucleotide coverage.

The non-normalised nucleotide coverage, displayed as the number of reads per position, for full genome sequences of the NLCH virus mapped to the NLCH reference sequences. Panel A shows the coverage results for the same SP dataset in the three different DPPs (DPP1: purple; DPP2: orange; DPP3: grey) for each of the SP datasets. Panel B shows the coverage when the same DPP is used to analyse data from the three different SPs (SP1: lilac; SP2: yellow; SP3: green) for each of the DPPs. The X-axis represents the position in the genome; the Y-axis represents the number of sequence reads per position.

https://doi.org/10.1371/journal.pone.0229326.g002

The differences in nucleotide coverage were visualized for the three different SP raw datasets analysed with the same DPP (Fig 2A). Overall, SP3 data (green lines) showed a lower coverage compared to SP1 (purple) and SP2 data (yellow). The overall coverage for SP1 and SP2 data was similar with small variations for different viruses and DPPs. The shorter read lengths in SP1 virus data did not appear to have influenced the overall nucleotide coverage substantially.

The differences in nucleotide coverage introduced by the different alignment procedures were also assessed by comparing the coverage results for each SP raw dataset analysed with the three different DPPs (Fig 2B). DPP2 (orange lines) generally retained the highest nucleotide coverage for data from the different SPs. However, DPP3 (grey lines) also generally retained high coverage for SP3 data, for which it was optimized. The nucleotide coverage of SP3 data showed larger variation between the three different DPPs, leading to differences in nucleotide coverage of up to 50% depending on the DPP, because DPP1 and DPP2 were not optimized for this SP. Data from SP2 were handled very similarly by all three DPPs.

In conclusion, both the SP and the DPP influenced the number of reads per nucleotide position. SP3 yielded the lowest number of reads compared to the SP1 and SP2 Illumina data. The influence of the DPP depended strongly on the data input, with the best DPP performance observed for the SP dataset for which it was optimized.
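As an aside, coverage profiles of the kind plotted in Fig 2 can be generated from per-position depth tables. The sketch below assumes depth exported with `samtools depth -a` (chromosome, position, depth per line) and hypothetical file names; it plots positions by row index rather than true per-segment coordinates.

```python
# Sketch: per-position coverage plot for one SP dataset across the three
# DPPs (colours as in Fig 2A). Input files are hypothetical, e.g. produced
# with: samtools depth -a aln.bam > NLCH_SP1_DPP1_depth.tsv
import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots(figsize=(10, 3))
for dpp, colour in ((1, "purple"), (2, "orange"), (3, "grey")):
    depth = pd.read_csv(f"NLCH_SP1_DPP{dpp}_depth.tsv", sep="\t",
                        names=["segment", "position", "reads"])
    # x = concatenated genome position (row index), y = reads per position
    ax.plot(depth.index, depth["reads"], color=colour, label=f"DPP{dpp}")
ax.set_xlabel("Position in the genome")
ax.set_ylabel("Reads per position")
ax.legend()
plt.savefig("NLCH_SP1_coverage.png", dpi=200)
```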

The mSNV identification.

The mSNV identification thresholds were set to ≥1% in all DPPs. Because of the high number of mSNVs identified, the comparison of these mSNVs started with a manually set arbitrary threshold of ≥10%, which was subsequently decreased to ≥5% and ≥1%. An mSNV position was identified when at least one of the SP/DPP combinations showed a variant that exceeded the frequency threshold, and when the coverage at that position exceeded the minimum number of reads needed to detect that variant with 95% probability, as described previously. The presence of mSNVs and the coverage for all SP/DPP combinations were compared for each of the positions at which an mSNV had been detected in at least one of the combinations. The coverages indicated for positions where no mSNVs were detected were derived from the alignment files and were not subjected to possible additional read filtering parameters in the mSNV identification process. The average quality (Q-score/phred score) was required to be 20 or higher.
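The acceptance rule described above can be summarised in a few lines. This is a minimal sketch of the comparison logic, not any institution's pipeline code; the coverage threshold argument corresponds to the minimums derived earlier (e.g. >298 reads for a 1% variant).

```python
# Sketch of the acceptance rule used in the comparison: a position counts
# as an mSNV position when at least one SP/DPP combination reports a
# variant at or above the frequency threshold AND the coverage meets the
# minimum needed to detect that frequency with 95% probability.
def accept_msnv(variant_reads, depth, threshold, min_cov_for_threshold):
    if depth < min_cov_for_threshold:
        return False  # insufficient coverage: position excluded, not "negative"
    return variant_reads / depth >= threshold

# e.g. a 1% variant call needs >298x coverage (see min_coverage above):
print(accept_msnv(variant_reads=12, depth=600,
                  threshold=0.01, min_cov_for_threshold=299))  # True
```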

Ten positions across the three virus genomes were identified with mSNVs occurring in ≥10% of reads. Three of the mSNVs (NLCH:PB2 G1879A, NLCH:PB2 G2101A and DETU:HA T963C) were detected in all SP/DPP combinations but with slightly different relative abundance. The other mSNVs were identified in only one (n = 6) or two (n = 1) of the SP/DPP combinations (Table 3).

Table 3. The minority variants occurring as a ≥5% variant in at least one of the sequence platform/data processing pipeline combinations.

https://doi.org/10.1371/journal.pone.0229326.t003

Thirty-seven positions were identified with mSNVs occurring in ≥5% of reads. Of those, the same mSNV was identified in all SP/DPP combinations for 9 positions (24.3%), in seven or eight of the SP/DPP combinations for 2 positions (5.4%), and in at least two SP/DPP combinations for 19 positions (51.4%), although not always at a frequency of ≥5%. However, for 18 positions (48.6%) the mSNV was not reproduced at a ≥1% frequency in any of the other SP/DPP combinations (Table 3). Focusing on the separate SP data analysed in the 3 DPPs, most of the identified positions with ≥5% mSNVs in at least 1 SP/DPP combination were identified in SP1 data (47%), followed by SP2 (29%) and SP3 (24%) data.

Looking at the ≥5% mSNV reproducibility per SP dataset in all three DPPs within these thirty-seven positions, forty-eight SP datasets showed a ≥5% mSNV in at least one of the DPP outputs. Additionally, for eleven positions, all in the DETU virus, the variant was reproduced by all DPPs, but at a <5% frequency (for instance SP3 data at PB2.1054, and SP1 and SP2 data at NA.65). In 53% (31/59) of cases the same mSNV from one SP dataset was reproduced in all three DPPs at a frequency of at least 1%; in 31% (18/59) of cases the variant was detected in only one DPP, even though the coverage in the other DPPs was theoretically high enough to detect variants at the 1% level.

Lowering the threshold to an mSNV frequency of ≥1% resulted in a large increase in the number of positions identified with mSNVs. To investigate the reproducibility of these mSNVs, the data for all 3 viruses were combined per SP across the three DPPs (influence of DPP), and per DPP across data from the three SPs (influence of SP). The genome positions with ≥1% variants were listed per SP/DPP combination and entered into the program Venny 2.1, which calculated the overlapping positions between SP/DPP combinations as a fraction of the total number of positions, resulting in Fig 3 (a sketch of this overlap computation is given after the figure legend). It should be noted that SP3 in particular did not always reach the minimum coverage requirements and may therefore not be suitable for detecting low-frequency variants (see also Table 4). Positions where the coverage in one or more of the nine SP/DPP combinations did not meet the minimum required coverage of 298 were not included in the comparison in Fig 3. The reproducibility of ≥1% variants using one SP dataset in all three DPPs was 10%, 9.4% and 31.1% for SP1, SP2 and SP3 sequences, respectively. The reproducibility of ≥1% variants using raw data of a virus sequenced on three different SPs was 20%, 7.4% and 22.6% for DPP1, DPP2 and DPP3, respectively (Fig 3). Most ≥1% variants (~75%) in SP1 and SP2 data were not reproduced by any of the other DPPs processing the same SP data. This proportion was lower for SP3 data, but that might be because many positions identified in SP3 data did not meet the minimum coverage criteria and were therefore discarded.

Fig 3. The reproducibility of ≥1% variants with sufficient coverage (>298) for all sequence data combined.

Each panel shows the number of ≥1% variants detected per sequence platform (SP, top row) and per data processing pipeline (DPP, bottom row) for SP1/DPP1 (left column), SP2/DPP2 (middle column), and SP3/DPP3 (right column). The colours represent the different DPPs and SPs, respectively, in which the ≥1% variants were detected: SP1/DPP1 (purple), SP2/DPP2 (yellow) and SP3/DPP3 (green). Positions with ≥1% variants that were identified in more than one of the SPs or DPPs, respectively, are displayed in the overlapping coloured areas, with the centre part representing the number of ≥1% variants that were detected with all three DPPs (top row) or SPs (bottom row). The total number of positions with ≥1% variants detected was 250 in SP1, 213 in SP2, 45 in SP3, and 50 in DPP1, 353 in DPP2, and 93 in DPP3. This figure was produced using Venny 2.1.

https://doi.org/10.1371/journal.pone.0229326.g003
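The overlap statistic behind Fig 3 can be expressed compactly. The sketch below is a minimal illustration (not the Venny tool itself): each set holds the (segment, position) tuples with a ≥1% variant for one DPP, and reproducibility is the three-way intersection as a fraction of the union. The example positions are hypothetical.

```python
# Sketch of the Venny-style overlap computation: reproducibility of >=1%
# variant positions across the three DPPs for one SP dataset.
def reproducibility(pos_dpp1, pos_dpp2, pos_dpp3):
    """Each argument: set of (segment, position) tuples with a >=1% variant
    and sufficient coverage (>298) in all nine SP/DPP combinations."""
    shared = pos_dpp1 & pos_dpp2 & pos_dpp3  # found by all three DPPs
    total = pos_dpp1 | pos_dpp2 | pos_dpp3   # found by any DPP
    return len(shared) / len(total) if total else 0.0

# Hypothetical example: 1 shared position out of 3 total -> ~0.33
print(reproducibility({("HA", 963), ("PB2", 1879)},
                      {("HA", 963), ("NA", 65)},
                      {("HA", 963)}))
```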

Table 4. The minority variants occurring as a ≥1% variant in at least one of the sequence platform/data processing pipeline combinations in the HA segment of the DETU sample, with a minimum coverage of 298 reads at that position.

https://doi.org/10.1371/journal.pone.0229326.t004

For brevity, detailed results are shown only for the HA gene segment of the DETU virus (Table 4). This segment was chosen because it showed the best reproducibility of results for ≥5% minority variants across all SP/DPP combinations. In the DETU HA segment, 33 positions containing an mSNV occurring in ≥1% of reads with sufficient coverage (≥298 reads) were identified. Only 3 of these positions (9%) were identified in all SP/DPP combinations. The majority of the positions (25/33, 76%) were identified in only one of the nine SP/DPP combinations. However, it should be noted that for 11 of those positions the SP3 data coverage was insufficient in all three DPPs to detect ≥1% variants (Table 4).

Although a comparison of the frequencies of the detected mSNVs might seem appropriate, given that even the presence or absence of mSNVs was poorly comparable in these results, a further in-depth analysis of these frequencies was not performed because of its limited value.

Determining the influence of the minor variant detection method.

To isolate the effect of the mSNV identification step in the DPP, independent of the alignment step, quality-trimmed alignment files (*.bam files) of the data (subdivided per virus, per SP and per DPP) were shared, subjected to the same DPP mSNV detection process (in this case DPP3), and compared to the original outcomes from DPP1 and DPP2 (Table 5). At the majority of positions, the different mSNV identification processes did not influence the results, as 84% (119/142) of the mSNVs were identified regardless of the mSNV identification process. Twenty-three mSNVs that were not reproduced by the DPP3 mSNV identification analysis were reproduced when the 'Direction and Position Filters' in DPP3 were ignored (Table 5, marked with # or ##). These parameters filter out mSNVs when the set criteria for read direction (the variant must occur in both forward and reverse reads), relative read direction (a statistical approach to forward/reverse balance) and read position (removal of systematic errors) are not met. However, DPP1 and DPP2 contain similar quality parameters in their mSNV identification processes, indicating that different DPPs deal differently with quality parameters, and data could be excluded or included depending on the DPP used. In addition, 9 additional mSNVs were identified in the *.bam files compared to the original mSNV outputs. It should be noted that the coverage of SP data analysed by DPP1 at positions identified with mSNVs was considerably lower than the coverage at those positions in the input *.bam files, suggesting additional quality filtering in the mSNV detection step of DPP1. However, the influence on mSNV identification was limited, most likely due to the initially high nucleotide coverage.
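To illustrate the kind of 'Direction' filter described above, the sketch below retains a variant only when it is supported by reads in both orientations. This is a simplified stand-in for the CLC filters, which additionally apply a statistical forward/reverse balance test and a read-position check.

```python
# Sketch of a read-direction filter: a variant is kept only if it is
# supported by both forward- and reverse-oriented reads, a simple guard
# against strand-specific artefacts.
def passes_direction_filter(fwd_variant_reads, rev_variant_reads,
                            min_per_strand=1):
    return (fwd_variant_reads >= min_per_strand
            and rev_variant_reads >= min_per_strand)

print(passes_direction_filter(25, 0))   # False: all support on one strand
print(passes_direction_filter(13, 12))  # True: support on both strands
```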

Table 5. The reproducibility of positions with at least one ≥5% variant when alignment files from the respective DPPs are all uploaded into DPP3 for only the mSNV identification process versus when the mSNV identifications are fully performed by the respective DPPs.

https://doi.org/10.1371/journal.pone.0229326.t005

To better visualise the differences in coverage and allele counts, a graphical display of the data for four positions showing mSNVs at different frequencies for each SP/DPP combination is included in S2 Fig. In general, mSNVs were rarely missed due to low coverage, since even high-coverage SP/DPP combinations displayed discrepancies (Tables 3 and 4).

Discussion

NGS data are used for many different applications. Although sequencing technologies and the accompanying analysis tools are subject to rapid development, much follow-up research is based on initial findings. Accuracy and repeatability are key values for proper scientific research, but the impact of NGS results also reaches beyond science to clinical settings, where important clinical management and treatment decisions are based on such results. In this study, the comparability of NGS data analyses was assessed using identical input material per virus but different laboratory workflows, from nucleic acid extraction and sequencing to data analysis. In addition, the COMPARE "Data Hub" platform was tested for the purpose of sharing large raw data files between institutions in an outbreak situation. Using this platform, raw sequence data files up to 8 gigabytes in size, alignment files and metadata files of three influenza A/H5N8 viruses were successfully shared in real-time among 3 institutions, allowing independent sequencing and analysis procedures, including mSNV identification, to be performed. The Data Hub is available to all institutions.

The aim of this study was to determine how comparable consensus and minority variant results were between laboratories performing their standard analyses, and whether discrepancies could be attributed to the SP, DPP or a combination of both. With the lack of a ground truth/gold standard, all data obtained were compared amongst each other. Importantly, reliable consensus sequences were generated independently of the SP/DPP combination used, although the well-known artefactual InDels in homopolymer regions in SP3 (Roche 454 genome sequencer) sequence data required manual editing. Such consensus sequences routinely form the basis for a detailed characterization of the influenza strain in an outbreak situation, as they are used for the prediction of pathogenicity and pandemic potential of influenza strains.

In contrast to the reproducible generation of consensus genome sequences, the hypothesis that minority variants can be identified reproducibly has to be rejected. The observed differences were mainly attributed to the alignment processes in the different DPPs. The interpretation of minority variant analysis thus requires careful standardization and an awareness of the possible limitations, as shown in this study. The reproducibility of mSNV results appeared to be influenced by both the different SPs (resulting in different sequencing depths, Fig 2) and the DPPs (resulting in differences in alignment and mSNV identification of the same input data, Fig 2 and Table 5). There was limited reproducibility of mSNV identification data, even for relatively high-frequency mSNVs. As expected, reproducibility was best (30%) for mSNVs occurring at high frequency (≥10%), and lowest for the low-frequency (≥1%) mSNVs (9.4% to 31.1%). Also, the number of positions with 1–5% mSNVs (with sufficient coverage) was much higher (250 in SP1 data, 213 in SP2 data, and 45 in SP3 data) than the number of positions with >5–10% mSNVs (n = 27) or >10% mSNVs (n = 10).

The set-up of this study allowed many variables to influence the final result. The differences, from initial laboratory procedures and sample preparation through to the final analysis methods, may all have contributed to the observed differences in mSNV identification. At this level, especially in the absence of an NGS gold standard, it becomes difficult to determine which identified mSNVs are 'true variants' and which could be due to systematic errors introduced by RNA isolation methods, amplification or sequencing, or manipulated by data processing pipeline settings. Unsurprisingly, the results of this study imply that the choice of SP influences the final output, but they also indicate that the DPP, especially the alignment process, influences coverage. The SP- and DPP-derived differences in coverage are important because, up to a certain (currently unknown, probably SP/DPP-dependent) threshold, higher coverage will provide a more reliable result about the presence of mSNVs. Although the aim of this study was explicitly to compare the three institutions' own standard workflows, some parameters (such as the phred score and detection limit) were synchronized between the different DPPs. Moreover, the data from each SP were re-processed in each DPP. However, all DPPs use different underlying algorithms and interpret the set parameters differently, which might all contribute to the observed differences. These results are partly in line with previous research that showed the need for NGS result validation and concluded that only those mSNVs with a coverage >100 and a frequency of >40% could be identified by NGS methods without secondary confirmation [32]; however, that conclusion was based on using the same sample preparation method within a single laboratory. Another recent study set the cut-off for intrahost virus diversity at 3%, with input of at least 1000 RNA copies and a read depth of at least 400x at each genome position for Illumina sequencing [33].

Although some studies have been published on SP error rates [34–37] and PCR amplification-induced variants [38–41], a gold standard system for mSNV analysis is lacking. In addition, the DPPs can alter the data through the elimination or inclusion of certain sequences based on the set quality parameters. Allowing too many low-quality reads, or being too stringent on the data, will influence the coverage per position and might also influence the accuracy of mSNV identification, especially when the coverage is low [42, 43]. Although a low comparability of mSNVs identified in the different SP and DPP combinations was observed, it can be concluded that 454 (SP3) sequencing has approximately the same accuracy as Illumina (SP1 and SP2) sequencing, based on the number and percentage of reproducible mSNVs in this dataset when InDel errors in homopolymer regions are ignored. Although Roche 454 sequencing machines are no longer in production, it added value to include 454 sequencing as an alternative sequencing platform with alternative chemistry to Illumina. In addition, because Roche 454 was the first commercially successful next-generation sequencing system, it was used in research that served as a foundation for follow-up studies [44]. A comparison of Illumina with newer third- or fourth-generation sequencing platforms (e.g. Nanopore or PacBio) would be interesting in the future. However, their overall error rate remains higher than that of the shorter-read technologies, and recent work concludes that these new platforms are currently not suitable for the detection of minor variants [33]. In addition, it would be interesting to compare mSNV results of SPs producing short sequence reads (like Illumina, 454 and Ion Torrent) with new sequencing techniques that output full-length sequence data (e.g. Nanopore [45]). The latter might be less vulnerable to quality trimming parameters than short reads and might provide more consistent nucleotide coverage over complete gene segments.

For mSNV analyses by different labs, very stringent SP/DPP protocols need to be evaluated, for instance by cross-validating results. To allow better comparison, it would be advisable to create some kind of gold standard, for instance by evaluating parameters based on sequencing of technical replicates and controlled mixes of clones. The mSNV analysis can be valuable for epidemiological tracing and for monitoring early evolutionary events, drug resistance and possibly host adaptation, but this would require reproducibility of study outcomes within and between laboratories. As this is currently not the case, more understanding is needed of the biases and errors generated by sample processing (enrichment procedures), sequencing strategy (amplicons, shotgun), sequencing chemistry (each of which has its own internal error rates) and the approach to data processing and analysis. Understanding the parameters and thresholds in the software can be difficult, and a systematic study using a pipeline in which the effect of changing each of these parameters is assessed, both individually and in combination, is required to determine the optimal settings for minor variant analysis.

As alternate high-throughput sequencing technologies arise there will be a need to understand inherent error profiles and how those are handled in data processing approaches. Cross-validation should be supported by international proficiency tests on NGS techniques including mSNV analyses that would be instrumental in validation of results and may foster the trust in NGS-based diagnostics.

Supporting information

S1 Table. PCR primers used in SP3 to cover the influenza A H5N8 gene segments.

https://doi.org/10.1371/journal.pone.0229326.s001

(PDF)

S2 Table. SP/DPP overarching consensus sequences.

https://doi.org/10.1371/journal.pone.0229326.s002

(PDF)

S3 Table. Number of raw sequences and influenza virus reads per SP per virus.

https://doi.org/10.1371/journal.pone.0229326.s003

(PDF)

S1 Fig. Nucleotide coverage.

The non-normalised nucleotide coverage, displayed as the number of reads per position, for full genome sequences of the UKDD and DETU virus reads mapped to the corresponding reference sequences. Panel A shows the coverage results for the same SP dataset in the three different DPPs (DPP1: purple; DPP2: orange; DPP3: grey) for each of the SP datasets. Panel B shows the coverage when the same DPP is used to analyse data from the three different SPs (SP1: lilac; SP2: yellow; SP3: green) for each of the DPPs. The X-axis represents the position in the genome; the Y-axis represents the number of sequence reads per position.

https://doi.org/10.1371/journal.pone.0229326.s005

(TIF)

S2 Fig. Graphical display of the coverage and allele counts for four positions, showing mSNVs in different frequencies for each SP/DPP combination.

Arrows indicate the approximate percentages in which the mSNVs were detected; 1–5% (orange), 5–10% (purple) and >10% (green).

https://doi.org/10.1371/journal.pone.0229326.s006

(TIF)

Acknowledgments

The authors would like to thank the staff of the European Nucleotide Archive and all technical staff involved in the supporting laboratory and avian surveillance work.

References

1. Heather J. and Chain B., The sequence of sequencers: The history of sequencing DNA. Genomics, 2016. 107(1): p. 1–8. pmid:26554401
2. Van Dijk E, Auger H, Jaszczyszyn Y, Thermes C, Ten years of next-generation sequencing technology. Trends Genet, 2014. 30(9): p. 418–26. pmid:25108476
3. Ekblom R. and Galindo J., Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity (Edinb), 2011. 107(1): p. 1–15.
4. Köser C, Holden M, Ellington M, Cartwright E, Brown N, Ogilvy-Stuart A, et al., Rapid whole-genome sequencing for investigation of a neonatal MRSA outbreak. N Engl J Med, 2012. 366(24): p. 2267–75. pmid:22693998
5. Mellmann A, Harmsen D, Cummings C, Zentz E, Leopold S, Rico A, et al., Prospective genomic characterization of the German enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology. PLoS One, 2011. 6(7): p. e22751. pmid:21799941
6. Leitner T, Halapi E, Scarlatti G, Rossi P, Albert J, Fenyö E, et al., Analysis of heterogeneous viral populations by direct DNA sequencing. Biotechniques, 1993. 15(1): p. 120–7. pmid:8363827
7. Tsiatis A, Norris-Kirby A, Rich R, Hafez M, Gocke C, Eshleman J, et al., Comparison of Sanger sequencing, pyrosequencing, and melting curve analysis for the detection of KRAS mutations: diagnostic and clinical implications. J Mol Diagn, 2010. 12(4): p. 425–32. pmid:20431034
8. Glenn T., Field guide to next-generation DNA sequencers. Mol Ecol Resour, 2011. 11(5): p. 759–69. pmid:21592312
9. Li Y, Lei K, Kshatriya P, Gu J, Ballesteros-Villagrana, et al., Ion Torrent™ Next Generation Sequencing–Detect 0.1% Low Frequency Somatic Variants and Copy Number Variations simultaneously in Cell-Free DNA. Thermo Fisher Scientific, 2017.
10. Schirmer M, D'Amore R, Ijaz U, Hall N, Quince C, Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data. BMC Bioinformatics, 2016. 17: p. 125. pmid:26968756
11. Lou DI, Hussman J, McBee J, Acevedo A, Andino R, Press W, et al., High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc Natl Acad Sci U S A, 2013. 110(49): p. 19872–7. pmid:24243955
12. World Organisation for Animal Health (OIE), Update on highly pathogenic avian influenza in animals (type H5 and H7). 2014.
13. World Organisation for Animal Health (OIE), Update on highly pathogenic avian influenza in animals (type H5 and H7). 2015.
14. Hanna A, Banks J, Marston D, Ellis R, Brookes S, Brown I, Genetic Characterization of Highly Pathogenic Avian Influenza (H5N8) Virus from Domestic Ducks, England, November 2014. Emerg Infect Dis, 2015. 21(5): p. 879–82. pmid:25898126
15. Harder T, Maurer-Stroh S, Pohlmann A, Starick E, Höreth-Böntgen D, Albrecht A, et al., Influenza A(H5N8) Virus Similar to Strain in Korea Causing Highly Pathogenic Avian Influenza in Germany. Emerg Infect Dis, 2015. 21(5): p. 860–3. pmid:25897703
16. Bouwstra R, Heutink R, Bossers A, Harders F, Koch G, Elbers A, Full-Genome Sequence of Influenza A(H5N8) Virus in Poultry Linked to Sequences of Strains from Asia, the Netherlands, 2014. Emerg Infect Dis, 2015. 21(5): p. 872–4. pmid:25897965
17. Verhagen J, Van Der Jeugd H, Nolet B, Slaterus R, Kharitonov S, De Vries P, et al., Wild bird surveillance around outbreaks of highly pathogenic avian influenza A(H5N8) virus in the Netherlands, 2014, within the context of global flyways. Euro Surveill, 2015. 20(12).
18. Poen M, Bestebroer T, Vuong O, Scheuer R, Van Der Jeugd H, Kleyheeg E, et al., Local amplification of highly pathogenic avian influenza H5N8 viruses in wild birds in the Netherlands, 2016 to 2017. Euro Surveill, 2018. 23(4).
19. Global Consortium for H5N8 and Related Influenza Viruses, Role for migratory wild birds in the global spread of avian influenza H5N8. Science, 2016. 354(6309): p. 213–217. pmid:27738169
20. Harrison P, Alako B, Amid C, Cerdeno-Tárraga A, Cleland I, Holt S, et al., The European Nucleotide Archive in 2018. Nucleic Acids Research, 2019. 47(D1): p. D84–D88. pmid:30395270
21. Karsch-Mizrachi I, Takagi T, Cochrane G, The international nucleotide sequence database collaboration. Nucleic Acids Research, 2018. 46(D1): p. D48–D51. pmid:29190397
22. Amid C, Pakseresht N, Silvester N, Jayathilaka S, Lund O, Dynocski L, et al., The COMPARE Data Hubs. bioRxiv, 2019: p. 555938.
23. Richard M, Herfst S, Van Den Brand J, Lexmond P, Bestebroer T, Rimmelzwaan G, et al., Low Virulence and Lack of Airborne Transmission of the Dutch Highly Pathogenic Avian Influenza Virus H5N8 in Ferrets. PLoS One, 2015. 10(6): p. e0129827. pmid:26090682
24. Linster M, Van Boheemen S, De Graaf M, Schrauwen E, Lexmond P, Mänz B, et al., Identification, characterization, and natural selection of mutations driving airborne transmission of A/H5N1 virus. Cell, 2014. 157(2): p. 329–339. pmid:24725402
25. Li H, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv, 2013.
26. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078–9. pmid:19505943
27. Zerbino D. and Birney E., Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 2008. 18(5): p. 821–9. pmid:18349386
28. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al., BLAST+: architecture and applications. BMC Bioinformatics, 2009. 10: p. 421. pmid:20003500
29. Hwang S, Kim E, Lee I, Marcotte E, Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci Rep, 2015. 5: p. 17875. pmid:26639839
30. Hall T, BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symposium Series, 1999. 41: p. 95–98.
31. Dell R, Holleran S, Ramakrishnan R, Sample size determination. ILAR J, 2002. 43(4): p. 207–13. pmid:12391396
32. Mu W, Lu H, Chen J, Li S, Elliott A, Sanger Confirmation Is Required to Achieve Optimal Sensitivity and Specificity in Next-Generation Sequencing Panel Testing. J Mol Diagn, 2016. 18(6): p. 923–932. pmid:27720647
33. Grubaugh N, Gangavarapu K, Quick J, Matteson N, Goes De Jesus J, Main B, et al., An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol, 2019. 20(1): p. 8. pmid:30621750
34. Golan D. and Medvedev P., Using state machines to model the Ion Torrent sequencing process and to improve read error rates. Bioinformatics, 2013. 29(13): p. i344–51. pmid:23813003
35. Manley L, Ma D, Levine S, Monitoring Error Rates In Illumina Sequencing. J Biomol Tech, 2016. 27(4): p. 125–128. pmid:27672352
36. Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, et al., Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res, 2011. 39(13): p. e90. pmid:21576222
37. Shao W, Boltz V, Spindler J, Kearney M, Maldarelli F, Mellors J, et al., Analysis of 454 sequencing error rate, error sources, and artifact recombination for detection of low-frequency drug resistance mutations in HIV-1 DNA. Retrovirology, 2013. 10: p. 18. pmid:23402264
38. Acinas S, Sarma-Rupavtarm R, Klepac-Ceraj V, Poltz M, PCR-induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample. Appl Environ Microbiol, 2005. 71(12): p. 8966–9. pmid:16332901
39. Gorzer I, Guelly C, Trajanoski S, Puchhammer-Stöckl E, The impact of PCR-generated recombination on diversity estimation of mixed viral populations by deep sequencing. J Virol Methods, 2010. 169(1): p. 248–52. pmid:20691210
40. Judo M, Wedel A, Wilson C, Stimulation and suppression of PCR-mediated recombination. Nucleic Acids Res, 1998. 26(7): p. 1819–25. pmid:9512558
41. Meyerhans A, Vartanian J, Wain-Hobson S, DNA recombination during PCR. Nucleic Acids Res, 1990. 18(7): p. 1687–91. pmid:2186361
42. Quail M, Smith M, Coupland P, Otto T, Harris S, Connor T, et al., A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics, 2012. 13: p. 341. pmid:22827831
43. Sims D, Sudbery I, Ilott N, Heger A, Ponting C, Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet, 2014. 15(2): p. 121–32. pmid:24434847
44. Liu L, Li Y, Li S, Hu N, He Y, Pong R, et al., Comparison of next-generation sequencing systems. J Biomed Biotechnol, 2012. 2012: p. 251364. pmid:22829749
45. Keller M, Rambo-Martin B, Wilson M, Ridenour C, Shepard S, Start T, et al., Direct RNA Sequencing of the Coding Complete Influenza A Virus Genome. Sci Rep, 2018. 8(1): p. 14408. pmid:30258076