Likelihood-Based Inference of B Cell Clonal Families

Duncan K. Ralph; Frederick A. Matsen IV

doi:10.1371/journal.pcbi.1005086

Abstract

The human immune system depends on a highly diverse collection of antibody-making B cells. B cell receptor sequence diversity is generated by a random recombination process called “rearrangement” forming progenitor B cells, then a Darwinian process of lineage diversification and selection called “affinity maturation.” The resulting receptors can be sequenced in high throughput for research and diagnostics. Such a collection of sequences contains a mixture of various lineages, each of which may be quite numerous, or may consist of only a single member. As a step to understanding the process and result of this diversification, one may wish to reconstruct lineage membership, i.e. to cluster sampled sequences according to which came from the same rearrangement events. We call this clustering problem “clonal family inference.” In this paper we describe and validate a likelihood-based framework for clonal family inference based on a multi-hidden Markov Model (multi-HMM) framework for B cell receptor sequences. We describe an agglomerative algorithm to find a maximum likelihood clustering, two approximate algorithms with various trade-offs of speed versus accuracy, and a third, fast algorithm for finding specific lineages. We show that under simulation these algorithms greatly improve upon existing clonal family inference methods, and that they also give significantly different clusters than previous methods when applied to two real data sets.

Author Summary

Antibodies must recognize a great diversity of antigens to protect us from infectious disease. The binding properties of antibodies are determined by the DNA sequences of their corresponding B cell receptors (BCRs). These BCR sequences are created in naive form by VDJ recombination, which randomly selects and trims the ends of V, D, and J genes, then joins the resulting segments together with additional random nucleotides. If they pass initial screening and bind an antigen, these sequences then undergo an evolutionary process of reproduction, mutation, and selection, revising the BCR to improve binding to its cognate antigen. It has recently become possible to determine the BCR sequences resulting from this process in high throughput. Although these sequences implicitly contain a wealth of information about both antigen exposure and the process by which we learn to resist pathogens, this information can only be extracted using computer algorithms. In this paper we describe a likelihood-based statistical method to determine, given a collection of BCR sequences, which of them are derived from the same recombination events. It is based on a hidden Markov model (HMM) of VDJ rearrangement which is able to calculate likelihoods for many sequences at once.

Citation: Ralph DK, Matsen FA IV (2016) Likelihood-Based Inference of B Cell Clonal Families. PLoS Comput Biol 12(10): e1005086. https://doi.org/10.1371/journal.pcbi.1005086

Editor: Bjoern Peters, La Jolla Institute for Allergy and Immunology, UNITED STATES

Received: October 8, 2015; Accepted: July 27, 2016; Published: October 17, 2016

Copyright: © 2016 Ralph, Matsen. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The "Adaptive" data set is available at http://adaptivebiotech.com/link/mat2015. The "Vollmers" data set is available at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000656.v1.p1.

Funding: Both authors were supported by National Institutes of Health R01 GM113246 (PI Matsen), R01 AI103981, and U19 AI117891. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

This is a PLOS Computational Biology methods paper.

Introduction

B cells effect the antibody-mediated component of the adaptive immune system. The antigen-binding properties of B cells are defined by their B cell receptor, or BCR. BCRs bind a wide variety of antigens, and this flexibility arises from their developmental pathway. B cells begin life as hematopoietic stem cells. After a number of differentiation steps the cells perform somatic recombination, or rearrangement. For the heavy chain locus, a V gene, D gene, and J gene are randomly selected, trimmed some random amount by an exonuclease, and then joined together with random nucleotides (forming so-called N-regions). The light chain process is slightly simpler, in that only a V and J recombine, but proceeds via similar trimming and joining processes. These processes form the third complementarity determining region (CDR3) in each of the heavy and the light chain, which are important determinants of antibody binding properties. Then a series of checkpoints on the BCRs ensure that the resulting immunoglobulin is functional and not self-reactive through negative selection (reviewed in [1]). This process results in naive B cells with fully functioning receptors. When stimulated by binding to antigen in a germinal center, naive cells reproduce and mutate via the process of somatic hypermutation, and then are selected on the basis of antigen binding and presentation to T follicular helper cells [2]. This process is called affinity maturation. It is now possible to sequence B cell receptors in high throughput, which in principle describes not only the collections of antigens to which the immune system is ready to react, but also implicitly narrates how they came to be.

It is of great practical interest for researchers to be able to reconstruct events of this development process using BCR sequence data. Such reconstruction would shed light on the process of B cell receptor maturation, a subject of continual study since the landmark work of Eisen and Siskind in 1964 [3, 4]. Furthermore, there are specific maturation pathways of great importance, such as the B cell lineages leading to broadly neutralizing antibodies to HIV [5, 6]. Being able to reconstruct the structure and history of these lineages allows investigation of the binding properties of these intermediates, which could be helpful to design effective vaccination strategies to elicit high-affinity antibodies [7]. For example, recent studies have shown the promise of a sequential immunization program for eliciting these antibodies [8]; lineage reconstruction will aid in identifying desirable intermediate BCRs.

The clonal family inference problem is an intermediate step to such lineage reconstruction (Fig 1). Rather than trying to reconstruct the full lineage history of the set of sequences, the goal is only to reconstruct which sequences came from the same rearrangement event. Full lineage reconstruction would also require building phylogenetic trees for each of the clonal families. However, these clonal families can be an object of interest themselves [9].

Fig 1. The clonal family inference problem.

The B cell receptor generation process begins by VDJ recombination, which makes a naive B cell. When stimulated by antigen, those naive cells diversify through the mutation and selection processes of affinity maturation, creating many lineages of B cells shown here as phylogenetic trees with the naive cells at the root of the tree. The ensemble of B cells descending from a single rearrangement event is called a clonal family. In this paper we develop methods to reconstruct clonal families from B cell receptor sequences.

https://doi.org/10.1371/journal.pcbi.1005086.g001

Our approach to the clonal family inference problem, like many before it, is motivated by the special structure of BCR sequences (which for simplicity we describe for the heavy chain; the same concepts and approaches can be applied to the light chain). This structure follows from VDJ recombination and affinity maturation: for example, by definition the identity of the germline genes cannot change through affinity maturation. Thus, if the per-read germline gene identity could be inferred without error, then any pair of sequences from a clonal family must have the same inferred germline gene identity. If one also assumes that sequences evolve only through point mutation, then sequences must have identical-length CDR3s if they are to be in the same clonal family.

Most current methods for B cell clonal family inference make these assumptions, and proceed by first stratifying sequences by inferred V and J germline genes and CDR3 length, then only consider pairs of sequences within a stratum as potential members of the same clonal family. If one assumes further that any clonal families with pairs of highly diverged sequences also contain intermediates between those sequences, one might assume that there is a path between any pair of sequences such that neighboring sequences in a path are similar. This suggests a strategy in which pairs of sequences that are similar at some level (such as 90% similar in terms of nucleotides) in the CDR3 are considered to be in the same clonal family, and where membership is transitive, which corresponds to an application of single-linkage clustering.
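The stratify-then-link strategy described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout `(seq_id, v_gene, j_gene, cdr3)`, the function names, and the use of "greater than or equal to" at the threshold are all hypothetical choices, and the annotations are taken as given rather than inferred.

```python
from collections import defaultdict
from itertools import combinations


def hamming_similarity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def vj_cdr3_clusters(records, threshold=0.9):
    """Single-linkage clustering within (V gene, J gene, CDR3 length) strata.

    `records` is a list of (seq_id, v_gene, j_gene, cdr3) tuples; this layout
    is an assumption made for the sketch.
    """
    # Union-find over sequence ids gives transitive (single-linkage) membership.
    parent = {seq_id: seq_id for seq_id, _, _, _ in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Only sequences in the same stratum are ever compared.
    strata = defaultdict(list)
    for seq_id, v_gene, j_gene, cdr3 in records:
        strata[(v_gene, j_gene, len(cdr3))].append((seq_id, cdr3))

    for members in strata.values():
        for (id_a, cdr3_a), (id_b, cdr3_b) in combinations(members, 2):
            if hamming_similarity(cdr3_a, cdr3_b) >= threshold:
                parent[find(id_a)] = find(id_b)

    clusters = defaultdict(list)
    for seq_id in parent:
        clusters[find(seq_id)].append(seq_id)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

Note that two sequences can land in the same cluster even when their own CDR3 similarity is below the threshold, as long as a chain of sufficiently similar intermediates connects them.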

Instead of designing such an algorithm that works only when a set of rigid, predefined assumptions are satisfied, an alternative is to formalize a model of B cell affinity maturation into a generative probabilistic process with a corresponding likelihood function. Once this likelihood function is defined, one can infer clonal families by finding the clustering that maximizes the likelihood of generating the observed sequences.

Likelihood methods in the form of hidden Markov models (HMMs) have been applied to B cell receptor sequences for a decade [10–13]. This previous work used HMMs to analyze individual sequences. For likelihood-based clustering we are only aware of the work of Laserson [14, 15], who uses Markov chain Monte Carlo to infer clusters via a Dirichlet mixture model (reviewed in [16]). Unfortunately the Laserson algorithm is only described in a PhD thesis and does not appear to be publicly available. In related work, Kepler [17, 18] uses a likelihood-based phylogenetics framework to perform joint reconstruction of annotated ancestor sequence and a phylogenetic tree.

In this paper we present a method for inferring clonal families in an HMM-based framework that comfortably scales to tens of thousands of sequences via parallel algorithms, with approximations that scale to hundreds of thousands of sequences. For situations in which specific lineages are of interest, users can specify “seed” sequences and find the clonal family containing that seed in repertoires with one million sequences. Our clustering algorithm is based on a “multi-HMM” framework for BCR sequences that we have previously applied to the annotation problem: to infer the origin of each nucleotide in a BCR (or TCR) sequence from the VDJ rearrangement process [19]. We use this framework to define a likelihood ratio comparing two models which differ by the collapse of two clonal families into one, and use it for agglomerative clustering. Because this likelihood ratio comes from an application of the forward algorithm for HMMs, it integrates out all possible VDJ annotations. We find that it outperforms previous algorithms on simulated data, and that it makes a significant difference when applied to real data.

Results

Likelihood framework

In order to calculate a set of probabilities suitable for use in the clonal family inference problem, we begin with the HMM framework introduced in [19]. In that paper we focused on inferring parameters of an HMM and using it to obtain BCR annotated ancestor sequences, which was primarily based on the most likely path through each HMM, i.e. the Viterbi path. We also described Viterbi annotation with a multi-HMM, i.e. annotation using a collection of sequences that were assumed to form a clonal family.

In this application, we will use the forward algorithm for HMMs [20] to obtain the corresponding marginal probability, which is the sum of sequence generation probabilities over all possible paths through the HMM. This is a more appropriate tool for the clonal family inference problem because here we are interested in integrating over annotated ancestor sequences (that is, paths through the HMM) to decide whether sequences are related. By using a multi-HMM, we can use this total probability to calculate a likelihood ratio that two clusters derive from the same, or from different, rearrangement events. We perform agglomerative clustering using this likelihood ratio to group sequences for which the probability of a common ancestry is higher than that of separate ancestry (details in the Methods). This approach allows us to calculate the total probability of the partition (i.e. clustering) at each stage in the clustering process, which provides both an objective measure of partition quality, and easy access to not only the most likely partition but also to a range of likely partitions of varying degrees of refinement. As in our previous work, the parameters of the HMM can be inferred “on the fly” given a sufficiently large data set or be inferred on some other data set. Briefly, we do a cycle of Viterbi training, which is started with an application of Smith-Waterman alignment, in which the best annotation for each sequence with a current parameter set is used to infer parameters for the next cycle. As described in detail elsewhere [19], data is aggregated if there are insufficient observations for a given allele for training.
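To make the agglomerative step concrete, the following sketch merges clusters greedily while some merge is favored by a log likelihood ratio. Several things here are assumptions for illustration only: the ratio is written as log P(A ∪ B) − log P(A) − log P(B), the total partition probability is taken to be the product of cluster probabilities, and `toy_log_prob` is a deliberately simple stand-in (uniform ancestral base per site, summed out, with a fixed per-site mutation probability `mu`) rather than the multi-HMM forward probability used by partis.

```python
import math
from itertools import combinations

BASES = "ACGT"


def toy_log_prob(cluster, mu=0.05):
    """Toy marginal log-likelihood of equal-length sequences sharing one ancestor.

    Each site's ancestral base is uniform over ACGT and is summed out; each
    observed base equals the ancestor with probability 1 - mu. This is a
    stand-in for a multi-HMM forward probability, not the real model.
    """
    total = 0.0
    for pos in range(len(cluster[0])):
        site = 0.0
        for base in BASES:
            p = 0.25
            for seq in cluster:
                p *= (1.0 - mu) if seq[pos] == base else mu / 3.0
            site += p
        total += math.log(site)
    return total


def agglomerate(sequences, log_prob=toy_log_prob):
    """Greedily merge the best pair while its log likelihood ratio is positive.

    Returns a list of (partition, total log-probability), one entry per step.
    """
    partition = [(s,) for s in sequences]
    logps = [log_prob(c) for c in partition]
    history = [(list(partition), sum(logps))]
    while len(partition) > 1:
        best = None
        for i, j in combinations(range(len(partition)), 2):
            merged_logp = log_prob(partition[i] + partition[j])
            log_ratio = merged_logp - logps[i] - logps[j]
            if best is None or log_ratio > best[0]:
                best = (log_ratio, i, j, merged_logp)
        log_ratio, i, j, merged_logp = best
        if log_ratio <= 0.0:
            break  # no merge is favored over separate ancestry
        merged = partition[i] + partition[j]
        partition = [c for k, c in enumerate(partition) if k not in (i, j)] + [merged]
        logps = [lp for k, lp in enumerate(logps) if k not in (i, j)] + [merged_logp]
        history.append((list(partition), sum(logps)))
    return history
```

Because every intermediate partition is stored with its total log-probability, the caller can inspect not only the final partition but also the coarser and finer alternatives visited along the way.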

Approximate methods

In addition to this principled method for full-repertoire reconstruction, we have implemented two more approximate versions which trade some accuracy for substantial increases in speed. In the first, which we call point partis, we forgo integration over all possible annotated ancestor sequences and instead find the most likely naive sequence point estimate for each cluster. Clusters are then compared based on the Hamming fraction (Hamming distance divided by sequence length) between their respective naive sequences, and are merged if the distance is smaller than some threshold. This threshold is set dynamically based on the observed mutation rate in the sample at hand.
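As a rough sketch of the comparison used in this approximation, the code below computes the Hamming fraction between per-cluster naive sequence estimates and lists the pairs that fall under a threshold. The naive sequences and the threshold are taken as inputs, since the inference of the naive sequence and the rule for setting the threshold are not reproduced here. The function names, the dictionary layout, and the choice to skip pairs of unequal length are assumptions.

```python
from itertools import combinations


def hamming_fraction(a: str, b: str) -> float:
    """Hamming distance divided by sequence length (equal lengths required)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def merge_candidates(naive_by_cluster, threshold):
    """Return (distance, cluster_a, cluster_b) for pairs closer than `threshold`.

    `naive_by_cluster` maps a cluster name to its naive sequence point estimate.
    Pairs whose naive sequences differ in length are skipped (an assumption).
    """
    pairs = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(naive_by_cluster.items()), 2):
        if len(seq_a) != len(seq_b):
            continue
        distance = hamming_fraction(seq_a, seq_b)
        if distance < threshold:
            pairs.append((distance, name_a, name_b))
    return sorted(pairs)
```

After a merge, the naive sequence of the combined cluster would need to be re-estimated before the next round of comparisons; that step is outside the scope of this sketch.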

In order to achieve further improvements in speed, we can also avoid both complete all-versus-all comparison of the sequences at each step, and calculation of the joint naive sequence for each merged cluster. For this we find the most likely naive sequence for each individual sequence, and then pass the results, together with a dynamically-set clustering threshold, to the clustering functionality of the vsearch program [21]. We call this vsearch partis.

Reconstruction of selected lineages

We have also included a method which, using the full likelihood, reconstructs the clonal family containing a given “seed” sequence. Because clonal families are generally significantly smaller than the total repertoire, this option is much faster than the full-repertoire reconstruction methods. We see this option as being useful when specific sequences are identified as interesting through a binding assay or because they are shared between repertoire samples. This is labeled full partis (seed).

Implementation

This clustering has been implemented as part of continued development of partis (http://github.com/psathyrella/partis). As before, the license is GPL v3, and we have made use of continuous integration and containerization via Docker for ease of use and reproducibility [22]. A Docker image with partis installed is available at https://registry.hub.docker.com/u/psathyrella/partis/.

Results on simulation

In the absence of real data sets with many sequences for which the true annotations and lineage structures are known, we compare these new clustering methods against previous methods using simulated sequences generated as described in [19]. These simulations were done for the heavy chain locus only. We performed comparison both on samples, which we call 1×, which mimic mutation frequencies in data (overall mean frequency of about 10%) and on samples, which we call 4×, with quadrupled branch lengths (overall mean frequency of about 25%) to explore results in a more challenging regime. Per-sequence mutation frequencies are distributed according to the empirical distribution (see [19]). We compare the three partis methods to three methods from the literature. The first, labeled “VJ CDR3 0.9”, is representative of annotation- and distance-based methods which have been used in a number of papers [18, 23–26]. It begins by annotating each individual sequence, and proceeds to group sequences which share the same V and J gene and the same CDR3 length, and have CDR3 sequence similarity above some threshold, which is commonly 0.9 [24]. For this comparison we use partis annotation; for a comparison of annotation methods themselves see [19]. We also compare against Change-O’s clustering functionality [27] fed with annotations from IMGT, with IMGT failures (when it does not return an annotation) classified as singletons. We perform a partial comparison against MiXCR [28]. Since this method does not currently report which sequences go into which clusters, and instead only reports cluster summary statistics, we cannot perform a detailed evaluation. The authors of MiXCR note in personal communication, however, that they plan to report this information in future versions.

We use per-read averages of precision and sensitivity to quantify clustering accuracy. In this context, the precision for a given read is the fraction of sequences in its inferred cluster which are actually in its clonal family, while sensitivity for a given read is the fraction of sequences in its true clonal family that appear in its inferred cluster (details in Methods). We find that partis is much more sensitive than previous methods, at the cost of some loss of precision (Fig 2). The point partis approximate implementation is less specific than the full implementation, while the even faster vsearch approximation loses some precision and some sensitivity.
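A minimal sketch of these per-read metrics follows, assuming that partitions are given as lists of lists of sequence identifiers, that each read counts as a member of its own cluster, and that the F1 score is the harmonic mean of the two averaged quantities. The function name and these conventions are illustrative choices.

```python
def per_read_metrics(true_partition, inferred_partition):
    """Per-read averages of precision and sensitivity, plus their harmonic mean."""
    true_of = {s: frozenset(c) for c in true_partition for s in c}
    inferred_of = {s: frozenset(c) for c in inferred_partition for s in c}
    if true_of.keys() != inferred_of.keys():
        raise ValueError("partitions must contain the same sequences")

    precisions, sensitivities = [], []
    for seq in true_of:
        overlap = len(true_of[seq] & inferred_of[seq])
        # Precision: share of the inferred cluster that is truly clonal.
        precisions.append(overlap / len(inferred_of[seq]))
        # Sensitivity: share of the true clonal family that was recovered.
        sensitivities.append(overlap / len(true_of[seq]))

    precision = sum(precisions) / len(precisions)
    sensitivity = sum(sensitivities) / len(sensitivities)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f1
```

For example, splitting a true family of three into a pair plus a read wrongly joined to an unrelated singleton lowers both averages: over-merging costs precision, and over-splitting costs sensitivity.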

Fig 2. Similarity between inferred and true partitions for the various clustering methods at typical (1×) mutation levels via per-read averages of precision (top left), sensitivity (top right), and their harmonic mean (bottom, called F1 score).

Results are on simulated sequences which span the entire V, D, and J segments; the number of leaves (BCR sequences per clonal family) is distributed geometrically with the indicated mean value. Precision measures the extent to which inferred clusters contain truly clonal sequences, while sensitivity measures the extent to which the entirety of each sequence’s clonal family appears in its inferred cluster.

https://doi.org/10.1371/journal.pcbi.1005086.g002

We investigate these differences in more detail for the first simulation replicate via an intersection matrix with entries equal to the size of the intersection between each of the 40 largest clusters returned by pairs of algorithms (Figs 3 and 4, S6, S7, S8 and S9 Figs). Full partis infers clonal families correctly the majority of the time at typical mutation levels, and in this experiment it incorrectly split a cluster of true size around 45. These results degraded somewhat with the point partis approximation, and somewhat more with the vsearch approximation. The VJ CDR3 0.9 method consistently under-clustered for the largest cluster sizes. The seeded full partis method correctly reconstructed the lineage of interest starting from a randomly sampled sequence, while ignoring all others.

Fig 3. Fraction of sequences per cluster in common with the true partition on simulation for each method at typical (1×) levels of mutation.

For these plots, we took the 40 largest clusters resulting from the given clustering and took their intersection with the 40 largest clusters generated by the simulation. Each non-white square indicates that there was a non-empty intersection between the two clusters; the square is shaded by the size of the clusters’ intersection divided by their mean size. The position of the square shows the relative sizes of the two clusters. Results are shown for the simulation sample in which the size of each clonal family is drawn from a geometric distribution with mean 50 (other values are shown in S6 and S8 Figs).

https://doi.org/10.1371/journal.pcbi.1005086.g003
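The quantity shaded in these intersection matrices can be computed as in the following sketch, which assumes that each partition is a list of lists of sequence identifiers. The function name and the tie-breaking among equally sized clusters (original order is kept) are assumptions.

```python
def intersection_matrix(partition_a, partition_b, n_largest=40):
    """Intersection size divided by mean cluster size, for the largest clusters.

    Rows follow the largest clusters of `partition_a`, columns those of
    `partition_b`, both sorted from largest to smallest.
    """
    top_a = sorted(partition_a, key=len, reverse=True)[:n_largest]
    top_b = sorted(partition_b, key=len, reverse=True)[:n_largest]
    matrix = []
    for cluster_a in top_a:
        set_a = set(cluster_a)
        row = []
        for cluster_b in top_b:
            shared = len(set_a & set(cluster_b))
            mean_size = (len(cluster_a) + len(cluster_b)) / 2
            row.append(shared / mean_size)
        matrix.append(row)
    return matrix
```

An entry of 1.0 means the two clusters are identical, while a zero entry corresponds to a white square in the figures.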

Fig 4. Fraction of sequences per cluster in common with the true partition on simulation for each method with high (4×) mutation.

Plot layout as in Fig 3. Results are shown for the simulation sample in which the size of each clonal family is drawn from a geometric distribution with mean 50 (other values are shown in S7 and S9 Figs).

https://doi.org/10.1371/journal.pcbi.1005086.g004

In order to understand performance on the many smaller clusters and to get a simpler overall picture, we also compared cluster size distributions for the various methods with the simulated distribution (Figs 5 and 6, S1 and S2 Figs). Here we can see that partis is able to accurately infer the true cluster size in a variety of regimes, whereas other methods tend to under-merge clusters of all sizes.

Fig 5. True and inferred cluster size distributions at normal (1×) mutation levels for each of the methods for geometrically distributed simulated cluster sizes with various means.

Results are the mean of three simulated samples with 1000 sequences each.

https://doi.org/10.1371/journal.pcbi.1005086.g005

Fig 6. True and inferred cluster size distributions at high (4×) mutation levels for each of the methods for geometrically distributed simulated cluster sizes with various means.

Results are the mean of three simulated samples with 1000 sequences each.

https://doi.org/10.1371/journal.pcbi.1005086.g006

In order to further understand the source of these differences, we also compare results against two methods of generating incorrect partitions starting from the true partition, which we call synthetic partitions (S3 and S4 Figs). The first, called synthetic 60% singleton is generated from the true partition by splitting 60% of the sequences into singleton clusters. The second, called synthetic neighbor 0.03, merges together true clonal families which have true naive sequences closer than 0.03 in Hamming distance divided by sequence length. We find that the performance of synthetic 60% singleton tracks that of the VJ CDR3 method, while the performance of synthetic neighbor 0.03 tracks that of partis.
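Both synthetic partitions can be generated along the following lines. This sketch makes several assumptions that the text does not specify: the sequences to split off are chosen uniformly at random with a fixed seed, neighbor merging is applied transitively, and true naive sequences of unequal length are never treated as neighbors. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def synthetic_singletons(partition, fraction=0.6, seed=0):
    """Split a random `fraction` of all sequences off into singleton clusters."""
    rng = random.Random(seed)
    all_seqs = [s for cluster in partition for s in cluster]
    chosen = set(rng.sample(all_seqs, round(fraction * len(all_seqs))))
    kept = [[s for s in cluster if s not in chosen] for cluster in partition]
    return [c for c in kept if c] + [[s] for s in sorted(chosen)]


def synthetic_neighbor(partition, naive_seqs, threshold=0.03):
    """Merge true clusters whose naive sequences are closer than `threshold`.

    `naive_seqs[i]` is the true naive sequence of `partition[i]`. Distance is
    Hamming distance divided by sequence length.
    """
    parent = list(range(len(partition)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(len(partition)), 2):
        a, b = naive_seqs[i], naive_seqs[j]
        if len(a) != len(b):
            continue  # assumption: unequal lengths are never neighbors
        if sum(x != y for x, y in zip(a, b)) / len(a) < threshold:
            parent[find(i)] = find(j)

    merged = defaultdict(list)
    for i, cluster in enumerate(partition):
        merged[find(i)].extend(cluster)
    return list(merged.values())
```

The first function imitates a method that under-merges, and the second imitates one whose only errors are merges of families with nearly identical naive sequences.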

Finally, to investigate the performance of the seeded full partis method, we calculate the precision and sensitivity of this method on a number of widely varying sample sizes (Fig 7). For these simulations we used a Zipf (power-law) distribution of cluster sizes with exponent 2.3, and randomly selected one seed sequence from a randomly selected large cluster. We find that seeded partis frequently obtains very high sensitivity, although precision decreases as sample size increases. This precision decrease is from incorrect merges of clusters. We have manually checked these incorrect merges, and found that the true (i.e. simulated) naive sequences of clusters which are incorrectly merged with the seeded cluster typically differ by one to six bases. Because these differences occur either within the bounds of the true eroded D segment, or within the true non-templated insertions, it is difficult to distinguish them from somatic hypermutation. This echoes the observation that partis precision is driven by the presence or absence of clusters which stem from different rearrangements, but which are very similar in naive sequence (compare partis and synthetic neighbor 0.03 in S5 Fig).

Fig 7. Similarity between inferred and true partitions for seed partis via per-read averages of precision (left) and sensitivity (right) with increasing sample size.

There is one point at each indicated x value except for one hundred thousand, five hundred thousand, and one million, which have three points each. Results are shown for a sample with cluster sizes distributed as a Zipf (power law) distribution with exponent 2.3, given a randomly selected seed sequence from a randomly selected large cluster.

https://doi.org/10.1371/journal.pcbi.1005086.g007

Insertion and deletion mutations

In order to handle insertion-deletion (indel) mutations which occur during somatic hypermutation, we have implemented a heuristic method in the preliminary Smith-Waterman alignment step in partis. In short, this works by “reversing” inferred indel mutations in germline-encoded regions and proceeding with the clustering algorithm. We find that partis performance is typically unaffected when indels occur in non-CDR3 germline-encoded regions, although performance suffers when indels occur in the CDR3 (Fig 8). This is because indel mutations in the CDR3 are quite difficult to distinguish from insertions and deletions stemming from the VDJ rearrangement process using indel-handling schemes (such as ours) that only take one sequence at a time into account.

Fig 8. Overall clustering quality, parameterized by the harmonic mean of precision and sensitivity (the F1 score), in the presence of indel mutations.

For these simulations, half of the simulated sequences have a single indel, whose position is distributed evenly either in the V segment (left, specifically between position 10 and the conserved cysteine) or in the CDR3 (right). Indels substantially decrease performance only when they occur within the CDR3. Results are the mean of 3 samples of 1000 sequences each.

https://doi.org/10.1371/journal.pcbi.1005086.g008

Application to data

In order to understand the difference this method makes on real data, we applied partis and the other algorithms to subjects in the Adaptive data set from [29] used in previous publications [19, 30, 31], as well as the data set from [24], which we will call the “Vollmers” data set. These data sets were Illumina sequenced via amplicons covering the heavy chain CDR3, and thus do not have complete V or J sequences. Especially in the case of the V region for the Vollmers data, it is not possible to confidently identify the germline V gene for each of the BCR sequences. Thus, these data sets make for an interesting comparison between methods (such as VJ CDR3) which require single germline gene identifications and our method, which integrates over such identifications. Results are shown for Adaptive subject A (Fig 9), and for a subject from the Vollmers data set (Fig 10). The rest may be found on figshare at http://figshare.com/s/9b85e4ac54d011e5bd3e06ec4b8d1f61. Note that the identifiers shown for the Vollmers data are an obfuscated version of the original identifiers in the data; contact the authors for more details. These results are not presented to make any strong statement about the true cluster size distribution, the correctness of which cannot be independently evaluated, but rather to show that the partis results are different from those of other methods on real data, as seen under simulation.

Fig 9. Results of the various methods on data from subject A in the Adaptive data set.

We show cluster size distributions (top) and intersection matrices, which show the fraction of sequences per cluster in common between the various methods. Results are on a randomly-chosen subsample of 20,000 sequences.

https://doi.org/10.1371/journal.pcbi.1005086.g009

Fig 10. Results of the various methods on data from subject 15-12 in the Vollmers data set.

We show cluster size distributions (top) and intersection matrices, which show the fraction of sequences per cluster in common between the various methods. Results are on a randomly-chosen subsample of 20,000 sequences.

https://doi.org/10.1371/journal.pcbi.1005086.g010

When we applied the various methods to a randomly chosen set of 20,000 sequences from two different sets, we found that the various methods agree that both samples are dominated by singletons, but there is substantial discord at the high end of the distribution, especially in Adaptive subject A (Figs 9 and 10). These differences in composition are examined in more detail using cluster intersection matrices. The cluster size distribution inferred by partis approximately follows a power-law, with exponent about 2.3.

Adaptive subject A (Fig 9) has mutation levels two and a half times higher than Vollmers subject 15-12 (Fig 10), making inference more challenging for A. Both of these data sets consist of shorter sequences than the simulated sequences, which contain the entire V and J regions. Reads in the Adaptive samples are 130 base pairs (losing about two thirds of the V and one half of the J), while those in the Vollmers data set vary in length, but typically span all of the J but only 20 to 30 bases in the V.

Time required

Likelihood-based clustering using partis is computationally demanding, though within a range applicable to real questions given appropriate computing power (Fig 11). On a computing cluster with about 25 8-core machines, full and point partis can cluster ten thousand sequences in 4 to 7 hours, while vsearch partis can cluster one hundred thousand sequences in 4 hours. Our implementation of “VJ CDR3 0.9” used partis annotation, but this approach could be made much faster by using a fast method for annotation [28, 32]. Time required can also vary by an order of magnitude depending on the structure of the sample (cluster size and mutation level).

Fig 11. Run time for the various methods.

Results are from running on a cluster with about 25 8-core machines. The time required for Change-O is difficult to measure, as the sequences are first annotated by manual submission to the IMGT website, which takes from 1–6 days to return results. The actual clustering time for Change-O once these annotations are obtained is very small, on par with the MiXCR results shown in this plot. Time required also varies by an order of magnitude depending on the structure of the sample (cluster size and mutation level).

https://doi.org/10.1371/journal.pcbi.1005086.g011

Discussion

We have developed an algorithm to infer clonal families using a likelihood-based framework. Although the framework does take annotation information into account by using a VDJ-based HMM, the algorithm is distinguished from other clustering methods in that it does not fix a single annotation first and then use that annotation for downstream steps. Instead we find that by integrating over annotated ancestor sequences using an HMM, we are able to obtain better clonal family inference than with the current common practice of rigidly inferring VJ annotation and then clustering on HCDR3 identity for heavy chain sequences. Our simulations show that existing algorithms frequently do not sufficiently cluster sequences which sit in the same clonal family. Our application to real data shows that the partis algorithms using our default clustering thresholds return more large clusters on two real data sets, indicating that this difference in clustering is not simply an artifact of our simulation setup.

The performance differences between our various approximate algorithms indicate the sources of partis’ improved performance. The reasonably good performance of the point partis variant shows the importance of clustering on inferred naive sequences rather than observed sequences and inferring these naive sequences with an accurate probabilistic method. Furthermore, the difference between point and full partis is some measure of the importance of integrating out uncertainty in annotated ancestor sequences.

We find that partis’ main weakness is in separating out clusters with highly similar naive sequences. Indeed, its performance tracks a simulated method that merges clonal families with true (i.e. simulated) naive sequences that are closer than 3% in nucleotides, in simulations with about 10% divergence from the naive sequence. Although the VDJ rearrangement process generates a very diverse repertoire, biases in gene family use and other rearrangement parameters mean that pairs of highly similar naive sequences are frequently generated. This may indicate an inherent limitation in clonal family inference methods that only use data from heavy chain.
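As an illustration of the simulated merging method mentioned above, the following Python sketch merges families whose naive sequences differ by less than a given Hamming fraction. It is a minimal sketch rather than the code used for the comparison: the function names, the use of single-linkage merging, and the restriction to equal-length sequences are all assumptions made here for illustration.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length, non-empty sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def merge_close_families(naive_seqs, threshold=0.03):
    """Merge families whose naive sequences are closer than `threshold`.

    `naive_seqs` maps a (hypothetical) family name to its naive sequence.
    Merging is single-linkage: families are joined whenever any chain of
    pairwise distances below the threshold connects them.
    """
    parent = {name: name for name in naive_seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(naive_seqs, 2):
        sa, sb = naive_seqs[a], naive_seqs[b]
        # Assumption: sequences of different length are never merged.
        if len(sa) == len(sb) and hamming_fraction(sa, sb) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in naive_seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

With the 3% threshold, two families whose naive sequences differ at 2 of 100 positions would be merged, while a family differing at 8 or more positions from both would remain separate.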

Our method builds on previous work for doing likelihood-based analysis of BCR sequences. In particular, we are indebted to Tom Kepler for initiating the use of HMMs in BCR sequence analysis [10] and for developing likelihood-based methods to infer unmutated common ancestor sequences while integrating over rearrangement uncertainty [17, 18, 33–35].

We did not compare to several related methods that have been described in the literature. ClonalRelate [36] is an extension to the “VJ CDR3” method that allows some flexibility in requiring V and J calls to be the same by combining various mismatch penalties into a distance that is used for agglomerative clustering. IMSEQ [32] is a recent method which is reported to be quite fast; however, the current version appears mainly aimed toward T cell receptors, as it does not handle somatic hypermutation. As it clusters based on V and J genes and 100% CDR3 similarity, it is equivalent to the annotation-based method described above, except with a threshold inappropriate to B cells. Cloanalyst performs joint reconstruction of an annotated ancestor sequence and a phylogenetic tree given a collection of sequences assumed to form a clonal family [17]. Immunitree apparently uses a Dirichlet process mixture model for clustering; however, the algorithm is only fully described in a PhD thesis [14] and does not appear to be publicly available (note that https://github.com/laserson/vdj performs straightforward single-linkage clustering and is in fact written by a sibling of the Immunitree author). IgSCUEAL [37] is a recent method that performs annotation and clustering using a phylogenetic approach. Its clustering algorithm, however, is not part of the public distribution and is apparently undergoing revision.
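To make the role of the CDR3 threshold concrete, the following Python sketch groups sequences that share V gene, J gene, and CDR3 length, then links them by CDR3 identity. It is a hedged illustration of the general annotation-based idea and does not reproduce the implementation of any of the tools named above; the record layout, function names, and single-linkage rule are assumptions, and the gene names in the usage note are only examples.

```python
from collections import defaultdict


def cdr3_identity(a, b):
    """Fraction of matching positions between two equal-length CDR3 sequences."""
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_vj_cdr3(records, min_identity=1.0):
    """Cluster (name, v_gene, j_gene, cdr3) records.

    Records are first binned by (V gene, J gene, CDR3 length); within a bin,
    a sequence joins every cluster containing a CDR3 at or above
    `min_identity`, which merges those clusters (single linkage).
    """
    bins = defaultdict(list)
    for name, v_gene, j_gene, cdr3 in records:
        bins[(v_gene, j_gene, len(cdr3))].append((name, cdr3))

    clusters = []
    for members in bins.values():
        bin_clusters = []  # each cluster is a list of (name, cdr3) pairs
        for name, cdr3 in members:
            linked, unlinked = [], []
            for cluster in bin_clusters:
                close = any(cdr3_identity(cdr3, other) >= min_identity
                            for _, other in cluster)
                (linked if close else unlinked).append(cluster)
            merged = [(name, cdr3)] + [m for cluster in linked for m in cluster]
            bin_clusters = unlinked + [merged]
        clusters.extend({n for n, _ in cluster} for cluster in bin_clusters)
    return clusters
```

With `min_identity=1.0`, two sequences from the same rearrangement that differ by a single somatic mutation in the CDR3 are split into separate clusters, which is why a 100% threshold suits T cell receptors but not B cell receptors; lowering the threshold (for example to 0.9) would group them.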

There are several opportunities to improve partis. First, our current approach requires likelihood ratios to exceed a value based on cluster size; these cluster sizes are based on observing distributions of likelihood ratios under simulation. A more principled approach would be preferable. Second, our approach to insertion-deletion mutations in affinity maturation only uses one sequence at a time. Thus it has an inherent difficulty differentiating between mutations in the course of affinity maturation versus insertion-deletion events that are part of VDJ rearrangement. Third, our current code handles only the heavy chain alone or the light chain alone. Extending the work to paired heavy and light chain BCR data is conceptually straightforward, although it will require additional software engineering. Fourth, HMMs have certain inherent limitations, stemming from the central Markov assumption that the current state is ignorant of all states except for the previous one. As reviewed in [19], this limits the scope of events that can be modeled using partis, excluding correlation between different segments of the BCR [31, 38, 39], palindromic N-additions [40], complex strand interaction events [41, 42], or the appearance of tandem D segments [43]. Some of these limitations could be avoided by using Conditional Random Fields (reviewed in [44]), and although linear-chain conditional random fields enjoy many of the attractive computational properties of HMMs, this flexibility will come with a computational cost. Fifth, partis does not attempt to infer germline genotype, as do [45], and so treats genes and alleles on an equal footing. We will treat this as a model-based inference problem in future development. Sixth, we will continue to refine heuristics to provide the accuracy of the full likelihood-ratio calculation with minimal compute time. We note, for instance, that a small decrease in the lower naive Hamming fraction threshold substantially improves performance for the seed partis simulation compared to that shown here (in Fig 7).

In additional future work, we will explore opportunities to combine clonal family inference and phylogenetics to obtain inference of complete B cell lineages. This could potentially take the form of a phylo-HMM [46], although a more straightforward approach would be to take the product of a phylogenetic likelihood and a rearrangement likelihood [17]. For example, one might use HMM-based clustering as is described here with a high likelihood ratio cutoff to obtain a conservative collection of clusters, and then a phylogenetic criterion to direct further clustering.
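One way to write down the product form mentioned above is sketched here. The notation is an assumption introduced for illustration and is not taken from [17]: $x_1, \dots, x_k$ are the sequences in a candidate cluster, $a$ is an annotated ancestor (naive) sequence, and $T$ is a phylogenetic tree rooted at $a$.

```latex
\[
  \mathcal{L}(a, T \mid x_1, \dots, x_k)
  \;=\;
  \underbrace{P_{\mathrm{rearr}}(a)}_{\text{rearrangement likelihood}}
  \;\times\;
  \underbrace{P_{\mathrm{phylo}}(x_1, \dots, x_k \mid a, T)}_{\text{phylogenetic likelihood}}
\]
```

Under this reading, the first factor scores how plausible the ancestor is as the product of VDJ rearrangement, and the second scores how well the tree explains the observed sequences as descendants of that ancestor.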

In addition to these methodological improvements, we will also apply partis to a variety of data sets for validation and to learn about the structure of natural repertoires. For validation, there are some data sets, e.g. [47], which due to experimental setup have sequences known to make a clonal lineage. New microfluidics technology applied to BCR sequencing also gives paired heavy and light chain data [48, 49]; although a single heavy chain clonal lineage can have light chains from independent rearrangement events, this type of data does provide further evidence of clonality for validation of clonal family inference procedures. In addition to this sort of validation, there is now an abundance of data sets that can be used to characterize the size distribution of the clonal families in various immune states, such as health, immunization, and disease.

As a final note, partis works to solve a challenging likelihood-based inference problem. We recognize that in contrast to existing heuristic approaches based on sequence identity, our software is quite computationally demanding. In this first paper we have developed the framework and overall approach, as well as many computational optimizations. This optimization work is ongoing, and there remain many avenues for improvement. As a comparison, likelihood-based phylogenetic inference has taken two decades of optimization to scale to tens of thousands of sequences at a time with approximate algorithms [50]. We are continually making improvements to the algorithm to make it scale to larger data sets and are committed to building algorithms that scale to the size of contemporary data sets. Although such algorithms may end up being rather different than this version of partis, we believe that likelihood-based algorithms will provide a solid foundation for large-scale molecular evolution studies of B cell maturation.