Anchor Clustering for million-scale immune repertoire sequencing data

Published: 2023-07-03

Author:

Chang Haiyang¹, Ashlock Daniel A.¹, Graether Steffen P.¹, Keller Stefan M.²

Affiliation:

1. University of Guelph

2. University of California Davis

Abstract

Background: The clustering of immune repertoire data is challenging due to the computational costs associated with a very large number of pairwise sequence comparisons. To overcome this limitation, we developed Anchor Clustering, an unsupervised clustering method designed to identify similar sequences from millions of antigen receptor gene sequences. First, a Point Packing algorithm is used to identify a set of maximally spaced anchor sequences. Then, the genetic distance of the remaining sequences to all anchor sequences is calculated and transformed into distance vectors. Finally, sequences are clustered using unsupervised clustering. This process is repeated iteratively until the resulting clusters are small enough so that pairwise distance comparisons can be made.

Results: Our results demonstrate that Anchor Clustering is faster than existing pairwise comparison clustering methods while providing similar clustering quality. With its flexible, memory-saving strategy, Anchor Clustering is capable of clustering millions of antigen receptor gene sequences in just a few minutes.

Conclusions: This method enables the meta-analysis of immune-repertoire data from different studies and could contribute to a more comprehensive understanding of the immune repertoire data space.
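The steps described in the abstract (choose spaced anchors, build distance vectors, cluster, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes equal-length sequences and Hamming distance as the "genetic distance", uses greedy farthest-point selection as a simple stand-in for the Point Packing algorithm, and uses nearest-anchor grouping as a placeholder for the unsupervised clustering step. All function names and the parameters `k` and `max_size` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b))


def pick_anchors(seqs, k):
    """Greedy farthest-point selection: a stand-in for Point Packing."""
    anchors = [seqs[0]]
    # Distance from every sequence to its nearest anchor chosen so far.
    nearest = [hamming(s, anchors[0]) for s in seqs]
    while len(anchors) < k and max(nearest) > 0:
        idx = max(range(len(seqs)), key=nearest.__getitem__)
        anchors.append(seqs[idx])
        nearest = [min(d, hamming(s, seqs[idx])) for d, s in zip(nearest, seqs)]
    return anchors


def distance_vectors(seqs, anchors):
    """Represent each sequence by its distances to all anchors."""
    return [tuple(hamming(s, a) for a in anchors) for s in seqs]


def coarse_groups(seqs, anchors):
    """Group by nearest anchor (placeholder for the clustering step)."""
    groups = {}
    for s, vec in zip(seqs, distance_vectors(seqs, anchors)):
        groups.setdefault(vec.index(min(vec)), []).append(s)
    return list(groups.values())


def anchor_cluster(seqs, k=2, max_size=3):
    """Split recursively until each group is small enough for pairwise work."""
    seqs = list(seqs)
    if len(seqs) <= max_size or len(set(seqs)) == 1:
        return [seqs]
    groups = coarse_groups(seqs, pick_anchors(seqs, k))
    if len(groups) == 1:  # no further split possible (e.g. k == 1)
        return groups
    return [c for g in groups for c in anchor_cluster(g, k, max_size)]


clusters = anchor_cluster(["AAAA", "AAAT", "GGGG", "GGGC"], k=2, max_size=2)
```

In this sketch each sequence is compared only to the `k` anchors at each level, rather than to every other sequence; the small groups that remain are the ones on which exhaustive pairwise comparison would then be affordable.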

Publisher

Research Square Platform LLC

References (30 articles)

1. History, applications, and challenges of immune repertoire research;Liu X;Cell biology and,2018

2. Commonality despite exceptional diversity in the baseline human antibody repertoire;Briney B;Nature,2019

3. Murphy K. Weaver C. Janeway’s immunobiology. New York: Garland Science. Taylor & Francis Group; 2016.

4. Vdjdb: a curated database of t-cell receptor sequences with known antigen specificity;Shugay M;Nucleic Acids Res,2018

5. The immune epitope database (iedb): 2018 update;Vita R;Nucleic Acids Res,2019