ntsm: an alignment-free, ultra-low-coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection

Published: 2024 Volume: 13
ISSN:2047-217X
Container-title:GigaScience
language:en

Author:

Chu Justin¹², Rong Jiazhen³, Feng Xiaowen¹², Li Heng¹²

Affiliation:

1. Dana-Farber Cancer Institute, Department of Data Sciences, Boston, MA 02215, USA

2. Harvard Medical School, Department of Biomedical Informatics, Boston, MA 02115, USA

3. Perelman School of Medicine, University of Pennsylvania, Genomics and Computational Biology Graduate Program, Philadelphia, PA 19104, USA

Abstract

Background: Due to human error, sample swapping remains a common problem in large cohort studies with heterogeneous data types (e.g., a mix of Oxford Nanopore Technologies, Pacific Biosciences, and Illumina data). At present, all sample swap detection methods require alignment, positional sorting, and indexing of the data in order to compare sample similarity, steps that are costly and, if the data are only used for genome assembly, otherwise unnecessary. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

Findings: The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio–based test. The per-sample error rate and coverage bias (i.e., missing sites) can also be estimated from this information. These estimates indicate whether a spatially indexed principal component analysis (PCA)–based prescreening method can be used, which can greatly speed up analysis by avoiding exhaustive all-to-all comparisons.

Conclusions: Because this tool processes raw data, is faster than alignment, and can be used on very low-coverage data, it can save substantial computational resources in standard quality control (QC) pipelines. It is robust enough to be used on different sequencing data types, which is important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample swap detection, this method also provides information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.
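The general idea of comparing samples through k-mer counts at variant sites and a likelihood ratio can be illustrated with a small Python sketch. This is not the ntsm implementation: the function names, the three-genotype binomial model, the fixed error rate, and the generalized likelihood ratio used here are all simplifying assumptions chosen for illustration, and the abstract does not specify the actual statistic.

```python
import math

# Assumption: three diploid genotypes, expressed as the expected fraction
# of reads that carry the alternate allele (hom-ref, het, hom-alt).
GENOTYPE_ALT_FRACTION = (0.0, 0.5, 1.0)


def count_site_kmers(reads, ref_kmer, alt_kmer):
    """Count exact occurrences of one site's ref and alt k-mers in raw reads."""
    k = len(ref_kmer)
    ref = alt = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            window = read[i:i + k]
            if window == ref_kmer:
                ref += 1
            elif window == alt_kmer:
                alt += 1
    return ref, alt


def site_log_likelihood(ref, alt, alt_fraction, error_rate):
    """Binomial log-likelihood of the counts under one genotype.

    The binomial coefficient is omitted because it cancels in the ratio.
    error_rate must lie strictly between 0 and 0.5 so both logs are defined.
    """
    p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
    return alt * math.log(p_alt) + ref * math.log(1 - p_alt)


def log_likelihood_ratio(counts_a, counts_b, error_rate=0.01):
    """Sum over sites of log L(shared genotype) - log L(independent genotypes).

    The result is at most 0. Values near 0 are consistent with both samples
    coming from one individual; strongly negative values suggest a mismatch.
    Sites with zero coverage in a sample contribute nothing for that sample.
    """
    total = 0.0
    for (ref_a, alt_a), (ref_b, alt_b) in zip(counts_a, counts_b):
        ll_a = [site_log_likelihood(ref_a, alt_a, g, error_rate)
                for g in GENOTYPE_ALT_FRACTION]
        ll_b = [site_log_likelihood(ref_b, alt_b, g, error_rate)
                for g in GENOTYPE_ALT_FRACTION]
        shared = max(a + b for a, b in zip(ll_a, ll_b))
        independent = max(ll_a) + max(ll_b)
        total += shared - independent
    return total


# Toy per-site (ref, alt) counts: sample_b is a low-coverage copy of sample_a,
# sample_c has opposite homozygous genotypes at the first two sites.
sample_a = [(5, 0), (0, 4), (3, 3)]
sample_b = [(2, 0), (0, 1), (1, 1)]
sample_c = [(0, 5), (4, 0), (3, 3)]

same_score = log_likelihood_ratio(sample_a, sample_b)
swap_score = log_likelihood_ratio(sample_a, sample_c)
```

In this toy setup `same_score` stays at zero because every site supports a shared genotype, while `swap_score` is clearly negative. A real tool would also need a calibrated threshold and an error rate estimated from the data, neither of which this sketch attempts.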

Funder

National Human Genome Research Institute

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/gigascience/article-pdf/doi/10.1093/gigascience/giae024/58080777/giae024.pdf

