Sandy: A user-friendly and versatile NGS simulator to facilitate sequencing assay design and optimization

Published: 2023-08-27

Author:

Miller Thiago L. A., Conceição Helena B., Mercuri Rafael L., Santos Felipe R. C., Barreiro Rodrigo, Buzzo José Leonel, Rego Fernanda O., Guardia Gabriela, Galante Pedro A. F.

Abstract

Next-generation sequencing (NGS) is currently the gold standard technique for large-scale genome and transcriptome studies. However, the downstream processing of NGS data is a critical bottleneck that requires difficult decisions regarding data analysis methods and parameters. Simulated or synthetic NGS datasets are practical and cost-effective alternatives for overcoming these difficulties. Simulated NGS datasets have known true values and provide a standardized scenario for driving the development of data analysis methodologies and tuning cut-off values. Although tools for simulating NGS data are available, they have limitations in terms of their overall usability and documentation. Here, we present Sandy, an open-source simulator that generates synthetic reads that mimic DNA or RNA next-generation sequencing on the Illumina, Oxford Nanopore, and Pacific Biosciences platforms. Sandy is designed to be user-friendly, computationally efficient, and capable of simulating data resembling a wide range of features of real NGS assays, including sequencing quality, genomic variations, and gene expression profiles per tissue. To demonstrate Sandy’s versatility, we used it to address two critical questions in designing an NGS assay: (i) How many reads should be sequenced to ensure unbiased analysis of gene expression in an RNA sequencing run? (ii) What is the lowest genome coverage required to identify most (90%) of the single nucleotide variants and structural variations in whole-genome sequencing? In summary, Sandy is an ideal tool for assessing and validating pipelines for processing, optimizing results, and defining the costs of NGS assays. Sandy runs on Linux, macOS, and Microsoft Windows and can provide feasible results, even on personal computers. Availability: Sandy is freely available at https://galantelab.github.io/sandy.

Publisher

Cold Spring Harbor Laboratory

References: 32 articles.

1. A broad survey of DNA sequence data simulation tools; Brief Funct Genomics, 2020

2. Characteristics of 454 pyrosequencing data--enabling realistic simulation with flowsim

3. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery; BMC Genomics, 2022

4. Fitting the Negative Binomial Distribution to Biological Data

5. Near-optimal probabilistic RNA-seq quantification