Abstract
AbstractAn accurate genome at the chromosome level is the key to unraveling the mysteries of gene function and unlocking the mechanisms of disease. Irrespective of the sequencing methodology adopted, Hi-C aided scaffolding serves as a principal avenue for generating genome assemblies at the chromosomal level. However, the results of such scaffolding are often flawed and require extensive manual refinement. In this paper, we introduce AutoHiC, an innovative deep learning-based tool designed to identify and rectify genome assembly errors. Diverging from conventional approaches, AutoHiC harnesses the power of high-dimensional Hi-C data to enhance genome continuity and accuracy through a fully automated workflow and iterative error correction mechanism. AutoHiC was trained on Hi-C data from more than 300 species (approximately five hundred thousand interaction maps) in DNA Zoo and NCBI. Its confusion matrix results show that the average error detection accuracy is over 90%, and the area under the precision-recall curve is close to 1, making it a powerful error detection capability. The benchmarking results demonstrate AutoHiC’s ability to substantially enhance genome continuity and significantly reduce error rates, providing a more reliable foundation for genomics research. Furthermore, AutoHiC generates comprehensive result reports, offering users insights into the assembly process and outcomes. In summary, AutoHiC represents a breakthrough in automated error detection and correction for genome assembly, effectively promoting more accurate and comprehensive genome assemblies.
Publisher
Cold Spring Harbor Laboratory