Abstract
Feature extraction methods, such as k-mer-based methods, have recently come to play a significant role in the classification and analysis of metagenomics data. However, they face several bottlenecks, including performance limitations, high memory consumption, and computational overhead. To address these challenges, we developed a new feature extraction and sequence profiling method for DNA/RNA sequences, called PC-mer, which takes advantage of the physicochemical properties of nucleotides. Compared with k-mer profiling methods, PC-mer reduces memory usage by a factor of 2^k while improving metagenomics classification performance at various taxonomic levels, for both machine learning-based and computational-based methods, and it also achieves a speedup of more than 1000x in the training phase. Evaluating ML-based PC-mer on various datasets confirms that it can achieve 100% accuracy in classifying samples at the class, order, and family levels. Compared with k-mer-based classification methods, it also improves genus-level classification accuracy by more than 14% on the shotgun dataset (achieving an accuracy of 97.5%) and by more than 5% on the amplicon dataset (achieving an accuracy of 98.6%). Building on these improvements, we provide two PC-mer-based tools that can replace the popular k-mer-based tools: one for classifying and another for comparing metagenomics data.
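To illustrate why grouping nucleotides by a physicochemical property can shrink a sequence profile, the following minimal Python sketch compares a standard k-mer count profile with a profile built over a two-letter reduced alphabet. It is not the PC-mer algorithm itself: the purine/pyrimidine grouping, the function names `kmer_profile` and `reduced_profile`, and the example sequence are assumptions chosen only for illustration. A four-letter alphabet yields 4^k possible words, whereas a two-letter alphabet yields 2^k.

```python
from collections import Counter
from itertools import product

# Assumption: purine (R) / pyrimidine (Y) is used here only as one example
# of a physicochemical grouping; the abstract does not specify the groups.
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def kmer_profile(seq, k, alphabet="ACGT"):
    """Count overlapping k-mers; the profile has len(alphabet)**k entries."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ("".join(w) for w in product(alphabet, repeat=k))
    return {word: counts.get(word, 0) for word in words}


def reduced_profile(seq, k, groups=PURINE_PYRIMIDINE):
    """Recode each base by its group, then count k-mers over the 2-letter alphabet."""
    recoded = "".join(groups[base] for base in seq)
    return kmer_profile(recoded, k, alphabet="RY")


seq = "ACGTACGTAG"  # hypothetical example sequence
k = 3
full = kmer_profile(seq, k)
reduced = reduced_profile(seq, k)
print(len(full), len(reduced))  # 64 8
```

In this sketch both profiles count the same number of sequence windows, but the reduced profile stores 2^k entries instead of 4^k, so the gap widens quickly as k grows.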
Publisher
Public Library of Science (PLoS)