Author:
Karim Mohimenul,Abid Rashid
Abstract
AbstractSpecific gene regions in DNA, such as cytochrome c oxidase I (COI) in animals, are defined as DNA barcodes and can be used as identifiers to distinguish species. The standard length of a DNA barcode is approximately 650 base pairs (bp). However, because of the challenges associated with sequencing technologies and the unavailability of high-quality genomic DNA, it is not always possible to obtain the full-length barcode sequence of an organism. Recent studies suggest that mini-barcodes, which are shorter (100-300 bp) barcode sequences, can contribute significantly to species identification. Among various methods proposed for the identification task, supervised machine learning methods are effective. However, any prior work indicating the efficacy of mini-barcodes in species identification under a machine learning approach is elusive to find. In this study, we analyzed the effect of different barcode lengths on species identification using supervised machine learning and proposed a general approximation of the required length of the minibarcode. Since Naïve Bayes is seen to generally outperform other supervised methods in species identification in other studies, we implemented this classifier and showed the effectiveness of the mini-barcode by demonstrating the accuracy responses obtained after varying the length of the DNA barcode sequences.
Publisher
Cold Spring Harbor Laboratory