Evaluating Plant Gene Models Using Machine Learning

Published:2022-06-20 Issue:12 Volume:11 Page:1619
ISSN:2223-7747
Container-title:Plants
language:en
Short-container-title:Plants

Author:

Upadhyaya Shriprabha R., Bayer Philipp E., Tay Fernandez Cassandria G., Petereit Jakob, Batley Jacqueline, Bennamoun Mohammed, Boussaid Farid, Edwards David

Abstract

Gene models are regions of the genome that can be transcribed into RNA and translated into proteins, or that belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false-positive annotations. To support the calling of confident conserved gene models and minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach to classify potential low-confidence gene models using 14 gene-based and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high-confidence) and non-conserved (low-confidence) annotated genes from the published Pisum sativum Cameor genome. These features were used to train eXtreme Gradient Boosting (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F1 score of 0.91–0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein-based and gene-based features can be used to build accurate models for gene prediction, with applications in supporting future gene annotation processes.
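
As a rough illustration of what "sequence-based features" and an F1 score look like in practice, the sketch below computes a few simple nucleotide and amino acid descriptors for one gene model and evaluates F1 from confusion counts. The function names and the chosen features (length, GC content, amino acid fractions) are assumptions made for this example; the abstract does not enumerate the 14 gene and 41 protein characteristics used by Truegene, and no classifier is trained here.

```python
from collections import Counter
from typing import Dict


def gc_content(nucleotides: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty)."""
    seq = nucleotides.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def amino_acid_fractions(protein: str) -> Dict[str, float]:
    """Relative frequency of each residue, ignoring a trailing stop symbol."""
    seq = protein.upper().rstrip("*")
    if not seq:
        return {}
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def gene_model_features(cds: str, protein: str) -> Dict[str, float]:
    """Hypothetical feature vector for one gene model (illustrative only)."""
    features = {
        "cds_length": float(len(cds)),
        "gc_content": gc_content(cds),
        "protein_length": float(len(protein.rstrip("*"))),
    }
    # One feature per observed residue, e.g. "frac_K" for lysine.
    for aa, frac in amino_acid_fractions(protein).items():
        features["frac_" + aa] = frac
    return features


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Feature dictionaries of this kind, one per gene model and labelled as conserved or non-conserved, could then be passed to a gradient-boosted tree classifier; which features matter most would be a question for tools such as SHAP.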

Funder

Australian Research Council

Publisher

MDPI AG

Subject

Plant Science; Ecology; Ecology, Evolution, Behavior and Systematics

Link

https://www.mdpi.com/2223-7747/11/12/1619/pdf

References: 41 articles.

1. Representation and participation across 20 years of plant genome sequencing

2. Plant pan-genomes are the new reference

3. Genes and gene models, an important distinction

4. What is a gene, post-ENCODE? History and updated definition

5. The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants

Cited by 4 articles.

1. Genomics‐based plant disease resistance prediction using machine learning; Plant Pathology; 2024-08-29

2. Technological Development and Advances for Constructing and Analyzing Plant Pangenomes; Genome Biology and Evolution; 2024-04

3. Plant Protein Classification Using K-mer Encoding; Computational Intelligence and Network Systems; 2023-12-16

4. Unravelling inversions: Technological advances, challenges, and potential impact on crop breeding; Plant Biotechnology Journal; 2023-11-14