Abstract
The original CHESS database of human genes was assembled from nearly 10,000 RNA sequencing experiments in 53 human body sites produced by the Genotype-Tissue Expression (GTEx) project, and then augmented with genes from other databases to yield a comprehensive collection of protein-coding and noncoding transcripts. The construction of the new CHESS 3 database employed improved transcript assembly algorithms, a new machine learning classifier, and protein structure predictions to identify genes and transcripts likely to be functional and to eliminate those that appeared more likely to represent noise. The new catalog contains 41,356 genes on the GRCh38 reference human genome, of which 19,839 are protein-coding, and a total of 158,377 transcripts. These include 14,863 novel protein-coding transcripts. The total number of transcripts is substantially smaller than in earlier versions due to improved transcriptome assembly methods and to a stricter protocol for filtering out noisy transcripts. Notably, CHESS 3 contains all of the transcripts in the MANE database, and at least one transcript corresponding to the vast majority of protein-coding genes in the RefSeq and GENCODE databases. CHESS 3 has also been mapped onto the complete CHM13 human genome, which gives a more-complete gene count of 43,773 genes and 19,968 protein-coding genes. The CHESS database is available at http://ccb.jhu.edu/chess.
Publisher
Cold Spring Harbor Laboratory
Cited by
8 articles.