Author:
Chen James L, Liu Yang, Sam Lee T, Li Jianrong, Lussier Yves A
Abstract
Background
Biological data that are well organized by an ontology, such as Gene Ontology, enable high-throughput use of the semantic web. Such data can also facilitate high-throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on automating the classification of human disease genes using Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes based on their Gene Ontology annotations and compare them to a gold standard. This gold standard was independently conceived by Valle's research group and contains 923 human disease genes organized into 14 categories of protein function.
Results
Two automated methods were applied to classify human disease genes into independently pre-defined categories of protein function. The first method used the structure of Gene Ontology (GO) by pre-selecting 74 GO terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes, clustered according to the information-theoretic distance of their GO annotations. Compared to the categorization of human disease genes in the gold standard, the two automated methods achieve overall precision of 56% and 47%, with recall of 62% and 71%, respectively. However, approximately 15% of the studied human disease genes remain without GO annotations.
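To make the idea of an information-theoretic distance between annotations concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state which measure was used; the sketch assumes Lin's similarity (reference 5) built on information content (reference 4), and takes distance as one minus similarity. The toy taxonomy, term names, and annotation counts are hypothetical and are not real GO identifiers.

```python
import math

# Hypothetical toy taxonomy: term -> list of parent terms (not real GO terms).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic": ["root"],
    "dna_binding": ["binding"],
    "hormone_binding": ["binding"],
}

# Hypothetical number of genes annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0,
    "binding": 2,
    "catalytic": 4,
    "dna_binding": 3,
    "hormone_binding": 1,
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_probabilities():
    """p(t): share of all annotations made to t or to any of its descendants."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    return {term: c / total for term, c in cumulative.items()}


def information_content(p):
    """Information content of a term with probability p: -log p."""
    return -math.log(p)


def lin_similarity(a, b, probs):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = ancestors(a) & ancestors(b)
    ic_shared = max(information_content(probs[t]) for t in common)
    denom = information_content(probs[a]) + information_content(probs[b])
    # The ratio is undefined when both terms carry no information (e.g. the root);
    # this sketch treats that case as zero similarity.
    return 2 * ic_shared / denom if denom > 0 else 0.0


def lin_distance(a, b, probs):
    """Assumed distance for clustering: 1 - Lin similarity."""
    return 1.0 - lin_similarity(a, b, probs)


probs = term_probabilities()
print(lin_distance("dna_binding", "hormone_binding", probs))
print(lin_distance("dna_binding", "catalytic", probs))
```

In this sketch, terms that share only the root are maximally distant, while terms under a specific shared parent are closer. A gene-level distance would additionally need a rule for combining term-level scores across each gene's annotation set, which the abstract does not specify.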
Conclusion
Automated methods can recapitulate a significant portion of the classification of human disease genes. The method using information-theoretic distance performs slightly better in precision, with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%. In summary, this study demonstrates that, for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress both in the Gene Ontology annotations and, particularly, in the utilization of these annotations.
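The per-category precision and recall mentioned above can be computed along the following lines. This is a minimal sketch under stated assumptions: each gene has exactly one gold-standard category, genes without GO annotations receive no prediction, and the gene identifiers and labels are hypothetical. The function name `per_category_scores` is introduced here for illustration only.

```python
from collections import defaultdict


def per_category_scores(gold, predicted):
    """Return {category: (precision, recall)} for single-label assignments.

    gold and predicted map gene -> category. Genes absent from `predicted`
    (for example, genes without annotations) count against recall only.
    """
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for gene, gold_cat in gold.items():
        pred_cat = predicted.get(gene)
        if pred_cat == gold_cat:
            tp[gold_cat] += 1
        else:
            fn[gold_cat] += 1
            if pred_cat is not None:
                fp[pred_cat] += 1

    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        predicted_total = tp[cat] + fp[cat]
        gold_total = tp[cat] + fn[cat]
        precision = tp[cat] / predicted_total if predicted_total else None
        recall = tp[cat] / gold_total if gold_total else None
        scores[cat] = (precision, recall)
    return scores


# Hypothetical example: G4 has no annotation, so it has no prediction.
gold = {
    "G1": "hormone",
    "G2": "hormone",
    "G3": "transcription factor",
    "G4": "enzyme",
}
predicted = {
    "G1": "hormone",
    "G2": "enzyme",
    "G3": "transcription factor",
}
for category, (precision, recall) in sorted(per_category_scores(gold, predicted).items()):
    print(category, precision, recall)
```

Leaving unannotated genes out of `predicted` lowers recall without affecting precision, which is one way the roughly 15% of genes lacking GO annotations could bear on the reported figures.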
Publisher
Springer Science and Business Media LLC
Subject
Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology
References
13 articles.
1. Jimenez-Sanchez G, Childs B, Valle D: Human disease genes. Nature 2001, 409(6822):853–855. 10.1038/35057050
2. Shannon CE: A Mathematical Theory of Communication. Bell System Technical Journal 1948, 27(3):379–423.
3. Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275–1283. 10.1093/bioinformatics/btg153
4. Resnik P: Using information content to evaluate semantic similarity in a taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence: 1995; Montreal 1995, 448–453.
5. Lin D: An information-theoretic definition of similarity. 15th International Conference on Machine Learning: 1998; San Francisco 1998, 296–304.
Cited by
10 articles.