Affiliation:
1. Institute of Data Science, Maastricht University, Maastricht, The Netherlands
Abstract
While there exists an abundance of open biomedical data, the lack of high-quality metadata makes it challenging for others to find relevant datasets and to reuse them for another purpose. In particular, metadata are useful to understand the nature and provenance of the data. A common approach to improving the quality of metadata relies on expensive human curation, which itself is time-consuming and also prone to error. Towards improving the quality of metadata, we use scientific publications to automatically predict metadata key:value pairs. For prediction, we use a Convolutional Neural Network (CNN) and a Bidirectional Long-short term memory network (BiLSTM). We focus our attention on the NCBI Disease Corpus, which is used for training the CNN and BiLSTM. We perform two different kinds of experiments with these two architectures: (1) we predict the disease names by using their unique ID in the MeSH ontology and (2) we use the tree structures of MeSH ontology to move up in the hierarchy of these disease terms, which reduces the number of labels. We also perform various multi-label classification techniques for the above-mentioned experiments. We find that in both cases CNN achieves the best results in predicting the superclasses for disease with an accuracy of 83%.
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems and Management,Information Systems
Reference20 articles.
1. NCBI GEO: archive for functional genomics data sets—update
2. A neural probabilistic language model;Bengio Yoshua;Journal of Machine Learning Research 3,2003
3. The conundrum of sharing research data
4. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data
5. François Chollet et al. 2015. Keras. https://keras.io. François Chollet et al. 2015. Keras. https://keras.io.
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献