Abstract
Electronic medical records (EMRs) contain a wealth of valuable patient data, but most of it is unstructured text. In addition, there is a lack of both labeled medical text data in Russian and tools for automatic annotation. As a result, it is currently hardly feasible for researchers to use EMR text data to train machine learning models in the biomedical domain. We present an unsupervised approach to medical data annotation. Syntactic trees are built from the initial sentences using morphological and syntactic analysis. In the resulting trees, similar subtrees are grouped using Node2Vec and Word2Vec and labeled using domain vocabularies and Wikidata categories. Using Wikidata categories increased the fraction of labeled sentences 5.5-fold compared with labeling with domain vocabularies only. We show on a validation dataset that the proposed labeling method generates meaningful, correct labels for 92.7% of groups. Annotation with domain vocabularies and Wikidata categories covered more than 82% of the sentences in the corpus; when extended with timestamp and event labels, coverage reached 97% of sentences. The resulting method can be used to label EMRs in Russian automatically. Additionally, the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.
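The grouping and labeling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each subtree already has an embedding vector (the hand-written toy vectors below stand in for Node2Vec/Word2Vec output), uses a simple greedy cosine-similarity grouping in place of whatever clustering the paper applies, and labels each group by majority vote over a hypothetical domain vocabulary. All function names, the threshold, and the English example phrases are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_similarity(vectors, threshold=0.9):
    """Greedy grouping (an assumption, not the paper's algorithm):
    each subtree joins the first group whose seed vector is similar enough,
    otherwise it starts a new group."""
    groups = []  # list of (seed_vector, [subtree phrases])
    for phrase, vec in vectors.items():
        for seed, members in groups:
            if cosine(seed, vec) >= threshold:
                members.append(phrase)
                break
        else:
            groups.append((vec, [phrase]))
    return [members for _, members in groups]


def label_group(members, vocabulary):
    """Label a group with the most frequent vocabulary category
    among the words of its subtrees; None if no word is known."""
    counts = {}
    for phrase in members:
        for word in phrase.split():
            category = vocabulary.get(word)
            if category:
                counts[category] = counts.get(category, 0) + 1
    return max(counts, key=counts.get) if counts else None


# Toy embeddings standing in for Node2Vec/Word2Vec vectors (hypothetical values).
subtree_vectors = {
    "acute bronchitis": [0.9, 0.1, 0.0],
    "chronic gastritis": [0.85, 0.15, 0.05],
    "prescribed aspirin": [0.1, 0.9, 0.1],
    "prescribed ibuprofen": [0.05, 0.95, 0.0],
}

# Hypothetical domain vocabulary mapping terms to categories.
vocabulary = {
    "bronchitis": "disease",
    "gastritis": "disease",
    "aspirin": "drug",
    "ibuprofen": "drug",
}

groups = group_by_similarity(subtree_vectors)
labels = [label_group(group, vocabulary) for group in groups]

for group, label in zip(groups, labels):
    print(label, group)
```

In this sketch, a group whose words are absent from the vocabulary receives no label. That is the gap the abstract reports closing with a second source of categories, Wikidata.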
Funder
Ministry of Science and Higher Education of the Russian Federation