Authors:
Wu Xuan, Zhao Yizheng, Yang Yang, Liu Zhangdaihong, Clifton David A.
Abstract
Objective: To compare and release the diagnosis (ICD-10-CM), procedure (ICD-10-PCS), and medication (NDC) concept (code) embeddings trained by Latent Dirichlet Allocation (LDA), Word2Vec, GloVe, and BERT, for more efficient electronic health record (EHR) data analysis.
Materials and Methods: The embeddings were pre-trained by the four models separately, using the diagnosis, procedure, and medication information in MIMIC-IV. We interpreted the embeddings by visualizing them in 2D space and used the silhouette coefficient to assess their clustering ability. We then evaluated the embeddings, without fine-tuning, on three downstream tasks: next-visit diagnosis prediction, ICU patient mortality prediction, and medication recommendation.
Results: Embeddings pre-trained by GloVe had the best performance in the downstream tasks and the best interpretability across all diagnosis, procedure, and medication codes. In next-visit diagnosis prediction, the accuracy with GloVe embeddings was 12.2% higher than that of the baseline, a random generator. In the other two prediction tasks, GloVe improved accuracy by 2%-3% over the baseline. LDA, Word2Vec, and BERT improved the results only marginally over the baseline in most cases.
Discussion and Conclusion: GloVe shows superiority over LDA, Word2Vec, and BERT in mining the diagnosis, procedure, and medication information in MIMIC-IV. In addition, we found that the granularity of training samples can affect model performance, depending on the downstream task and the pre-training data.
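As a minimal sketch of the methodology the abstract describes (not the authors' released code), the example below treats each patient visit's ICD-10 codes as one "sentence", trains Word2Vec embeddings from code co-occurrence, and scores cluster quality with the silhouette coefficient. The corpus, the chapter-letter cluster labels, and all hyperparameters are illustrative assumptions.

# Minimal sketch (assumed setup, not the authors' code): Word2Vec code
# embeddings trained on visit-level code "sentences", evaluated with
# the silhouette coefficient as in the abstract.
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics import silhouette_score

# Hypothetical corpus: each inner list holds the codes from one visit.
visits = [
    ["I10", "E11.9", "N18.3"],  # hypertension, type 2 diabetes, CKD
    ["I10", "I50.9"],           # hypertension, heart failure
    ["E11.9", "E78.5"],         # type 2 diabetes, hyperlipidemia
]

# Skip-gram Word2Vec over visits; the window spans co-occurring codes.
model = Word2Vec(sentences=visits, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=10, seed=42)

codes = list(model.wv.index_to_key)
vectors = np.array([model.wv[c] for c in codes])

# Assumed cluster labels: the ICD-10 chapter letter of each code. The
# silhouette coefficient lies in [-1, 1]; higher values mean tighter,
# better-separated clusters of related codes.
labels = [c[0] for c in codes]
print(silhouette_score(vectors, labels))

In the paper's setting, GloVe embeddings would instead be fit from a global code-code co-occurrence matrix, but the same silhouette-based evaluation applies to any of the four embedding types.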
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.