Abstract
Natural language processing (NLP) is increasingly being applied to obtain unsupervised representations of electronic healthcare record (EHR) data, but the performance of these representations for predicting clinical endpoints remains unclear. Here we use primary care EHRs from 6,286,233 people with Multiple Long-Term Conditions in England to generate vector representations of sequences of disease development using two input strategies (212 disease categories versus 9,462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec and two transformer models designed for EHRs). We also develop a new transformer architecture, named EHR-BERT, which incorporates socio-demographic information. We then compare how well each of these representations predicts mortality, healthcare use and new disease diagnosis. We find that representations generated using disease categories perform similarly to those using diagnostic codes, suggesting that the models handle smaller and larger vocabularies equally well. Sequence-based algorithms perform consistently better than bag-of-words methods, with EHR-BERT achieving the highest performance.
Publisher
Cold Spring Harbor Laboratory