Optimizing word embeddings for small datasets: a case study on patient portal messages from breast cancer patients

Published:2024-07-12 Issue:1 Volume:14 Page:
ISSN:2045-2322
Container-title:Scientific Reports
language:en
Short-container-title:Sci Rep

Author:

Song Qingyuan, Ni Congning, Warner Jeremy L., Chen Qingxia, Song Lijun, Rosenbloom S. Trent, Malin Bradley A., Yin Zhijun

Abstract

Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small. We introduce a novel adaptation of the word2vec model, PK-word2vec (where PK stands for prior knowledge), for small-scale messages. PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec in a case study of patient portal messages in the Vanderbilt University Medical Center electronic health record system sent by patients diagnosed with breast cancer from December 2004 to November 2017. We evaluated the model through a set of 1000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks. The dataset was composed of 1389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7981 non-medical and 1116 medical words. In over 90% of the tasks, both groups of reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference in the evaluation by AMT workers versus medical students was negligible for all comparisons of tasks' choices between the two groups of reviewers (p = 0.774 under a paired t-test). PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.
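The evaluation described in the abstract relies on retrieving the five most similar words for a given word from an embedding model. As a rough illustration only, the sketch below ranks neighbors by cosine similarity, which is the usual similarity measure for word2vec-style vectors; the abstract does not state which measure the authors used. The function names (`cosine`, `most_similar`) and the tiny three-dimensional vectors in `toy_embeddings` are hypothetical and invented for this example. They are not taken from the paper or its data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, k=5):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word  # a word is not its own neighbor
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


# Hypothetical toy vectors, made up for illustration (not from the paper).
toy_embeddings = {
    "chemo": [0.9, 0.1, 0.0],
    "chemotherapy": [0.85, 0.15, 0.05],
    "radiation": [0.7, 0.3, 0.1],
    "appointment": [0.1, 0.9, 0.2],
    "schedule": [0.05, 0.95, 0.1],
    "thanks": [0.0, 0.2, 0.9],
}

neighbors = most_similar("chemo", toy_embeddings, k=2)
print(neighbors)
```

In the study's setup, one such neighbor list would come from PK-word2vec and another from standard word2vec, and reviewers would judge which list is more relevant to the given word.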

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s41598-024-66319-z.pdf

References: 32 articles.

1. Dendere, R. et al. Patient portals facilitating engagement with inpatient electronic medical records: A systematic review. J. Med. Internet Res. 21(4), e12779. https://doi.org/10.2196/12779 (2019).

2. Goel, M. S. et al. Patient reported barriers to enrolling in a patient portal. J. Am. Med. Inf. Assoc. 18(1), i8–i12 (2011).

3. Kruse, C. S., Bolton, K. & Freriks, G. The effect of patient portals on quality outcomes and its implications to meaningful use: A systematic review. J. Med. Internet Res. 17(2), e44 (2015).

4. Ralston, J. D. et al. Patient web services integrated with a shared medical record: Patient use and satisfaction. J. Am. Med. Inf. Assoc. 14(6), 798–806 (2007).

5. Osborn, C. Y. et al. MyHealthAtVanderbilt: Policies and procedures governing patient portal functionality. J. Am. Med. Inf. Assoc. 18(1), i18–i23 (2011).