Word Embedding for the French Natural Language in Health Care: Comparative Study

Published: 2019-07-29
Volume: 7
Issue: 3
Page: e12310
ISSN: 2291-9694
Container-title: JMIR Medical Informatics
Short-container-title: JMIR Med Inform
Language: en

Author:

Dynomant Emeric, Lelong Romain, Dahamna Badisse, Massonnaud Clément, Kerdelhué Gaétan, Grosjean Julien, Canu Stéphane, Darmoni Stefan J

Abstract

Background: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made on the ability of each of the 3 current most famous unsupervised implementations (Word2Vec, GloVe, and FastText) to keep track of the semantic similarities existing between words, when trained on the same dataset.

Objective: The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best method will then help us develop a new semantic annotator.

Methods: Unsupervised embedding models have been trained on 641,279 documents originating from the Rouen University Hospital. These data are not structured and cover a wide range of documents produced in a clinical setting (discharge summary, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one, analogy-based operations, and human formal evaluation) and applied on each model, as well as embedding visualization.

Results: Word2Vec had the highest score on 3 out of 4 rated tasks (analogy-based operations, odd one similarity, and human validation), particularly regarding the skip-gram architecture.

Conclusions: Although this implementation had the best rate for semantic properties conservation, each model has its own qualities and defects, such as the training time, which is very short for GloVe, or morphological similarity conservation observed with FastText. Models and test sets produced by this study will be the first to be publicly available through a graphical interface to help advance the French biomedical research.
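Three of the rated tasks named in the Methods (cosine similarity, odd one, and analogy-based operations) can be illustrated with a minimal sketch. This record does not describe the study's exact scoring procedure, so the definitions below are the commonly used ones and should be read as assumptions. The toy vectors, the French tokens, and the function names `cosine`, `odd_one_out`, and `analogy` are all hypothetical and are not taken from the study's models or test sets.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. They do not come
# from the Word2Vec, GloVe, or FastText models trained in the study.
TOY_VECTORS = {
    "coeur": [0.9, 0.1, 0.0],
    "cardiaque": [0.8, 0.2, 0.1],
    "poumon": [0.1, 0.9, 0.0],
    "pulmonaire": [0.0, 1.0, 0.1],
    "ordonnance": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(words, vectors):
    """Return the word with the lowest mean cosine similarity to the others."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)

    return min(words, key=mean_similarity)


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' using the vector offset b - a + c.

    The three query words are excluded from the candidates, which is the
    usual convention (assumed here, not confirmed by the source).
    """
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


if __name__ == "__main__":
    print(cosine(TOY_VECTORS["coeur"], TOY_VECTORS["cardiaque"]))
    print(odd_one_out(["coeur", "cardiaque", "ordonnance"], TOY_VECTORS))
    print(analogy("coeur", "cardiaque", "poumon", TOY_VECTORS))
```

With real models, the same three operations would be run over a full vocabulary and scored against a reference test set. The fourth rated task, human formal evaluation, has no code equivalent.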

Publisher

JMIR Publications Inc.

Subject

Health Information Management,Health Informatics

Reference: 35 articles.

1. Accuracy of using natural language processing methods for identifying healthcare-associated infections

2. The SMART retrieval system: Experiments in automatic document processing — Gerard Salton, Ed. (Englewood Cliffs, N.J.: Prentice-Hall, 1971, 556 pp., $15.00)

Cited by 19 articles.

1. Identifying multi-resolution clusters of diseases in ten million patients with multimorbidity in primary care in England; Communications Medicine; 2024-05-29

2. Extracting White-Box Knowledge from Word Embedding: Modeling as an Optimization Problem; Lecture Notes in Computer Science; 2024

3. Applying Natural Language Processing to Textual Data From Clinical Data Warehouses: Systematic Review; JMIR Medical Informatics; 2023-12-15

4. Do Japanese word-embedded representations obtained in the academic corpus retain the medical concepts of “infarction”?; Artificial Intelligence in Medicine; 2023-09

5. Identifying multi-resolution clusters of diseases in ten million patients with multimorbidity in primary care in England; 2023-06-30