Abstract
Objective: To develop and validate advanced natural language processing pipelines that detect 18 conditions in clinical notes written in French, including 16 comorbidities of the Charlson index, while exploring a collaborative and privacy-preserving workflow.

Materials and methods: The detection pipelines relied on rule-based algorithms for named entity recognition and on machine learning algorithms for entity qualification. We used a large language model pre-trained on millions of clinical notes, along with clinical notes annotated in the context of three cohort studies related to oncology, cardiology and rheumatology, respectively. The overall workflow was designed to foster collaboration between studies while complying with the privacy constraints of the data warehouse. We estimated the added value of both the advanced technologies and the collaborative setting.

Results: The 18 pipelines reached macro-averaged F1-score, positive predictive value, sensitivity and specificity of 95.7 (95% CI 94.5 - 96.3), 95.4 (95% CI 94.0 - 96.3), 96.0 (95% CI 94.0 - 96.7) and 99.2 (95% CI 99.0 - 99.4), respectively. F1-scores were superior to those observed using either alternative technologies or non-collaborative settings. The models were shared through a secured registry.

Conclusions: We demonstrated that a community of investigators working on a common clinical data warehouse could efficiently and securely collaborate to develop, validate and use sensitive artificial intelligence models. In particular, we provided efficient and robust natural language processing pipelines that detect conditions mentioned in clinical notes.
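For readers unfamiliar with the macro-averaged metrics reported in the Results, the following is a minimal sketch of how such figures can be derived from per-condition confusion counts. It is not the authors' evaluation code: the `Counts` class, the function names, the condition labels and all counts are hypothetical, and the abstract does not state at which level (entity, note or patient) predictions were scored or how the confidence intervals were obtained.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for one condition (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def safe_div(num: float, den: float) -> float:
    # Return 0.0 when the denominator is zero instead of raising.
    return num / den if den else 0.0


def metrics(c: Counts) -> dict:
    """Per-condition PPV, sensitivity, specificity and F1-score."""
    ppv = safe_div(c.tp, c.tp + c.fp)
    sensitivity = safe_div(c.tp, c.tp + c.fn)
    specificity = safe_div(c.tn, c.tn + c.fp)
    f1 = safe_div(2 * ppv * sensitivity, ppv + sensitivity)
    return {"f1": f1, "ppv": ppv,
            "sensitivity": sensitivity, "specificity": specificity}


def macro_average(per_condition: dict) -> dict:
    """Unweighted mean of each metric across conditions."""
    rows = [metrics(c) for c in per_condition.values()]
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}


# Illustrative counts only; these are NOT results from the study.
example = {
    "condition_a": Counts(tp=8, fp=2, fn=2, tn=88),
    "condition_b": Counts(tp=5, fp=0, fn=5, tn=90),
}
macro = macro_average(example)
```

Macro-averaging gives every condition the same weight regardless of its prevalence, so a rare condition with poor detection lowers the average as much as a common one would.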
Publisher
Cold Spring Harbor Laboratory