Streamlining social media information retrieval for public health research with deep learning

Author:

Hua Yining123ORCID,Wu Jiageng4,Lin Shixu4,Li Minghui4,Zhang Yujie4,Foer Dinah3,Wang Siwen1ORCID,Zhou Peilin5,Yang Jie6,Zhou Li3

Affiliation:

1. Department of Epidemiology, Harvard Chan School of Public Health , Boston, MA 02115, United States

2. Department of Biomedical Informatics, Harvard Medical School , Boston, MA 02115, United States

3. Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School , Boston, MA 02145, United States

4. School of Public Health, Zhejiang University School of Medicine , Hangzhou, Zhejiang, 310058, China

5. Thrust of Data Science and Analytics, The Hong Kong University of Science and Technology (Guangzhou) , Guangzhou, Guangdong, 511458, China

6. Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School , Boston, MA 02120, United States

Abstract

Abstract Objective Social media-based public health research is crucial for epidemic surveillance, but most studies identify relevant corpora with keyword-matching. This study develops a system to streamline the process of curating colloquial medical dictionaries. We demonstrate the pipeline by curating a Unified Medical Language System (UMLS)-colloquial symptom dictionary from COVID-19-related tweets as proof of concept. Methods COVID-19-related tweets from February 1, 2020, to April 30, 2022 were used. The pipeline includes three modules: a named entity recognition module to detect symptoms in tweets; an entity normalization module to aggregate detected entities; and a mapping module that iteratively maps entities to Unified Medical Language System concepts. A random 500 entity samples were drawn from the final dictionary for accuracy validation. Additionally, we conducted a symptom frequency distribution analysis to compare our dictionary to a pre-defined lexicon from previous research. Results We identified 498 480 unique symptom entity expressions from the tweets. Pre-processing reduces the number to 18 226. The final dictionary contains 38 175 unique expressions of symptoms that can be mapped to 966 UMLS concepts (accuracy = 95%). Symptom distribution analysis found that our dictionary detects more symptoms and is effective at identifying psychiatric disorders like anxiety and depression, often missed by pre-defined lexicons. Conclusions This study advances public health research by implementing a novel, systematic pipeline for curating symptom lexicons from social media data. The final lexicon's high accuracy, validated by medical professionals, underscores the potential of this methodology to reliably interpret, and categorize vast amounts of unstructured social media data into actionable medical insights across diverse linguistic and regional landscapes.

Publisher

Oxford University Press (OUP)

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3