Abstract
This study introduces RheumaLinguisticpack (RheumaLpack), the first specialised linguistic web corpus designed for the field of musculoskeletal disorders. By combining web mining (i.e., web scraping) and natural language processing (NLP) techniques, as well as clinical expertise, RheumaLpack systematically captures and curates structured and unstructured data across a spectrum of web sources including clinical trials registers (i.e., ClinicalTrials.gov), bibliographic databases (i.e., PubMed), medical agencies (i.e., EMA), social media (i.e., Reddit), and accredited health websites (i.e., MedlinePlus, Harvard Health Publishing, and Cleveland Clinic). Given the complexity of rheumatic and musculoskeletal diseases (RMDs) and their significant impact on quality of life, this resource is proposed as a tool to train algorithms that could help mitigate the diseases’ effects. The corpus therefore aims to improve the training of artificial intelligence (AI) algorithms and facilitate knowledge discovery in RMDs. The development of RheumaLpack involved a systematic six-step methodology covering data identification, characterisation, selection, collection, processing, and corpus description. The result is a non-annotated, monolingual, and dynamic corpus, featuring almost 3 million records spanning from 2000 to 2023. RheumaLpack represents a pioneering contribution to rheumatology research, providing a useful resource for the development of advanced AI and NLP applications. This corpus highlights the value of web data to address the challenges posed by musculoskeletal diseases, illustrating its potential to improve research and treatment paradigms in rheumatology. Finally, the methodology shown can be replicated to obtain data from other medical specialities. The code and details on how to build RheumaL(inguistic)pack are also provided to facilitate the dissemination of this resource.
Publisher
Cold Spring Harbor Laboratory