Abstract
Motivation
Human traits are typically represented in both the biomedical literature and large population studies as descriptive text strings. Whilst a number of ontologies exist, none of these perfectly represent the entire human phenome and exposome. Mapping trait names across large datasets is therefore time-consuming and challenging. Recent developments in language modelling have created new methods for semantic representation of words and phrases, and these methods offer new opportunities to map human trait names in the form of words and short phrases, both to ontologies and to each other. Here we present a comparison between a range of established and more recent language modelling approaches for the task of mapping trait names from UK Biobank to the Experimental Factor Ontology (EFO), and also explore how they compare to each other in direct trait-to-trait mapping.
Results
In our analyses of 1191 traits from UK Biobank with manual EFO mappings, the BioSentVec model performed best at predicting these, matching 40.3% of the manual mappings correctly. The BlueBERT-EFO model (finetuned on EFO) performed nearly as well (38.8% of traits matching the manual mapping). In contrast, Levenshtein edit distance only mapped 22% of traits correctly. Pairwise mapping of traits to each other demonstrated that many of the models can accurately group similar traits based on their semantic similarity.
Availability and Implementation
Our code is available at https://github.com/MRCIEU/vectology.
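The Levenshtein baseline mentioned in the Results can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository). It assumes each trait is mapped to the ontology label with the smallest case-insensitive edit distance, and that accuracy is the fraction of traits whose predicted label equals the manual mapping. The function names (`levenshtein`, `map_trait`, `top1_accuracy`) and the toy trait and label strings are hypothetical and are not taken from UK Biobank or EFO.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def map_trait(trait: str, labels: list[str]) -> str:
    """Return the label closest to the trait by edit distance (ties: first label wins)."""
    return min(labels, key=lambda label: levenshtein(trait.lower(), label.lower()))


def top1_accuracy(predicted: dict[str, str], manual: dict[str, str]) -> float:
    """Fraction of traits whose predicted label equals the manual mapping."""
    hits = sum(predicted[trait] == manual[trait] for trait in manual)
    return hits / len(manual)


# Hypothetical toy data, for illustration only (assumption, not from the paper).
labels = ["asthma", "body mass index", "type 2 diabetes mellitus"]
manual = {
    "Asthma": "asthma",
    "Body mass index (BMI)": "body mass index",
    "Diabetes": "type 2 diabetes mellitus",
}
predicted = {trait: map_trait(trait, labels) for trait in manual}
accuracy = top1_accuracy(predicted, manual)
```

In this toy example, "Diabetes" is mapped to "asthma" because the two strings are similar in length and share a few characters, whereas the semantically correct label is much longer. Failures of this kind are one plausible reason a purely string-based baseline might trail the embedding-based models.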
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.