Author:
Giles Oliver, Huntley Rachael, Karlsson Anneli, Lomax Jane, Malone James
Abstract
The COVID-19 Open Research Dataset (CORD-19) was released in March 2020 to allow the machine learning and wider research community to develop techniques to answer scientific questions on COVID-19. The dataset consists of a large collection of scientific literature, including over 100,000 full-text papers. Annotating training data to normalise variability in biological entities can improve the performance of downstream analysis and interpretation. To facilitate and enhance the use of the CORD-19 data in these applications, in late March 2020 we performed a comprehensive annotation process using the named entity recognition tool TERMite, along with a number of large reference ontologies and vocabularies covering domains including genes, proteins, drugs and virus strains. The additional annotation has identified and tagged over 45 million entities within the corpus, made up of 62,746 unique biomedical entities. The latest version of the annotated data, as well as older versions, is made openly available under the GPL-2.0 License for the community to use at: https://github.com/SciBiteLabs/CORD19
Publisher
Cold Spring Harbor Laboratory