The Automatic Detection of Dataset Names in Scientific Articles

Published:2021-08-04 Issue:8 Volume:6 Page:84
ISSN:2306-5729
Container-title:Data
language:en
Short-container-title:Data

Author:

Heddes Jenny, Meerdink Pim, Pieters Miguel, Marx Maarten

Abstract

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, roughly half of which contain one or more named datasets. A distinguishing feature of this set is the many sentences that use enumerations, conjunctions and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well, outperforming several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (open-access) articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub.
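To make the phrase "long BI+ tag sequences" concrete, the following is a minimal sketch of how a sentence with an enumeration might be encoded with B/I/O tags. The sentence, the helper name `spans_to_bio`, and the choice to mark the whole elliptical enumeration as a single span are assumptions made for illustration only; the actual annotation guidelines and code in the authors' repository may differ.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Convert half-open token spans [start, end) into B/I/O tags.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"               # first token of a dataset mention
        for i in range(start + 1, end):
            tags[i] = "I"               # every following token inside the mention
    return tags


# Invented example sentence, pre-tokenized on whitespace.
tokens = "We evaluate on the Cora , Citeseer and Pubmed datasets .".split()

# Assumption: the enumeration "Cora , Citeseer and Pubmed datasets" shares the
# head noun "datasets" (an ellipsis), so it is marked here as one span.
spans = [(4, 10)]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this assumption, a single mention yields one B followed by five I tags, which is the kind of long sequence a tagger has to predict consistently across commas and conjunctions.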

Funder

Nederlandse Organisatie voor Wetenschappelijk Onderzoek

Publisher

MDPI AG

Subject

Information Systems and Management, Computer Science Applications, Information Systems

Link

https://www.mdpi.com/2306-5729/6/8/84/pdf
