DMDD: A Large-Scale Dataset for Dataset Mentions Detection

Published:2023 Volume:11 Page:1132-1146
ISSN:2307-387X
Container-title:Transactions of the Association for Computational Linguistics
language:en

Author:

Pan Huitong¹,Zhang Qi²,Dragut Eduard³,Caragea Cornelia⁴,Latecki Longin Jan⁵

Affiliation:

1. Temple University, Philadelphia, Pennsylvania, USA. huitong.pan@temple.edu

2. Temple University, Philadelphia, Pennsylvania, USA. qi.zhang@temple.edu

3. Temple University, Philadelphia, Pennsylvania, USA. edragut@temple.edu

4. University of Illinois Chicago, Chicago, Illinois, USA. cornelia@uic.edu

5. Temple University, Philadelphia, Pennsylvania, USA. latecki@temple.edu

Abstract

The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.

Publisher

MIT Press

Subject

Artificial Intelligence,Computer Science Applications,Linguistics and Language,Human-Computer Interaction,Communication

Link

https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00592/2159087/tacl_a_00592.pdf

Reference: 29 articles.

1. The ACE 2005 (ACE 05) evaluation plan: Evaluation of the detection and recognition of ACE entities, values, temporal expressions, relations, and events;ACE,2005

2. EmoNet: Fine-grained emotion detection with gated recurrent neural networks;Abdul-Mageed,2017

3. SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications;Augenstein,2017

4. SciBERT: A pretrained language model for scientific text;Beltagy,2019

5. Longformer: The long-document transformer;Beltagy;arXiv preprint arXiv:2004.05150,2020