Affiliation:
1. Indian Institute of Technology, Kharagpur, India
Abstract
Relation classification (sometimes called
relation extraction
) requires trustworthy datasets for fine-tuning large language models, as well as for evaluation. Data collection is challenging for Indian languages, because they are syntactically and morphologically diverse, as well as different from resource-rich languages like English. Despite recent interest in deep generative models for Indian languages, relation classification is still not well served by public datasets. In response, we present
IndoRE
, a dataset with 21K entity- and relation-tagged gold sentences in three Indian languages (Bengali, Hindi, and Telugu), plus English. We start with a multilingual BERT (mBERT)-based system that captures entity span positions and type information, and provides competitive performance on monolingual relation classification. Using this baseline system, we explore transfer mechanisms between languages and the scope to reduce expensive data annotation while achieving reasonable relation extraction performance. Specifically, we
(a)
study the accuracy-efficiency trade-off between expensive, manually labeled gold instances vs. automatically translated and aligned silver instances to train a relation extractor,
(b)
device a simple mechanism for budgeted gold data annotation by intelligently converting distant-supervised silver training instances to gold training instances with human annotators using active learning, and finally
(c)
propose an ensemble model to provide a performance boost over that achieved via limited gold training instances.
We release the dataset for future research.
1
Publisher
Association for Computing Machinery (ACM)
Cited by
2 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献