Affiliation:
1. Max Planck Institute for Informatics, Germany. ssinghan@mpi-inf.mpg.de
2. Max Planck Institute for Informatics, Germany. srazniew@mpi-inf.mpg.de
3. Max Planck Institute for Informatics, Germany. weikum@mpi-inf.mpg.de
Abstract
This paper presents a new task of predicting the coverage of a text document for relation extraction (RE): Does the document contain many relational tuples for a given entity? Coverage predictions are useful in selecting the best documents for knowledge base construction with large input corpora. To study this problem, we present a dataset of 31,366 diverse documents for 520 entities. We analyze the correlation of document coverage with features like length, entity mention frequency, Alexa rank, language complexity, and information retrieval scores. Each of these features has only moderate predictive power. We employ methods combining features with statistical models like TF-IDF and language models like BERT. The model combining features and BERT, HERB, achieves an F1 score of up to 46%. We demonstrate the utility of coverage predictions on two use cases: KB construction and claim refutation.
Subject
Artificial Intelligence, Computer Science Applications, Linguistics and Language, Human-Computer Interaction, Communication