DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation

Published: 2023-09-12 Issue: 3 Volume: 48 Page: 1-40
ISSN: 0362-5915
Container-title: ACM Transactions on Database Systems
Language: en
Short-container-title: ACM Trans. Database Syst.

Author:

Leventidis Aristotelis¹, Di Rocco Laura¹, Gatterbauer Wolfgang¹, Miller Renée J.¹, Riedewald Mirek¹

Affiliation:

1. Northeastern University, USA

Abstract

Modern data lakes are heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: How can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values, because tables implicitly define a massive network of interconnected values. We introduce DomainNet, which efficiently represents this network, and investigate to what extent it can be used to disambiguate values without requiring any supervision. DomainNet leverages network-centrality measures on a bipartite graph whose nodes represent data values and attributes to determine if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs achieves an F1-score of 0.38 versus 0.69 for DomainNet, which separates homographs well from data values that have a unique meaning. On a real data lake, our top-100 precision is 93%. Given a homograph, we also present a novel method for determining the number of meanings of the homograph and for assigning its data lake attributes to a meaning. We show the influence of homographs on two downstream tasks: entity-matching and domain discovery.
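The idea in the abstract (a bipartite graph of data values and attributes, scored with a network-centrality measure) can be illustrated with a small sketch. This is not the authors' implementation. It assumes betweenness centrality as the centrality measure, which the abstract does not specify, and it uses a hypothetical toy data lake and hypothetical function names (`build_bipartite_graph`, `betweenness_centrality`, `homograph_scores`). It relies only on the Python standard library.

```python
from collections import defaultdict, deque

def build_bipartite_graph(tables):
    """Build an undirected bipartite graph of value nodes and attribute nodes.

    `tables` is a hypothetical toy format: {table_name: {column_name: [values]}}.
    Nodes are tagged tuples so a value can never collide with an attribute name.
    """
    graph = defaultdict(set)
    for table, columns in tables.items():
        for column, values in columns.items():
            attr = ("attr", f"{table}.{column}")
            for raw in values:
                val = ("val", str(raw).strip().lower())
                graph[attr].add(val)
                graph[val].add(attr)
    return graph

def betweenness_centrality(graph):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    bc = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                       # nodes in order of non-decreasing distance
        preds = defaultdict(list)        # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:                     # breadth-first search from source
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while order:                     # accumulate dependencies, farthest first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints, so halve the totals.
    return {node: score / 2 for node, score in bc.items()}

def homograph_scores(tables):
    """Return {value: betweenness} for value nodes only; higher = more suspicious."""
    bc = betweenness_centrality(build_bipartite_graph(tables))
    return {name: score for (kind, name), score in bc.items() if kind == "val"}

# Toy data lake (made up for illustration): "Jaguar" is an animal and a car maker.
tables = {
    "zoo": {"animal": ["Jaguar", "Panda", "Lemur"]},
    "sanctuary": {"species": ["Jaguar", "Panda", "Puma"]},
    "cars": {"maker": ["Jaguar", "Toyota", "Fiat"]},
    "dealers": {"brand": ["Jaguar", "Toyota", "Volvo"]},
}
scores = homograph_scores(tables)
for value, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{value:8s} {score:.2f}")
```

In this toy lake, every shortest path between the animal columns and the car columns passes through the `jaguar` node, so it receives the highest score among the values. Values that repeat within a single meaning, such as `panda`, score much lower, and values that appear in only one column score zero. Running the all-pairs computation exactly is only practical for small graphs; a real data lake would likely need an approximate or sampled centrality computation.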

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3612919
