Automatic Discovery of Abnormal Values in Large Textual Databases

Published:2016-06-06 Issue:1-2 Volume:7 Page:1-31
ISSN:1936-1955
Container-title:Journal of Data and Information Quality
language:en
Short-container-title:J. Data and Information Quality

Author:

Christen Peter¹,Gayler Ross W.²,Tran Khoi-Nguyen¹,Fisher Jeffrey¹,Vatsalan Dinusha¹

Affiliation:

1. The Australian National University, Acton, Australia

2. Veda

Abstract

Textual databases are ubiquitous in many application domains. Examples of textual data range from names and addresses of customers to social media posts and bibliographic records. With online services, individuals are increasingly required to enter their personal details, for example when purchasing products online or registering for government services, while many social network and e-commerce sites allow users to post short comments. Many online sites leave open the possibility for people to enter unintended or malicious abnormal values, such as names with errors, bogus values, profane comments, or random character sequences. In other applications, such as online bibliographic databases or comparative online shopping sites, databases are increasingly populated in (semi-)automatic ways through Web crawls. This practice can result in low-quality data being added automatically to a database. In this article, we develop three techniques to automatically discover abnormal (unexpected or unusual) values in large textual databases. Following recent work in categorical outlier detection, our assumption is that “normal” values are those that occur frequently in a database, while an individual abnormal value is rare. Our techniques are unsupervised and treat the discovery of abnormal values as an outlier detection problem. Our first technique is a basic but efficient q-gram set based technique, the second is based on a probabilistic language model, and the third employs morphological word features to train a one-class support vector machine classifier. Our aim is to investigate and develop techniques that are fast, efficient, and automatic. The output of our techniques can help in the development of rule-based data cleaning and information extraction systems, or be used as training data for further supervised data cleaning procedures.
We evaluate our techniques on four large real-world datasets from different domains: two US voter registration databases containing personal details, the 2013 KDD Cup dataset of bibliographic records, and the SNAP Memetracker dataset of phrases from social networking sites. Our results show that our techniques can efficiently and automatically discover abnormal textual values, allowing an organization to conduct efficient data exploration and improve the quality of its textual databases without requiring explicit training data.
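
The abstract's core assumption (frequent values are "normal", rare ones are abnormal) can be illustrated with a small q-gram sketch. This is not the article's algorithm, whose details are not given in this record. It is only a minimal illustration, assuming that a value is suspicious when most of its q-grams appear in very few records. The function names, the padding characters `^` and `$`, and the `min_count` threshold are all hypothetical choices made for this example.

```python
from collections import Counter


def qgrams(value, q=2):
    """Return the set of q-grams of a string, padded so that the first and
    last characters also form q-grams (padding symbols are an assumption)."""
    padded = "^" * (q - 1) + value.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def abnormality_scores(values, q=2, min_count=2):
    """Score each distinct value by the fraction of its q-grams that occur
    in fewer than `min_count` records. Higher means more unusual."""
    # Count, for every q-gram, how many records contain it.
    record_freq = Counter()
    for value in values:
        record_freq.update(qgrams(value, q))

    scores = {}
    for value in set(values):
        grams = qgrams(value, q)
        if not grams:  # only possible for an empty value with q=1
            scores[value] = 0.0
            continue
        rare = sum(1 for g in grams if record_freq[g] < min_count)
        scores[value] = rare / len(grams)
    return scores


# Hypothetical toy data: common surnames plus one random character sequence.
names = ["smith"] * 50 + ["smyth"] * 5 + ["jones"] * 40 + ["xq#zz"]
ranked = sorted(abnormality_scores(names).items(), key=lambda kv: -kv[1])
```

In this toy data the random sequence is the only value whose q-grams are all rare, so it sorts to the front of `ranked`. A real system would need a threshold tuned to the database size, and the article's other two techniques (a probabilistic language model and a one-class SVM) are not covered by this sketch.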

Funder

Australian Research Council

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management,Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/2889311

Cited by 8 articles.

1. Pattern Masking for Dictionary Matching: Theory and Practice;Algorithmica;2024-03-06

2. Thirty-three myths and misconceptions about population data: from data capture and processing to linkage;INT J POPUL DATA SCI;2023

3. Unsupervised Identification of Abnormal Nodes and Edges in Graphs;Journal of Data and Information Quality;2022-12-28

4. A scoping review of preprocessing methods for unstructured text data to assess data quality;INT J POPUL DATA SCI;2022

5. Unsupervised Anomaly Detection in Knowledge Graphs;Proceedings of the 10th International Joint Conference on Knowledge Graphs;2021-12-06