Affiliation:
1. Indian Institute of Technology, Bombay
2. University of Washington, Seattle and Indian Institute of Technology, Bombay
Abstract
In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses stored in large corporate databases as a single text field into subfields like “City” and “Street”. Existing tools rely on hand-tuned, domain-specific rule-based systems.
We describe a tool DATAMOLD that learns to automatically extract structure when seeded with a small number of training examples. The tool enhances on Hidden Markov Models (HMM) to build a powerful probabilistic model that corroborates multiple sources of information including, the sequence of elements, their length distribution, distinguishing words from the vocabulary and an optional external data dictionary. Experiments on real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems,Software
Cited by
44 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献