Creating Relational Data from Unstructured and Ungrammatical Data Sources

Published: 2008-03-28 Volume: 31 Pages: 543-590
ISSN: 1076-9757
Container-title: Journal of Artificial Intelligence Research
Short-container-title: jair

Author:

Michelson M., Knoblock C. A.

Abstract

In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructured, ungrammatical data "posts." The unstructured nature of posts makes query and integration difficult because the attributes are embedded within the text. Also, these attributes do not conform to standardized values, which prevents queries based on a common attribute value. The schema is unknown and the values may vary dramatically, making accurate search difficult. Creating relational data for easy querying requires that we define a schema for the embedded attributes and extract values from the posts while standardizing these values. Traditional information extraction (IE) is inadequate to perform this task because it relies on clues from the data, such as structure or natural language, neither of which is found in posts. Furthermore, traditional information extraction does not incorporate data cleaning, which is necessary to accurately query and integrate the source. The two-step approach described in this paper creates relational data sets from unstructured and ungrammatical text by addressing both issues. To do this, we require a set of known entities called a "reference set." The first step aligns each post to each member of each reference set. This allows our algorithm to define a schema over the post and include standard values for the attributes defined by this schema. The second step performs information extraction for the attributes, including attributes not easily represented by reference sets, such as a price.
In this manner we create a relational structure over previously unstructured data, supporting deep and accurate queries over the data as well as standard values for integration. Our experimental results show that our technique matches the posts to the reference set accurately and efficiently and outperforms state-of-the-art extraction systems on the extraction task from posts.
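
The two steps described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not specify the matching or extraction methods, so a simple token-overlap score and a regular expression are used here as stand-ins. The reference set contents, the function names (`align`, `extract_price`, `to_row`), and the sample post are all hypothetical.

```python
import re

# Hypothetical reference set: known entities with standardized attribute values.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]


def tokens(text):
    """Return the lowercase alphanumeric tokens of a string as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def align(post, reference_set):
    """Step 1 (stand-in): pick the reference record best covered by the post.

    The score is the fraction of a record's tokens that appear in the post.
    Ties go to the earliest record; None is returned if nothing overlaps.
    """
    post_tokens = tokens(post)

    def score(record):
        record_tokens = tokens(" ".join(record.values()))
        return len(record_tokens & post_tokens) / len(record_tokens)

    best = max(reference_set, key=score)
    return best if score(best) > 0 else None


def extract_price(post):
    """Step 2 (stand-in): extract a dollar amount, which no reference set lists."""
    match = re.search(r"\$\s?(\d[\d,]*)", post)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def to_row(post, reference_set):
    """Build one relational row: standardized attributes plus extracted price."""
    record = align(post, reference_set)
    row = dict(record) if record else {"make": None, "model": None}
    row["price"] = extract_price(post)
    return row


if __name__ == "__main__":
    print(to_row("93 civic 4dr, runs great $1,200 obo", REFERENCE_SET))
```

The matched reference record supplies both the schema (`make`, `model`) and the standard values, even though the post never mentions the make. A realistic system would also need to tolerate misspellings and abbreviations, which this sketch does not attempt.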

Publisher

AI Access Foundation

Subject

Artificial Intelligence

Cited by 21 articles.

1. Two-stage Detection of Semantic Redundancies in RDF Data;Journal of Web Engineering;2023-03-19

2. Probabilistic Methods for Enhancing Foreground Segmentation of Various Data Model using Big Data Model;International Journal of Advanced Research in Science, Communication and Technology;2023-01-23

3. Natural Language Interface for Covid-19 Amharic Database Using LSTM Encoder Decoder Architecture with Attention;2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA);2021-11-22

4. An Intelligent System for Identifying Influential Words in Real-Estate Classifieds;Journal of Intelligent Systems;2018-03-28

5. Linking Heterogeneous Data in the Semantic Web Using Scalable and Domain-Independent Candidate Selection;IEEE Transactions on Knowledge and Data Engineering;2017-01-01