Affiliation:
1. University of Pittsburgh, Pittsburgh, PA
2. Google
Abstract
Parsing the semantic structure of a web page is a key component of web information extraction. Successful extraction algorithms usually require large-scale training and evaluation datasets, which are difficult to acquire. Recently, crowdsourcing has proven to be an effective method of collecting large-scale training data in domains that do not require much domain knowledge. For more complex domains, researchers have proposed sophisticated quality control mechanisms to replicate tasks in parallel or sequential ways and then aggregate responses from multiple workers. Conventional annotation integration methods often put more trust in the workers with high historical performance; thus, they are called performance-based methods. Recently, Rzeszotarski and Kittur have demonstrated that behavioral features are also highly correlated with annotation quality in several crowdsourcing applications. In this article, we present a new crowdsourcing system, called Wernicke, to provide annotations for web information extraction. Wernicke collects a wide set of behavioral features and, based on these features, predicts annotation quality for a challenging task domain: annotating web page structure. We evaluate the effectiveness of quality control using behavioral features through a case study where 32 workers annotate 200 Q&A web pages from five popular websites. In doing so, we discover several things: (1) Many behavioral features are significant predictors for crowdsourcing quality. (2) The behavioral-feature-based method outperforms performance-based methods in recall prediction, while performing equally with precision prediction. In addition, using behavioral features is less vulnerable to the cold-start problem, and the corresponding prediction model is more generalizable for predicting recall than precision for cross-website quality analysis. (3) One can effectively combine workers’ behavioral information and historical performance information to further reduce prediction errors.
Publisher
Association for Computing Machinery (ACM)
Subject
Artificial Intelligence,Theoretical Computer Science
Cited by
12 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献