Affiliation:
1. NICTA Victoria Research Laboratory and The University of Melbourne, Australia
2. The University of Melbourne, Australia
Abstract
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this article, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalizing lexical variants. Our method uses a classifier to detect lexical variants, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn't require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
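The candidate generation and selection steps described in the abstract can be illustrated with a minimal sketch. Everything named below is an assumption made for illustration and is not taken from the article: the toy lexicon, the digit-to-sound table, the crude consonant-skeleton `phonetic_key`, the edit-distance threshold, and the use of `difflib.SequenceMatcher` for ranking. Detection is reduced here to a plain lexicon lookup in place of the classifier, and the context-based part of the selection is omitted entirely.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon; a real system would use a full dictionary.
LEXICON = ["tomorrow", "together", "before", "people", "please", "see", "you"]

# Assumed digit-to-sound substitutions common in informal text.
DIGIT_SOUNDS = {"2": "to", "4": "for", "8": "ate"}
VOWELS = set("aeiou")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phonetic_key(word: str) -> str:
    """Crude stand-in for a phonetic code: expand digits, keep the first
    character, drop later vowels, and skip a consonant equal to the last kept one."""
    expanded = "".join(DIGIT_SOUNDS.get(ch, ch) for ch in word.lower())
    if not expanded:
        return ""
    key = [expanded[0]]
    for ch in expanded[1:]:
        if ch in VOWELS:
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def generate_candidates(token: str, lexicon=LEXICON, max_edits: int = 2) -> list:
    """Keep lexicon words that are close in spelling or share a phonetic key."""
    token_key = phonetic_key(token)
    return [
        word
        for word in lexicon
        if edit_distance(token.lower(), word) <= max_edits
        or phonetic_key(word) == token_key
    ]


def normalize(token: str, lexicon=LEXICON) -> str:
    """Return the token unchanged if it is in the lexicon; otherwise pick the
    candidate with the highest string similarity (context is ignored here)."""
    if token.lower() in lexicon:
        return token
    candidates = generate_candidates(token, lexicon)
    if not candidates:
        return token
    return max(candidates, key=lambda w: SequenceMatcher(None, token.lower(), w).ratio())
```

Under these assumptions, `normalize("tmrw")` and `normalize("b4")` resolve through the shared phonetic key, while `normalize("2morrow")` is also reachable through edit distance. A fuller treatment would add the context signal that the abstract describes for choosing among competing candidates.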
Funder
Australian Research Council
Communication and Digital Economy
Publisher
Association for Computing Machinery (ACM)
Subject
Artificial Intelligence, Theoretical Computer Science