Affiliation:
1. Integrasco AS, Hasseldalen 3, 4878 Grimstad, Norway
2. School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6
3. Department of ICT, University of Agder, Grimstad, Norway
Abstract
This paper deals with the problem of language detection and tracking in multilingual online short word-of-mouth (WoM) discussions. This problem is particularly unusual and difficult from a pattern recognition perspective because, in these discussions, the participants and content involve the opinions of users from all over the world. The nature of these discussions, consisting of multiple topics in different languages, presents us with the problem of finding training and classification strategies when the class-conditional distributions are nonstationary. The difficulties in solving the problem are manifold. First of all, the analyst has no knowledge of when one language stops and when the next starts. Further, the features which one uses for any one language (for example, the n-grams) will not be valid for recognizing another. Finally, and most importantly, in most real-life applications, such as in WoM, the fragments of text available before the switching are so small that they render any meaningful classification using traditional estimation methods almost futile. Earlier, the authors [B. J. Oommen and L. Rueda, Patt. Recogn. 39(1) (2006) 328–341] had recommended that, for a variety of problems, the use of strong estimators (i.e. estimators that converge with probability 1) is sub-optimal. In this vein, we propose to solve the current problem using novel estimators that are pertinent for nonstationary environments. The classification results obtained for various data sets, which involve as many as eight languages, demonstrate that our proposed methodology is both powerful and efficient.
Publisher
World Scientific Pub Co Pte Lt
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Software
Cited by
4 articles.