Multilingual native language identification

Published:2015-12-02 Issue:2 Volume:23 Page:163-215
ISSN:1351-3249
Container-title:Natural Language Engineering
language:en
Short-container-title:Nat. Lang. Eng.

Author:

Malmasi, Shervin; Dras, Mark

Abstract

We present the first comprehensive study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author’s first language using only their writings in a second language, with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but there is a need to apply NLI to other languages, not only to gauge its applicability but also to aid in teaching research for other emerging languages. With this goal, we identify six typologically very different sources of non-English second language data and conduct six experiments using a set of commonly used features. Our first two experiments evaluate our features and corpora, showing that the features perform well and at similar rates across languages. The third experiment compares non-native and native control data, showing that they can be discerned with 95 per cent accuracy. Our fourth experiment provides a cross-linguistic assessment of how the degree of syntactic data encoded in part-of-speech tags affects their efficiency as classification features, finding that most differences between first language groups lie in the ordering of the most basic word categories. We also tackle two questions that have not previously been addressed for NLI. Other work in NLI has shown that ensembles of classifiers over feature types work well and in our final experiment we use such an oracle classifier to derive an upper limit for classification accuracy with our feature set. We also present an analysis examining feature diversity, aiming to estimate the degree of overlap and complementarity between our chosen features employing an association measure for binary data. Finally, we conclude with a general discussion and outline directions for future work.
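The two ideas in the closing sentences of the abstract, an oracle upper bound over per-feature classifiers and an association measure over their binary (correct/incorrect) outcomes, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the association measure, so Yule's Q is used as one common choice for 2×2 tables, and the labels, predictions, and function names are hypothetical toy data rather than anything from the paper.

```python
from typing import Sequence


def oracle_accuracy(gold: Sequence[str], predictions: Sequence[Sequence[str]]) -> float:
    """Share of items that at least one classifier labels correctly.

    This is an upper limit for any ensemble that can only pick among
    the outputs of the given classifiers.
    """
    hits = sum(
        any(preds[i] == label for preds in predictions)
        for i, label in enumerate(gold)
    )
    return hits / len(gold)


def yules_q(gold: Sequence[str], preds_a: Sequence[str], preds_b: Sequence[str]) -> float:
    """Yule's Q over the correct/incorrect outcomes of two classifiers.

    Q near +1: the classifiers tend to be right on the same items (overlap).
    Q near -1: they tend to be right on different items (complementarity).
    """
    n11 = n00 = n10 = n01 = 0
    for label, a, b in zip(gold, preds_a, preds_b):
        a_ok, b_ok = a == label, b == label
        if a_ok and b_ok:
            n11 += 1
        elif a_ok:
            n10 += 1
        elif b_ok:
            n01 += 1
        else:
            n00 += 1
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Yule's Q is undefined for this 2x2 table")
    return (n11 * n00 - n01 * n10) / denominator


# Hypothetical toy data: gold first-language labels and the outputs of two
# classifiers, each imagined as trained on a different feature type.
gold = ["zh", "es", "ar", "zh", "es", "ar"]
word_preds = ["zh", "es", "zh", "zh", "ar", "es"]
pos_preds = ["zh", "ar", "ar", "es", "es", "es"]

print(oracle_accuracy(gold, [word_preds, pos_preds]))
print(yules_q(gold, word_preds, pos_preds))
```

In this toy case each classifier alone is right on half of the items, yet the oracle is right on five of six, and the negative Q value reflects that the two classifiers are mostly correct on different items.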

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software

References: 129 articles.

1. Building a large annotated corpus of English: the Penn Treebank;Marcus;Computational Linguistics;1993

2. On the Methods of Measuring Association Between Two Attributes

3. On Association Coefficients for 2×2 Tables and Properties That Do Not Depend on the Marginal Distributions

Cited by 12 articles.

1. Homograph Language Identification Using Machine Learning Techniques;Proceedings of International Conference on Data Science and Applications;2023

2. Natural Language Processing and Language Learning;The Encyclopedia of Applied Linguistics;2021-12-20

3. Exploiting native language interference for native language identification;Natural Language Engineering;2020-11-26

4. Native Language Identification of Fluent and Advanced Non-Native Writers;ACM Transactions on Asian and Low-Resource Language Information Processing;2020-07-31

5. CAG: Stylometric Authorship Attribution of Multi-Author Documents Using a Co-Authorship Graph;IEEE Access;2020