Affiliation:
1. Dipartimento di Lingue, Letterature, Culture e Mediazioni, Università degli Studi di Milano, Milano, Italy
2. Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy
Abstract
Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i.e. that of analysing the internals of an NLI classifier trained by an explainable machine learning (EML) algorithm, in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena ‘give a speaker’s native language away’. We use this perspective to tackle both NLI and a (much less researched) companion task, i.e. guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners’ essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, i.e. which are most indicative of a speaker’s L1; our experiments indicate that the most discriminative features are the lexical ones, followed by the morphological, syntactic, and statistical features, in this order. We also present two case studies, one on Italian and one on Spanish learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s; we show that the traits identified as most discriminative align well with our intuition, i.e. they represent typical patterns of language misuse, underuse, or overuse by speakers of the given L1. Overall, our study shows that EML can be a valuable tool for the scholar who investigates interlanguage facts and language transfer.
Publisher
Oxford University Press (OUP)
Subject
Computer Science Applications,Linguistics and Language,Language and Linguistics,Information Systems