Investigating cross-lingual training for offensive language detection

Published: 2021-06-25 Volume: 7 Page: e559
ISSN: 2376-5992
Container-title: PeerJ Computer Science
Language: en

Author:

Pelicon Andraž¹², Shekhar Ravi³, Škrlj Blaž¹², Purver Matthew¹³, Pollak Senja¹

Affiliation:

1. Jožef Stefan Institute, Ljubljana, Slovenia

2. Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

3. Queen Mary University of London, London, United Kingdom

Abstract

Platforms that feature user-generated content (social media, online forums, newspaper comment sections etc.) have to detect and filter offensive speech within large, fast-changing datasets. While many automatic methods have been proposed and achieve good accuracies, most of these focus on the English language, and are hard to apply directly to languages in which few labeled datasets exist. Recent work has therefore investigated the use of cross-lingual transfer learning to solve this problem, training a model in a well-resourced language and transferring to a less-resourced target language; but performance has so far been significantly less impressive. In this paper, we investigate the reasons for this performance drop, via a systematic comparison of pre-trained models and intermediate training regimes on five different languages. We show that using a better pre-trained language model results in a large gain in overall performance and in zero-shot transfer, and that intermediate training on other languages is effective when little target-language data is available. We then use multiple analyses of classifier confidence and language model vocabulary to shed light on exactly where these gains come from and gain insight into the sources of the most typical mistakes.

Funder

European Union’s Horizon

European Union’s Rights, Equality and Citizenship Program

EPSRC

Slovenian Research Agency

Publisher

PeerJ

Subject

General Computer Science

Link

https://peerj.com/articles/cs-559.pdf

Reference: 78 articles.

1. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond;Artetxe;Transactions of the Association for Computational Linguistics,2019

2. RuG@ EVALITA 2018: hate speech detection in Italian social media;Bai,2018

3. CrotoneMilano for AMI at Evalita2018: a performant, cross-lingual misogyny detection system;Basile,2018

4. SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter;Basile,2019

5. Ethnic cleansing in Myanmar: the Rohingya crisis and human rights;Beyrer;The Lancet,2017

Cited by 4 articles.

1. A survey on multi-lingual offensive language detection;PeerJ Computer Science;2024-03-29

2. A Comprehensive Review on Transformers Models For Text Classification;2023 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC);2023-09-27

3. Investigating toxicity changes of cross-community redditors from 2 billion posts and comments;PeerJ Computer Science;2022-08-18

4. BERT Models for Arabic Text Classification: A Systematic Review;Applied Sciences;2022-06-04