BERTuit: Understanding Spanish language in Twitter with transformers

Published: 2023-07-24 Issue: 9 Volume: 40
ISSN: 0266-4720
Container-title: Expert Systems
Language: en
Short-container-title: Expert Systems

Author:

Huertas‐Tato Javier¹, Martín Alejandro¹, Camacho David¹

Affiliation:

1. Departamento de Informática, Universidad Politécnica de Madrid, Madrid, Spain

Abstract

The appearance of complex attention‐based language models such as BERT, RoBERTa or GPT‐3 has made it possible to address highly complex tasks in a plethora of scenarios. However, when applied to specific domains, these models encounter considerable difficulties. This is the case for social networks such as Twitter, an ever‐changing stream of information written in informal and complex language, where each message requires careful evaluation to be understood, even by humans, given the important role that context plays. Addressing tasks in this domain through Natural Language Processing involves severe challenges. When powerful state‐of‐the‐art multilingual language models are applied to this scenario, language‐specific nuances get lost in translation. To face these challenges, we present BERTuit, the largest transformer proposed so far for the Spanish language, pre‐trained on a massive dataset of 230 million Spanish tweets using RoBERTa optimization. Our motivation is to provide a powerful resource for better understanding Spanish Twitter, to be used in applications focused on this social network, with special emphasis on solutions devoted to tackling the spread of misinformation on this platform. BERTuit is evaluated on several tasks and compared against M‐BERT, XLM‐RoBERTa and XLM‐T, which are very competitive multilingual transformers. The utility of our approach is shown with two applications: an unsupervised methodology to visualize groups of hoaxes, and supervised profiling of authors spreading disinformation.

Funder

Ministerio de Ciencia e Innovación

Comunidad de Madrid

European Commission

Publisher

Wiley

Subject

Artificial Intelligence, Computational Theory and Mathematics, Theoretical Computer Science, Control and Systems Engineering

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1111/exsy.13404

Cited by 1 article.

1. Regionalized models for Spanish language variations based on Twitter; Language Resources and Evaluation; 2023-03-02