A Systematic Review of Toxicity in Large Language Models: Definitions, Datasets, Detectors, Detoxification Methods and Challenges

Published: 2024-07-15

Author:

Guillermo Villate-Castillo¹, Javier Del Ser Lorente², Borja Sanz Urquijo³

Affiliation:

1. Tecnalia

2. University of the Basque Country

3. University of Deusto

Abstract

The emergence of the transformer architecture has ushered in a new era of possibilities, showcasing remarkable capabilities in generative tasks, as exemplified by models like GPT-4o, Claude 3, and Llama 3. However, these advancements come with a caveat: predominantly trained on data gleaned from social media platforms, these systems inadvertently perpetuate societal biases and toxicity. Recognizing the paramount importance of AI Safety and Alignment, our study embarks on a thorough exploration through a comprehensive literature review focused on toxic language. Delving into various definitions, detection methodologies, and mitigation strategies, we aim to shed light on the complexities of this issue. While our focus primarily centres on transformer-based architectures, we also acknowledge and incorporate existing research within the realm of deep learning. Through our investigation, we uncover a multitude of challenges inherent in toxicity mitigation and detection models. These challenges range from inherent biases and generalization issues to the need for standardized definitions of toxic language and for quality assurance of dataset annotations. Furthermore, we emphasize the significance of transparent annotation processes, the resolution of annotation disagreements, and the enhancement of the robustness of Large Language Models (LLMs). Additionally, we advocate for the creation of standardized benchmarks to gauge the effectiveness of toxicity mitigation and detection methods. Addressing these challenges is not just imperative, but pivotal in advancing the development of safer and more ethically aligned AI systems.

Publisher

Springer Science and Business Media LLC

References (279 articles)

1. Suler, John (2004) The online disinhibition effect. Cyberpsychology & Behavior 7(3): 321-326. Mary Ann Liebert, Inc.

2. Xing, Xiaodan and others (2024) When AI Eats Itself: On the Caveats of Data Pollution in the Era of Generative AI. arXiv preprint arXiv:2405.09597

3. Sheth, Amit and others (2022) Defining and detecting toxicity on social media: context and knowledge are key. Neurocomputing 490: 312-318. https://doi.org/10.1016/j.neucom.2021.11.095

4. Lees, Alyssa and others (2022) A New Generation of Perspective API: Efficient Multilingual Character-level Transformers. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), Washington DC, USA: 3197-3207. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3534678.3539147

5. Jiang, Jiachen (2020) A Critical Audit of Accuracy and Demographic Biases within Toxicity Detection Tools. Dartmouth College Undergraduate Theses