A Systematic Review of Toxicity in Large Language Models: Definitions, Datasets, Detectors, Detoxification Methods and Challenges

Authors:

Guillermo Villate-Castillo1, Javier Del Ser Lorente2, Borja Sanz Urquijo3

Affiliations:

1. Tecnalia

2. University of the Basque Country

3. University of Deusto

Abstract

The emergence of the transformer architecture has ushered in a new era of possibilities, with models such as GPT-4o, Claude 3, and Llama 3 showing remarkable generative capabilities. However, these advancements come with a caveat: because these systems are predominantly trained on data gleaned from social media platforms, they inadvertently perpetuate societal biases and toxicity. Recognizing the paramount importance of AI safety and alignment, our study presents a comprehensive literature review on toxic language, examining its definitions, detection methodologies, and mitigation strategies. While our focus centres primarily on transformer-based architectures, we also incorporate existing research from the broader deep learning literature. Our investigation uncovers a multitude of challenges inherent in toxicity detection and mitigation models, ranging from inherent biases and generalization issues to the need for standardized definitions of toxic language and quality assurance of dataset annotations. Furthermore, we emphasize the significance of transparent annotation processes, the resolution of annotation disagreements, and the enhancement of the robustness of Large Language Models (LLMs). Additionally, we advocate for the creation of standardized benchmarks to gauge the effectiveness of toxicity mitigation and detection methods. Addressing these challenges is pivotal to the development of safer and more ethically aligned AI systems.

Publisher

Springer Science and Business Media LLC

References: 279 articles.

