Abstract
The emergence of the transformer architecture has ushered in a new era of possibilities, showcasing remarkable capabilities in generative tasks exemplified by models like GPT4o, Claude 3, and Llama 3. However, these advancements come with a caveat: predominantly trained on data gleaned from social media platforms, these systems inadvertently perpetuate societal biases and toxicity. Recognizing the paramount importance of AI Safety and Alignment, our study embarks on a thorough exploration through a comprehensive literature review focused on toxic language. Delving into various definitions, detection methodologies, and mitigation strategies, we aim to shed light on the complexities of this issue. While our focus primarily centres on transformer-based architectures, we also acknowledge and incorporate existing research within the realm of deep learning. Through our investigation, we uncover a multitude of challenges inherent in toxicity mitigation and detection models. These challenges range from inherent biases and generalization issues to the necessity for standardized definitions of toxic language and the quality assurance of dataset annotations. Furthermore, we emphasize the significance of transparent annotation processes, resolution of annotation disagreements, and the enhancement of Large Language Models (LLMs) robustness. Additionally, we advocate for the creation of standardized benchmarks to gauge the effectiveness of toxicity mitigation and detection methods. Addressing these challenges is not just imperative, but pivotal in advancing the development of safer and more ethically aligned AI systems.