Affiliation:
1. University of Illinois Chicago
2. University of Michigan
Abstract
Potential harms from the under-representation of minorities in data, particularly in multi-modal settings, are a well-recognized concern. While there has been extensive effort in detecting such under-representation, resolving it has remained a challenge.
With recent advancements in generative AI, large language and foundation models have emerged as versatile tools across various domains. In this paper, we propose Chameleon, a system that efficiently utilizes these tools to augment a dataset with a minimal number of synthetically generated tuples, enhancing the coverage of under-represented groups. Our system applies quality and outlier-detection tests to ensure the quality and semantic integrity of the generated tuples. To minimize the chance that generated tuples are rejected, we propose multiple strategies for guiding the foundation model. Our experimental results, in addition to confirming the efficiency of our proposed algorithms, illustrate our approach's effectiveness: a model's unfairness in a downstream task drops significantly after data repair using Chameleon.
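The repair loop the abstract describes can be viewed as rejection sampling over generated tuples: generate a candidate for the under-represented group, keep it only if it passes validation, and stop once coverage is sufficient. The sketch below is a hypothetical illustration, not the paper's actual method: `stub_generator` stands in for the foundation model, and the score-range check stands in for Chameleon's quality and outlier-detection tests.

```python
import random

def coverage(data, group_key, group_value):
    """Fraction of tuples that belong to the given group."""
    return sum(1 for t in data if t[group_key] == group_value) / len(data)

def passes_checks(candidate, data, group_key, group_value):
    """Toy stand-in for the quality and outlier-detection tests: accept a
    candidate only if its score falls inside the range already observed
    for the group."""
    scores = [t["score"] for t in data if t[group_key] == group_value]
    return min(scores) <= candidate["score"] <= max(scores)

def augment(data, generate, group_key, group_value, target, max_tries=2000):
    """Append synthetic tuples for an under-represented group until its
    coverage reaches `target`, rejecting candidates that fail the checks."""
    data = list(data)
    for _ in range(max_tries):
        if coverage(data, group_key, group_value) >= target:
            break
        candidate = generate(group_value)
        if passes_checks(candidate, data, group_key, group_value):
            data.append(candidate)
    return data

# Usage: group "B" is under-represented (10% of tuples); a stub generator
# stands in for the foundation model.
random.seed(0)
base = ([{"group": "A", "score": random.random()} for _ in range(90)]
        + [{"group": "B", "score": 0.5 + 0.1 * random.random()} for _ in range(10)])

def stub_generator(group_value):
    return {"group": group_value, "score": random.random()}

repaired = augment(base, stub_generator, "group", "B", target=0.2)
```

Because rejected candidates cost a generation call, a low acceptance rate is expensive, which motivates the abstract's strategies for guiding the foundation model toward tuples likely to pass validation.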
Publisher
Association for Computing Machinery (ACM)