Abstract
Background
Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in healthcare, ranging from automating administrative tasks to augmenting clinical decision-making. However, these models also pose a serious danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care.

Methods
Using the Azure OpenAI API, we tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain: medical education, diagnostic reasoning, plan generation, and patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in healthcare. GPT-4 estimates of the demographic distribution of medical conditions were compared to true U.S. prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups.

Findings
We find that GPT-4 does not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardized clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and gender identities. Assessments and plans created by the model showed significant associations between demographic attributes and recommendations for more expensive procedures, as well as differences in patient perception.

Interpretation
Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools like GPT-4 for every intended use case before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies prior to clinical implementation.
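The comparison described in the Methods, between model-generated demographic distributions and true U.S. prevalence, can be illustrated with a minimal sketch. The grouping, counts, and proportions below are hypothetical placeholders, not data from the study; a chi-square goodness-of-fit test is one standard choice for this kind of comparison.

```python
# Hypothetical sketch (not the authors' code): test whether the demographic
# distribution implied by GPT-4-generated vignettes for a condition deviates
# from a reference U.S. prevalence distribution.
from scipy.stats import chisquare

# Illustrative counts of generated vignettes by demographic group (assumed values).
model_counts = [72, 18, 6, 4]

# Illustrative reference population proportions for the same groups (assumed values).
reference_props = [0.60, 0.19, 0.13, 0.08]

total = sum(model_counts)
expected_counts = [p * total for p in reference_props]

# Chi-square goodness-of-fit: does the model's distribution differ from prevalence?
stat, p_value = chisquare(f_obs=model_counts, f_exp=expected_counts)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```

A significant result under such a test would indicate that the generated vignettes over- or under-represent certain groups relative to prevalence, which is the pattern the Findings describe.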
Publisher
Cold Spring Harbor Laboratory