Data Science Using OpenAI: Testing Their New Capabilities Focused on Data Science


Guerra Pires Jorge


Introduction: Statistics is ubiquitous across academic disciplines, including the life sciences, yet many researchers without statistical training struggle to apply statistical analysis correctly, which leads to fundamental errors in their work. Because statistics is both complex and central to scientific research, researchers from varied backgrounds need a tool that lets them conduct sound statistical analysis without being experts in the field. This paper introduces OpenAI's latest API, known as the "coder interpreter," and evaluates its potential to fill this need.

Methods: The coder interpreter API is designed to understand natural-language commands, process CSV data files, and perform statistical analyses by selecting appropriate methods and libraries. Unlike traditional statistical software, it requires minimal input from the user, often just a straightforward question or command. We tested the API with real datasets to demonstrate its capabilities, focusing on ease of use for non-statisticians and on its potential to improve research output, particularly in evidence-based medicine.

Results: The coder interpreter API used open-source Python libraries, which are well known for their extensive data science resources, to execute statistical analyses accurately on the provided datasets. Practical examples, including a study involving diabetic patients, showed the API helping non-expert researchers interpret and use data in their research.

Discussion: Integrating AI-based tools such as OpenAI's coder interpreter API into the research process can transform how scientific data are analyzed. By lowering the barrier to advanced statistics, such tools let researchers focus on substantive research questions, including researchers in fields such as evidence-based medicine, where practitioners are often also medical doctors. This paper highlights the potential for these tools to be adopted broadly by novices and experts alike, thereby improving the overall quality of statistical analysis in scientific research. We advocate wider implementation of this technology as a step towards democratizing access to sophisticated statistical inference and data analysis.
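To make the workflow in the Methods concrete, the following is a minimal sketch of the kind of analysis code such a tool might produce for a question like "Do the two groups differ in mean glucose?". It is an illustration under stated assumptions, not the paper's method and not the OpenAI API: the function name `welch_t`, the column names, and the toy CSV values are all hypothetical and invented here. It uses only the Python standard library so that it runs anywhere; a generated analysis would more plausibly rely on open-source data science libraries.

```python
import csv
import io
import math
import statistics

# Hypothetical toy data, invented purely for illustration (not from the paper).
TOY_CSV = """group,glucose
diabetic,150
diabetic,160
diabetic,170
control,90
control,100
control,110
"""


def welch_t(csv_text, group_col, value_col, group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    samples = {group_a: [], group_b: []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[group_col] in samples:
            samples[row[group_col]].append(float(row[value_col]))
    a, b = samples[group_a], samples[group_b]

    # Squared standard error of each group mean: sample variance / n.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)

    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_stat, dof = welch_t(TOY_CSV, "group", "glucose", "diabetic", "control")
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Welch's test is chosen here only because it does not assume equal variances between groups; the source does not say which tests the API selected.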


Qeios Ltd







