Comparing programming languages for data analytics: Accuracy of estimation in Python and R

Published: 2024-02-02 Issue: 3 Volume: 14
ISSN:1942-4787
Container-title:WIREs Data Mining and Knowledge Discovery
language:en
Short-container-title:WIREs Data Min & Knowl

Author:

Hill Chelsey¹, Du Lanqing², Johnson Marina¹, McCullough B. D.²

Affiliation:

1. Feliciano School of Business Montclair State University Montclair New Jersey USA

2. Gerri C. LeBow College of Business Drexel University Philadelphia Pennsylvania USA

Abstract

Several open-source programming languages, particularly R and Python, are used in industry and academia for statistical data analysis, data mining, and machine learning. While most commercial software programs and programming languages provide a single way to deliver a statistical procedure, open-source programming languages have multiple libraries and packages offering many ways to complete the same analysis, often with varying results. Applying the same statistical method across these different libraries and packages can lead to entirely different solutions because of differences in their implementations. Therefore, reliability and accuracy should be essential considerations when choosing libraries and packages for statistical analysis in open-source programming languages. Instead, most users take this for granted, assuming that their chosen libraries and packages produce accurate results. To this end, this study assesses the estimation accuracy and reliability of various Python and R libraries and packages by evaluating the univariate summary statistics, analysis of variance (ANOVA), and linear regression procedures using benchmarking data from the National Institute of Standards and Technology (NIST). Further, experimental results are presented comparing machine learning methods for classification and regression. The libraries and packages assessed in this study include the stats package in R and Pandas, Statistics, NumPy, statsmodels, SciPy, scikit-learn, and pingouin in Python. The results show that the stats package in R and the statsmodels library in Python are reliable for univariate summary statistics. In contrast, Python's scikit-learn library produces the most accurate results and is recommended for ANOVA. Among the libraries and packages assessed for linear regression, the results demonstrate that the stats package in R is more reliable, accurate, and flexible; thus, it is recommended for linear regression analysis. Further, we present results and recommendations for machine learning using R and Python.

This article is categorized under:

Algorithmic Development > Statistics

Application Areas > Data Mining Software Tools
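
The abstract's point that the same statistic can come out differently depending on the implementation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the data are synthetic (a small spread on top of a large offset, not a NIST dataset), the helper names `naive_variance`, `exact_variance`, and `lre` are hypothetical, and the log relative error is used here only as one common way to count correct significant digits. Only the Python standard library is used.

```python
import math
import statistics
from fractions import Fraction


def naive_variance(xs):
    """One-pass textbook sample variance: (sum(x^2) - (sum x)^2 / n) / (n - 1)."""
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    return (ss - s * s / n) / (n - 1)


def exact_variance(xs):
    """Reference sample variance of the stored floats, in exact rational arithmetic."""
    fx = [Fraction(x) for x in xs]
    n = len(fx)
    mean = sum(fx) / n
    return float(sum((x - mean) ** 2 for x in fx) / (n - 1))


def lre(estimate, reference, cap=15.0):
    """Log relative error, clipped to [0, cap]; roughly the count of correct digits."""
    if estimate == reference:
        return cap
    rel = abs(estimate - reference) / abs(reference)
    return min(cap, max(0.0, -math.log10(rel)))


# Synthetic data (an assumption, not a benchmark dataset): large offset, tiny spread.
data = [1e9 + d for d in (0.1, 0.2, 0.3)] * 100

reference = exact_variance(data)
lre_naive = lre(naive_variance(data), reference)
lre_stats = lre(statistics.variance(data), reference)

print(f"naive one-pass formula: about {lre_naive:.1f} correct digits")
print(f"statistics.variance:    about {lre_stats:.1f} correct digits")
```

On data like this, the one-pass formula subtracts two nearly equal large numbers and loses essentially all significant digits, while `statistics.variance` stays close to the exact reference. The same comparison could be repeated for any other library function by passing its result to `lre`.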

Publisher

Wiley

Link

https://wires.onlinelibrary.wiley.com/doi/pdf/10.1002/widm.1531

References: 29 articles.

1. The reliability of statistical functions in four software packages freely used in numerical computation

2. Markelle K., Longjohn R., & Nottingham K. The UCI Machine Learning Repository. https://archive.ics.uci.edu

3. The accuracy of Mathematica 4 as a statistical package

4. The Numerical Reliability of Econometric Software

Cited by 1 article.

1. Predicting Entrepreneurial Decisions Using Artificial Intelligence within the Digital Economy Context: A CART Algorithm;Proceedings of the 2024 International Conference on Digital Society and Artificial Intelligence;2024-05-24