Automated Bayesian variable selection methods for binary regression models with missing covariate data

Published: 2024-09-13
ISSN: 1863-8155
Container-title: AStA Wirtschafts- und Sozialstatistisches Archiv
Language: en
Short-container-title: AStA Wirtsch Sozialstat Arch

Author:

Bergrab Michael, Aßmann Christian

Abstract

Data collection and the availability of large data sets have increased over the last decades. In both statistical and machine learning frameworks, two methodological issues typically arise when performing regression analysis on large data sets. First, variable selection is crucial in regression modeling, as it helps to identify an appropriate model with respect to the considered set of conditioning variables. Second, especially in the context of survey data, missing values occur even with state-of-the-art data collection and processing methods, so handling them is important for estimation. In this paper, we provide a Bayesian approach based on a spike-and-slab prior for the regression coefficients, which allows variable selection and estimation to be handled simultaneously with missing values in covariate data. The paper also discusses the implementation of the approach using Markov chain Monte Carlo techniques and provides results for simulated data sets and an empirical illustration based on data from the German National Educational Panel Study. The suggested Bayesian approach is compared to other statistical and machine learning frameworks such as Lasso, ridge regression, and elastic net, and is shown to perform well in terms of estimation performance and variable selection accuracy. The simulation results demonstrate that neglecting to handle missing values in data sets can lead to biased selection results. Overall, the proposed Bayesian method offers a holistic, flexible, and powerful framework for variable selection in the presence of missing covariate data.

Funder

Leibniz-Institut für Bildungsverläufe e.V.

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11943-024-00345-1.pdf

References: 76 articles.

1. Albert JH (1992) Bayesian estimation of normal ogive item response curves using Gibbs sampling. J Educ Stat 17(3):251–269. https://doi.org/10.2307/1165149

2. Albert JH, Chib S (1993) Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc 88(422):669–679. https://doi.org/10.1080/01621459.1993.10476321

3. Aßmann C (2012) Determinants and costs of current account reversals under heterogeneity and serial correlation. Appl Econ 44(13):1685–1700. https://doi.org/10.1080/00036846.2011.554370

4. Aßmann C, Boysen-Hogrefe J (2011) A Bayesian approach to model-based clustering for binary panel probit models. Comput Stat Data Anal 55(1):261–279. https://doi.org/10.1016/j.csda.2010.04.016

5. Aßmann C, Gaasch JC, Stingl D (2023) A Bayesian approach towards missing covariate data in multilevel latent regression models. Psychometrika 88:1495–1528. https://doi.org/10.1007/s11336-022-09888-0