Affiliation:
1. Built Environment Solutions unit, Finnish Environment Institute (Syke), Latokartanonkaari 11, Helsinki, Finland
Abstract
Purpose
The purpose of this study is to develop and compare model choice strategies in the context of logistic regression. Model choice here means choosing which covariates to include in the model.
Design/methodology/approach
The study is based on Monte Carlo simulations. The methods are compared in terms of three measures of accuracy: specificity and two kinds of sensitivity. A loss function combining sensitivity and specificity is introduced and used for a final comparison.
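As a rough illustration of how such accuracy measures can be computed for one simulated data set, the sketch below scores a selected covariate set against the known set of truly active covariates. The abstract does not define the two kinds of sensitivity or the form of the loss function, so the single sensitivity measure, the function names, and the weighted loss with weight `w` are all assumptions made for illustration, not the paper's definitions.

```python
def selection_accuracy(selected, truly_active, n_covariates):
    """Sensitivity and specificity of a covariate selection.

    selected      -- indices of covariates chosen by the method
    truly_active  -- indices of covariates with a nonzero true effect
    n_covariates  -- total number of candidate covariates
    """
    selected = set(selected)
    active = set(truly_active)
    inactive = set(range(n_covariates)) - active

    true_positives = len(selected & active)      # active and selected
    true_negatives = len(inactive - selected)    # inactive and left out

    # Guard against empty groups (e.g. a scenario with no active covariates).
    sensitivity = true_positives / len(active) if active else float("nan")
    specificity = true_negatives / len(inactive) if inactive else float("nan")
    return sensitivity, specificity


def weighted_loss(sensitivity, specificity, w=0.5):
    """Hypothetical loss: w weights missed covariates, 1 - w false inclusions."""
    return w * (1.0 - sensitivity) + (1.0 - w) * (1.0 - specificity)


# Example: 6 candidate covariates, 0-2 truly active, method picked 0, 1 and 4.
sens, spec = selection_accuracy([0, 1, 4], [0, 1, 2], 6)
loss = weighted_loss(sens, spec, w=0.5)
```

In a Monte Carlo study these quantities would be averaged over many simulated data sets per scenario; varying `w` expresses how strongly a user values sensitivity relative to specificity.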
Findings
The choice of method depends on how much weight the user places on sensitivity relative to specificity, and on the sample size. For a typical logistic regression setting with a moderate sample size and a small to moderate effect size, BIC, BICc or Lasso seems to be optimal.
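For readers unfamiliar with the criterion, the following minimal sketch shows how BIC-based model choice works in principle, using the standard definition BIC = k ln(n) - 2 ln(L). The candidate models, their log-likelihoods, and the helper names are hypothetical; BICc is left out because the abstract does not give its definition.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Standard Bayesian information criterion; smaller is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def choose_by_bic(candidates, n_obs):
    """Return the name of the candidate with the lowest BIC.

    candidates -- dict mapping a model name to (log_likelihood, n_params)
    """
    return min(candidates, key=lambda name: bic(*candidates[name], n_obs))


# Hypothetical fitted logistic models on n = 100 observations.
# The larger model fits slightly better but pays a higher penalty.
candidates = {
    "x1 + x2": (-52.0, 3),            # intercept + 2 covariates
    "x1 + x2 + x3": (-51.5, 4),       # intercept + 3 covariates
}
best = choose_by_bic(candidates, n_obs=100)
```

Here the gain of 0.5 in log-likelihood (1.0 on the -2 ln L scale) does not offset the ln(100) penalty for the extra parameter, so the smaller model is chosen.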
Research limitations
Numerical simulations cannot cover the whole range of data-generating processes occurring with real-world data. Thus, more simulations are needed.
Practical implications
Researchers can refer to these results if they believe that their data-generating process is somewhat similar to some of the scenarios presented in this paper. Alternatively, they could run their own simulations and calculate the loss function.
Originality/value
This is a systematic comparison of model choice algorithms and heuristics in the context of logistic regression. The distinction between two types of sensitivity and the comparison based on a loss function are methodological novelties.