Author:
Huo Tianyao,Glueck Deborah H.,Shenkman Elizabeth A.,Muller Keith E.
Abstract
AbstractAlthough superficially similar to data from clinical research, data extracted from electronic health records may require fundamentally different approaches for model building and analysis. Because electronic health record data is designed for clinical, rather than scientific use, researchers must first provide clear definitions of outcome and predictor variables. Yet an iterative process of defining outcomes and predictors, assessing association, and then repeating the process may increase Type I error rates, and thus decrease the chance of replicability, defined by the National Academy of Sciences as the chance of “obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.”[1] In addition, failure to account for subgroups may mask heterogeneous associations between predictor and outcome by subgroups, and decrease the generalizability of the findings. To increase chances of replicability and generalizability, we recommend using a stratified split sample approach for studies using electronic health records. A split sample approach divides the data randomly into an exploratory set for iterative variable definition, iterative analyses of association, and consideration of subgroups. The confirmatory set is used only to replicate results found in the first set. The addition of the word ‘stratified’ indicates that rare subgroups are oversampled randomly by including them in the exploratory sample at higher rates than appear in the population. The stratified sampling provides a sufficient sample size for assessing heterogeneity of association by testing for effect modification by group membership. An electronic health record study of the associations between socio-demographic factors and uptake of hepatic cancer screening, and potential heterogeneity of association in subgroups defined by gender, self-identified race and ethnicity, census-tract level poverty and insurance type illustrates the recommended approach.
Funder
Agency for Healthcare Research and Quality
Patient-Centered Outcomes Research Institute
National Center for Advancing Translational Sciences
National Institutes of Health
Publisher
Springer Science and Business Media LLC
Subject
Health Informatics,Epidemiology
Reference23 articles.
1. Reproducibility and Replicability in Science. Washington, D.C.:National Academies Press; 2019.
2. Picard RR, Cook RD. Cross-validation of regression models. J Am Stat Assoc. 1984;79:575–83.
3. Häyrinen K, Saranto K, Nykänen P. Definition, structure, content, use and impacts of electronic health records: a review of the research literature. Int J Med Informatics. 2008;77:291–304.
4. Callahan A, Shah NH, Chen JH. Research and reporting considerations for observational studies using electronic health record data. Ann Intern Med. 2020;172 11Supplement:79–84.
5. Wells BJ, Chagin KM, Nowacki AS, Kattan MW. Strategies for handling missing data in electronic health record derived data. EGEMS (Wash DC). 2013;1:1035.
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献