BACKGROUND
The secondary use of medical claims data, electronic medical records (EMRs), and electronic health records (EHRs) for clinical epidemiology research is growing rapidly in Japan. Because these data are not collected for research purposes, their secondary use requires an understanding of their limitations, the ability to generate clinical questions, epidemiological skills to construct a study design, and statistical skills to analyze retrospective observational data. Previous reports have provided guidance on the limitations and challenges of using these data in observational clinical epidemiology research. However, knowledge of the statistical methods suited to the secondary use of these data is also essential. Therefore, we performed an exhaustive literature review of existing studies in Japan based on these data to clarify how the data have been applied in clinical epidemiological research.
OBJECTIVE
By investigating existing studies based on claims, EMR, and EHR data in Japan, we aimed to learn (1) what statistical methods were used; (2) in what disease areas these data were used; (3) how frequently each data type was used; (4) which databases were used; (5) what study designs were used; (6) whether the studies were conducted by academic institutions; and (7) what outcomes were assessed.
METHODS
We obtained articles based on claims, EMR, and EHR data by searching PubMed up to June 30, 2021 (the date of the search). The retrieved articles were then screened against the inclusion and exclusion criteria. Finally, we manually extracted the seven categories of information from the full texts of the eligible articles.
RESULTS
Results collected from the 620 target articles suggested that (1) most of the studies were conducted by academic institutions (69.2%); (2) the cohort study, which longitudinally measures outcomes in eligible patients, was the primary design (86%); (3) 95.8% of the studies used claims data; (4) the JMDC (29.2%), DPC database (MHLW) (22.7%), MDV (16.6%), and NDB (10.5%) were the most frequently used databases; (5) infections (16.9%), cardiovascular diseases (16.1%), neoplasms (12.6%), and nutritional and metabolic diseases (12.1%) were the most studied disease areas; (6) treatment patterns (35.2%), physiological/clinical outcomes (29.7%), and mortality (22.1%) were the most frequently assessed outcomes; and (7) multivariate models were commonly used (66.8%). Among the studies in which multivariate models were implemented, most used them for confounder adjustment (90.6%). Logistic regression was the most common choice for assessing many of the outcomes, with the exception of hospitalization/hospital stay and resource use/costs, for both of which linear regression was commonly used. In addition, some studies used propensity score analysis to balance patient backgrounds between groups, and this approach tended to be used when assessing patient mortality.
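To illustrate what confounder adjustment with logistic regression means in practice, the following minimal Python sketch compares a crude odds ratio with one adjusted for a single binary confounder. Everything in it is an assumption made for illustration: the counts in `ROWS` are invented and do not come from any of the reviewed studies, the variable names (`treated`, `severe`) are hypothetical, and the model is fitted with a hand-written Newton-Raphson routine rather than the statistical software the reviewed studies may have used.

```python
import math

# Hypothetical grouped data: (treated, severe, n_patients, n_events).
# Within each severity stratum the event rate is identical for treated and
# untreated patients, but treated patients are more often severe.
ROWS = [
    (0, 0, 800, 80),
    (1, 0, 200, 20),
    (0, 1, 200, 80),
    (1, 1, 800, 320),
]


def crude_odds_ratio(rows):
    """Odds ratio for treatment, ignoring the confounder."""
    events = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for treated, _severe, n, y in rows:
        events[treated] += y
        totals[treated] += n
    odds = {t: events[t] / (totals[t] - events[t]) for t in (0, 1)}
    return odds[1] / odds[0]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (m[i][k] - s) / m[i][i]
    return x


def fit_logistic(rows, iterations=25):
    """Fit logit(p) = b0 + b1*treated + b2*severe by Newton-Raphson."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for treated, severe, n, y in rows:
            x = (1.0, float(treated), float(severe))
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            weight = n * p * (1.0 - p)
            for i in range(3):
                grad[i] += x[i] * (y - n * p)       # score vector
                for j in range(3):
                    info[i][j] += weight * x[i] * x[j]  # information matrix
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


beta = fit_logistic(ROWS)
print("crude OR:   ", round(crude_odds_ratio(ROWS), 3))
print("adjusted OR:", round(math.exp(beta[1]), 3))
```

In this constructed example the crude odds ratio suggests a strong association between treatment and the outcome, whereas the coefficient for `treated` in the adjusted model is essentially zero, because the apparent effect is entirely explained by severity.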
CONCLUSIONS
Our findings provide an overview of the current status and trends in the statistical analysis of these data in clinical epidemiology research. We expect that these results will serve as reference information to help researchers design appropriate studies for the secondary use of claims, EMR, and EHR data in clinical research.