BACKGROUND
The COVID-19 pandemic, characterized by varying lockdown durations across nations and overcrowding in healthcare facilities, has introduced novel challenges for disease forecasting. One pressing issue has been the management of missing data stemming from diverse sources.
OBJECTIVE
To show how the handling of missing data can affect estimates of the COVID-19 incidence rate (CIR).
METHODS
The current study used data from the surveillance system for COVID-19/SARS-CoV-2 patients treated at the National Institute of Hygiene and Epidemiology, Hanoi, Vietnam. We randomly removed values from the daily COVID-19 caseload variable under a missing completely at random (MCAR) mechanism, in proportions from 5% to 30% in increments of 5%. We selected six analytical methods to assess the effects of handling missing data: backfill imputation, moving average, median imputation, maximum likelihood, linear interpolation, and the autoregressive integrated moving average (ARIMA) model.
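The masking-and-imputation procedure described above can be illustrated with a minimal sketch. This is not the study's code: the series is synthetic, the function names (`mask_mcar`, `backfill`, `median_impute`, `linear_interpolate`) are hypothetical, and the error measure printed at the end is a plain mean absolute error used only for illustration, not the study's ACB, RMSE, or APC. Only three of the six methods are shown; maximum likelihood and ARIMA would need a statistics library and are omitted.

```python
import random
import statistics


def mask_mcar(values, frac, seed=0):
    """Return a copy of values with a random fraction set to None (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(frac * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def backfill(values):
    """Fill each gap with the next observed value.

    Assumption: trailing gaps, which have no later observation,
    fall back to the last observed value.
    """
    out = list(values)
    nxt = None
    for i in range(len(out) - 1, -1, -1):  # backward pass
        if out[i] is None:
            out[i] = nxt
        else:
            nxt = out[i]
    prev = None
    for i, v in enumerate(out):  # forward pass for trailing gaps
        if v is None:
            out[i] = prev
        else:
            prev = v
    return out


def median_impute(values):
    """Replace every gap with the median of the observed values."""
    med = statistics.median(v for v in values if v is not None)
    return [med if v is None else v for v in values]


def linear_interpolate(values):
    """Fill gaps on a straight line between the nearest observed neighbours.

    Assumption: gaps at either end take the nearest observed value.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out


# Synthetic daily case counts, for illustration only (not study data).
daily_cases = [12, 15, 11, 18, 22, 19, 25, 30, 28, 35,
               40, 38, 45, 50, 48, 55, 60, 58, 65, 70]

methods = {"backfill": backfill, "median": median_impute,
           "linear": linear_interpolate}

for frac in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    masked = mask_mcar(daily_cases, frac, seed=42)
    for name, impute in methods.items():
        filled = impute(masked)
        # Illustrative error: mean absolute difference from the true series.
        mae = sum(abs(f - t) for f, t in zip(filled, daily_cases)) / len(daily_cases)
        print(f"missing={frac:.0%} method={name} MAE={mae:.2f}")
```

Fixing the seed makes the masked positions reproducible. In practice the masking would likely be repeated over many random draws, with the error averaged across them, so that a single unlucky pattern of gaps does not dominate the comparison.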
RESULTS
During the Zero-COVID period, the median imputation method yielded lower mean absolute crude bias (ACB) and mean crude root mean square error (RMSE) values compared to the other methods, irrespective of the extent of missing data; the median imputation method exhibited the lowest mean absolute percentage change (APC) in the CIR. During the Transition period, the ARIMA model of imputation demonstrated the lowest mean ACB across all levels of missing data and the lowest mean APC values. During the New-normal period, the backfill and linear interpolation methods demonstrated the lowest mean ACB across all levels of missing data and relatively lower mean APC values compared with the other imputation methods.
CONCLUSIONS
Our study emphasizes the importance of choosing the most appropriate missing data handling method, in the context of a specific disease situation, to ensure reliable estimates of the CIR.