BACKGROUND
Pregnancy and gestation information is routinely recorded in various datasets within electronic medical record (EMR) systems in China. The combination of the two data items, i.e. the number of pregnancies and the number of gestations, implies the incidence of abortion and other pregnancy-related issues, which is important for clinical decision-making and personal privacy protection. The distribution of this information within EMRs is variable because of the inconsistent IT structures of EMR systems, and the potential exposure of this sensitive information has never been quantitatively evaluated at a large scale.
OBJECTIVE
We aim to perform the first nationwide quantitative analysis of the identification sites and exposure frequency of sensitive pregnancy and gestation information, in order to propose strategies for effective information extraction and privacy protection related to women’s health.
METHODS
The data extraction study was performed in a national healthcare data network. Rule-based protocols for extracting pregnancy and gestation information were developed by a committee of experts. Six different sub-datasets of EMRs were used as the schema for data analysis and strategy proposal. The identification sites and the frequency of identification in each sub-dataset were calculated. Manual quality inspection of the extraction was then performed by two independent groups of reviewers on 1000 randomly selected records. Based on these statistics, strategies for effective information extraction and privacy protection were proposed.
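The rule-based extraction and the per-sub-dataset frequency calculation described above can be illustrated with a minimal Python sketch. The committee's actual rules are not given here, so the two patterns below (the Chinese shorthand "孕2产1" and the gravida/para shorthand "G2P1"), the function names, and the record layout are all assumptions made for illustration. The sketch also assumes that the frequency for a sub-dataset is the share of identified patients whose information was found in that sub-dataset.

```python
import re

# Hypothetical rules for illustration only; the expert committee's
# actual protocols are not specified in the text.
PATTERNS = [
    re.compile(r"孕\s*(\d+)\s*产\s*(\d+)"),                  # e.g. "孕2产1"
    re.compile(r"\bG\s*(\d+)\s*P\s*(\d+)\b", re.IGNORECASE),  # e.g. "G2P1"
]


def extract_counts(text):
    """Return (pregnancies, deliveries) from the first matching rule, else None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def identification_frequency(records):
    """Compute, per sub-dataset, the share of identified patients found there.

    records: iterable of (patient_id, sub_dataset, text) tuples (assumed layout).
    The denominator is the set of patients identified in at least one sub-dataset.
    """
    hits = {}
    for patient_id, sub_dataset, text in records:
        if extract_counts(text) is not None:
            hits.setdefault(sub_dataset, set()).add(patient_id)
    identified = set().union(*hits.values()) if hits else set()
    return {sub: len(ids) / len(identified) for sub, ids in hits.items()}
```

For instance, `identification_frequency` could be called on tuples such as `(1, "admission", "孕2产1")`. Because one patient can be identified in several sub-datasets, the resulting shares need not sum to 1.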
RESULTS
The data network covers 7,084,339 hospitalized patients from 19 hospitals in 9 provinces of China over a time span of 10 years (2010 to 2020). A total of 688,268 female patients with sensitive reproductive information (SRI) were identified. The frequency of identification varied across sub-datasets and was highest for the marriage history in admission medical records, at 62.74%. Surprisingly, more than 50% of female patients were identified with pregnancy and gestation history in nursing records, which are not generally considered a sub-dataset rich in reproductive information. In the manual curation and review process, 500 cases were selected randomly. The precision and recall of the information extraction method both exceeded 99.5%. The privacy-protection strategies were designed with clear technical directions.
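As a reminder of how the two quality metrics are defined, the following Python sketch computes precision and recall from paired labels. It assumes that each manually reviewed record yields one boolean from the extraction method and one from the reviewers. The function name and the input layout are hypothetical and do not describe the study's actual review tooling.

```python
def precision_recall(extracted, reviewed):
    """Compute (precision, recall) from two equal-length lists of booleans.

    extracted[i]: the extraction method flagged record i as containing the information.
    reviewed[i]:  the manual reviewers judged record i to contain the information.
    """
    if len(extracted) != len(reviewed):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(extracted, reviewed))
    true_pos = sum(1 for e, r in pairs if e and r)
    false_pos = sum(1 for e, r in pairs if e and not r)
    false_neg = sum(1 for e, r in pairs if not e and r)
    # Precision: share of flagged records that the reviewers confirmed.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else float("nan")
    # Recall: share of reviewer-confirmed records that the method flagged.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else float("nan")
    return precision, recall
```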
CONCLUSIONS
Critical information related to women’s health is recorded in large amounts in routine Chinese EMR systems and is distributed across different parts of the records with different frequencies. Extracting and protecting this information therefore requires a thorough protocol, which has been demonstrated to be technically feasible. Implementing a data-based strategy will help enforce the protection of women’s privacy and improve the accessibility of healthcare services.