BACKGROUND
Internet-based media stories provide valuable information on emerging risks for the prevention and control of product-related child injury, but critical methodological challenges and the high cost of data acquisition and processing restrict their practical use by stakeholders.
OBJECTIVE
To develop an automated data platform for gathering, processing, and transforming textual media stories into structured data that can support identification of new product-related child injury risks, development of research priorities, and modification of prevention policy and practice.
METHODS
The data platform was constructed through literature reviews and multi-round research group discussions. Components developed included standard search strategies, filtering criteria, textual document classification, information extraction standards, and a keyword dictionary. Ten thousand manually labelled media stories were used to validate the textual document classification model, which was established using Bidirectional Encoder Representations from Transformers (BERT). Multiple information extraction methods, all based on natural language processing algorithms, were adopted to extract data for 29 structured variables from media stories. These methods were evaluated through manual validation of 1,000 media stories about product-related child injury. We mapped the geographic distribution of media sources and media-reported product-related child injury events.
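The evaluation of the information extraction methods against manually validated stories can be illustrated with a minimal sketch. It assumes that accuracy for a variable is the share of stories in which the extracted value exactly matches the manual label; the abstract does not state the matching rule. The function names, variable names, and toy records below are hypothetical and are not taken from the platform.

```python
from typing import Dict, List, Optional

# One record per media story: variable name -> value (None if not extracted).
Record = Dict[str, Optional[str]]


def per_variable_accuracy(
    extracted: List[Record], manual: List[Record], variables: List[str]
) -> Dict[str, float]:
    """Share of stories where the extracted value equals the manual label.

    Assumption: exact-match scoring; a real evaluation may use a looser rule.
    """
    if len(extracted) != len(manual):
        raise ValueError("extracted and manual must cover the same stories")
    if not manual:
        raise ValueError("at least one validated story is required")
    accuracy = {}
    for var in variables:
        matches = sum(
            1 for e, m in zip(extracted, manual) if e.get(var) == m.get(var)
        )
        accuracy[var] = matches / len(manual)
    return accuracy


def variables_above(accuracy: Dict[str, float], threshold: float) -> List[str]:
    """Names of variables whose accuracy is strictly above the threshold."""
    return sorted(v for v, a in accuracy.items() if a > threshold)


# Hypothetical toy data: four stories, three variables.
manual = [
    {"product": "magnetic beads", "age": "3", "outcome": "surgery"},
    {"product": "scooter", "age": "8", "outcome": "fracture"},
    {"product": "scooter", "age": "10", "outcome": "bruise"},
    {"product": "magnetic beads", "age": "2", "outcome": "surgery"},
]
extracted = [
    {"product": "magnetic beads", "age": "3", "outcome": "surgery"},
    {"product": "scooter", "age": "8", "outcome": None},
    {"product": "scooter", "age": "12", "outcome": "bruise"},
    {"product": "magnetic beads", "age": "2", "outcome": "fracture"},
]
VARIABLES = ["product", "age", "outcome"]

acc = per_variable_accuracy(extracted, manual, VARIABLES)
print(acc)
print(variables_above(acc, 0.70))
```

Counting the variables returned by `variables_above` at a chosen threshold gives a summary of the kind "accuracy above the threshold for k of n variables".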
RESULTS
We developed an internet-based product-related child injury textual data platform, IPCITDP, that automatically collects, stores, and processes online media stories concerning product-related child injury in China every day. The IPCITDP is composed of four layers: data search and acquisition, data processing, data storage, and data application. External validation showed high performance for the BERT textual document classification model we established (accuracy = 0.9703) and the combined information extraction strategies (accuracy > 0.70 for 25 variables). As of December 31, 2022, the IPCITDP had collected 28,979 eligible product-related child injury reports from 9,935 news media websites or social media platform accounts that were geographically located in all 31 provinces of mainland China and covered over 97% of the prefecture-level cities. The product-related child injury cases collected by the IPCITDP were typically reported several months or years earlier than official announcements about the corresponding risks. The IPCITDP added data concerning 15 supplementary variables that are not covered by the national product-related injury surveillance system. Two examples demonstrate the value of the IPCITDP in providing supplementary data and early detection of emerging epidemiological signals concerning product-related child injury: one for child injury related to magnetic beads and the other for child injury related to electric self-balancing scooters.
CONCLUSIONS
The IPCITDP provides product-related child injury data that can support early detection of new product-related child injury characteristics in China and supplement existing data sources to reduce the burden of product-related injury among Chinese children.