BACKGROUND
Globalization and environmental changes have increased the emergence and re-emergence of infectious diseases worldwide. Collaboration among regional infectious disease surveillance systems is critical but difficult to achieve because health information sharing systems differ in transparency from country to country. ProMED-mail is the most comprehensive expert-curated platform, providing rich information on outbreaks among humans, animals, and plants in different countries. However, because its reports consist of unstructured text, they are difficult to analyze for further applications. We therefore set out to summarize ProMED-mail alert articles automatically. In this research, we propose a text summarization method that uses natural language processing to automatically extract important sentences from ProMED-mail alert articles and generate summaries of dengue outbreaks in Southeast Asia. Our method can be used to capture crucial information quickly and to support decision-making in epidemic surveillance.
OBJECTIVE
To automatically generate summaries from the unstructured text of ProMED-mail reports.
METHODS
Our materials come from the ProMED-mail website and span the period from 1994 to 2019. The collected data were annotated by professionals to establish a unique Taiwan dengue corpus; the annotation achieved almost perfect agreement (Cohen’s kappa of 90%). To generate a ProMED-mail summary, we developed a dual-channel bidirectional long short-term memory network with an attention mechanism that infuses latent syntactic features to identify crucial sentences in the alert articles.
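As an illustration of the agreement statistic reported above, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators who label each sentence as important (1) or not (0). The function name `cohen_kappa` and the toy labels are assumptions made for this example; they are not taken from the study's corpus or code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' marginal label
    # frequencies, summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentence-level labels (1 = important sentence).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

In this toy case the annotators agree on 9 of 10 sentences, while chance agreement is 0.5, so kappa is lower than the raw agreement rate. Values above roughly 0.8 are conventionally described as almost perfect agreement.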
RESULTS
Our method is superior to many well-known machine learning and neural network approaches in identifying important sentences, achieving a macro-average F1-score of 93%. Moreover, the method successfully extracts key information about dengue fever outbreaks from ProMED-mail and helps researchers and public health practitioners capture important summaries quickly. In addition to verifying the model, we recruited five professional experts and five students from related fields to complete a satisfaction survey on the generated summaries. The results showed that 83.6% of the summaries received high satisfaction ratings.
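For readers unfamiliar with the metric, the sketch below shows one way a macro-average F1-score can be computed for sentence-level predictions: F1 is calculated separately for each class and the per-class scores are averaged with equal weight. The function name `macro_f1` and the toy labels are assumptions for illustration only and do not reproduce the study's evaluation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted labels (1 = important sentence).
gold = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(macro_f1(gold, pred), 4))
```

Because each class contributes equally, the macro average does not let the more frequent class (typically unimportant sentences) dominate the score.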
CONCLUSIONS
The proposed approach successfully fuses latent syntactic features into a deep neural network to analyze syntactic, semantic, and content information in the text. It then exploits the derived information to identify the crucial sentences in ProMED-mail. The experimental results show that the proposed method is effective and outperforms the compared methods. In addition, our method demonstrates the potential for summary generation from ProMED-mail. When a new alert article arrives, public health decision makers can quickly identify the outbreak information in a lengthy article and respond immediately with disease control and prevention measures.
CLINICALTRIAL
NA