BACKGROUND
Many people seek health-related information online. The potential dangers of misinformation make the reliability of this information particularly important, yet discerning true and reliable information from false information has become increasingly challenging.
OBJECTIVE
In the present study, we introduce a novel approach to automating the fact-checking process that uses PubMed resources as a source of truth and employs natural language processing (NLP) transformer models to enhance the process.
METHODS
A total of 538 health-related webpages, covering seven different disease subjects, were manually selected by Factually Health Company. The process included the following steps: i) using a Bidirectional Encoder Representations from Transformers (BERT) model, the contents of the webpages were classified into three thematic categories: semiology, epidemiology, and management; ii) for each category in a webpage, a PubMed query was automatically produced using a combination of the “WellcomeBertMesh” and “KeyBERT” models; iii) the top 20 related articles were automatically extracted from PubMed; and finally, iv) the similarity measures of cosine similarity and Jaccard distance were applied to compare the content of the extracted literature with that of the webpages.
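The two similarity measures named in step iv can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names are hypothetical, plain Python lists stand in for sentence embeddings, and computing Jaccard distance over lowercase whitespace-separated tokens is an assumption, since the tokenization used is not stated here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def jaccard_distance(a, b):
    """One minus (size of intersection / size of union) of two token collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty sets are identical
    return 1.0 - len(set_a & set_b) / len(union)


def tokens(sentence):
    """Assumed tokenization: lowercase, split on whitespace."""
    return sentence.lower().split()
```

Cosine similarity ranges from -1 to 1 and grows with similarity, whereas Jaccard distance ranges from 0 to 1 and shrinks with similarity, so the two scores need to be put on a common footing before they are compared.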
RESULTS
The BERT model for categorizing webpage content performed well, with F1-scores and recall of 93% and 94% for semiology and epidemiology, respectively, and 96% for both recall and F1-score for management. For each of the three categories in a webpage, one PubMed query was generated, and for each query the 20 most related open-access articles within the category of systematic reviews and meta-analyses were extracted. Less than 10% of the extracted literature was irrelevant and was removed. On average, 23% of the sentences in each webpage were found to be very similar to the literature. Moreover, the evaluation showed that cosine similarity outperformed Jaccard distance when comparing sentences from webpages and academic papers vectorized by BERT. However, false positives among the retrieved sentences were a significant issue: some sentence pairs had a similarity score exceeding 80% yet could not be considered similar.
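A per-webpage figure such as the share of sentences that closely match the literature can be computed along the following lines. This is a hedged sketch only: the function name, the input layout, and the default threshold of 0.8 are assumptions made for illustration, and the toy scores in the usage example are not data from the study.

```python
def share_of_similar_sentences(similarity_rows, threshold=0.8):
    """Fraction of webpage sentences whose best match meets the threshold.

    similarity_rows[i][j] is the similarity score between webpage sentence i
    and literature sentence j (assumed layout for this sketch).
    """
    if not similarity_rows:
        return 0.0
    matched = sum(
        1 for row in similarity_rows if row and max(row) >= threshold
    )
    return matched / len(similarity_rows)


# Toy scores for a four-sentence webpage compared with two literature sentences.
toy_scores = [
    [0.90, 0.10],
    [0.20, 0.30],
    [0.50, 0.85],
    [0.10, 0.10],
]
share = share_of_similar_sentences(toy_scores)
```

Because a score above the threshold does not guarantee that two sentences actually say the same thing, a count like this is an upper bound on true matches and still calls for a manual check of the retrieved pairs.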
CONCLUSIONS
In the present research, we have proposed an approach to automate the fact-checking of health-related online information. Incorporating content from PubMed or other scientific article databases as trustworthy resources can automate the discovery of similarly credible information in the health domain.