Pipeline and dataset generation for automated fact-checking in almost any language

Published: 2024-08-02
ISSN: 0941-0643
Container-title: Neural Computing and Applications
Language: en
Short-container-title: Neural Comput & Applic

Author:

Drchal Jan, Ullrich Herbert, Mlynář Tomáš, Moravec Václav

Abstract

This article presents a pipeline for automated fact-checking leveraging publicly available language models and data. The objective is to assess the accuracy of textual claims using evidence from a ground-truth evidence corpus. The pipeline consists of two main modules: evidence retrieval and claim veracity evaluation. Our primary focus is on the ease of deployment in various languages that remain unexplored in the field of automated fact-checking. Unlike most similar pipelines, which work with evidence sentences, our pipeline processes data on a paragraph level, simplifying the overall architecture and data requirements. Given the high cost of annotating language-specific fact-checking training data, our solution builds on the question answering for claim generation method, which we adapt and use to generate the data for all models of the pipeline. Our strategy enables the introduction of new languages through machine translation of only two fixed datasets of moderate size. Subsequently, any number of training samples can be generated based on an evidence corpus in the target language. We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines, as well as to our codebase that may be used to reproduce the results. We comprehensively evaluate the pipelines for all four languages, including human annotations and per-sample difficulty assessment using Pointwise $\mathcal{V}$-information. The presented experiments are based on full Wikipedia snapshots to promote reproducibility. To facilitate implementation and user interaction, we develop the FactSearch application featuring the proposed pipeline and present preliminary feedback on its performance.

Funder

Technology Agency of the Czech Republic

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s00521-024-10113-5.pdf

References (86 articles)

1. Guo Z, Schlichtkrull M, Vlachos A (2022) A survey on automated fact-checking. Trans Assoc Comput Linguist 10:178–206. https://doi.org/10.1162/tacl_a_00454

2. Nørregaard J, Derczynski L (2021) DanFEVER: claim verification dataset for Danish. In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 422–428. Linköping University Electronic Press, Sweden, Reykjavik, Iceland. https://aclanthology.org/2021.nodalida-main.47

3. Ullrich H, Drchal J, Rýpar M, Vincourová H, Moravec V (2023) CsFEVER and CTKFacts: acquiring Czech data for fact verification. Lang Resour Eval 1–35

4. Pan L, Chen W, Xiong W, Kan M-Y, Wang WY (2021) Zero-shot fact verification by claim generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 476–483

5. Chen J, Kim G, Sriram A, Durrett G, Choi E (2023) Complex claim verification with evidence retrieved in the wild. arXiv preprint arXiv:2305.11859