Abstract
In this paper, we examine several methods of acquiring Czech data for automated fact-checking, a task commonly modeled as classification of textual claim veracity with respect to a corpus of trusted ground truths. We attempt to collect sets of data in the form of a factual claim, evidence within the ground-truth corpus, and a veracity label (supported, refuted, or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of the Wikipedia corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses, propose a future strategy for their mitigation, and publish the 127k resulting translations, as well as a version of the dataset reliably applicable to the Natural Language Inference task, CsFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, which is annotated using a corpus of 2.2 million articles of the Czech News Agency. We present an extended dataset annotation methodology based on the FEVER approach, and, as the underlying corpus is proprietary, we also publish a standalone version of the dataset for the task of Natural Language Inference, which we call CTKFactsNLI. We analyze both acquired datasets for spurious cues, i.e., annotation patterns leading to model overfitting. CTKFacts is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.
Funder
Technologická agentura České republiky
Czech Technical University in Prague
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences,Linguistics and Language,Education,Language and Linguistics
Cited by
1 article.