Breaking news: Unveiling a new dataset for Portuguese news classification and comparative analysis of approaches

Published: 2024-01-26 Issue: 1 Volume: 19 Page: e0296929
ISSN: 1932-6203
Container-title: PLOS ONE
Language: en
Short-container-title: PLoS ONE

Author:

Garcia Klaifer, Shiguihara Pedro, Berton Lilian

Abstract

Every day, thousands of news articles are published on the web, and filtering tools can be used to extract knowledge on specific topics. The categorization of news into a predefined set of topics is widely studied in the literature; however, most works are restricted to documents in English. In this work, we make two contributions. First, we introduce a Portuguese news dataset collected from WikiNews, an open-source media outlet that provides news from different sources. Since datasets for Portuguese are scarce, and an existing one comes from a single news channel, we aim to introduce a dataset drawn from different news channels. The availability of comprehensive datasets plays a key role in advancing research. Second, we compare different architectures for Portuguese news classification, exploring different text representations (BoW, TF-IDF, Embedding) and classification techniques (SVM, CNN, DJINN, BERT), covering classical methods and current technologies. We show the trade-off between accuracy and training time for this application. We aim to show the capabilities of available algorithms and the challenges faced in the area.
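
The two simplest text representations named in the abstract, BoW and TF-IDF, can be illustrated with a minimal sketch. This is not the authors' pipeline: the three headlines are invented for illustration and are not taken from the dataset, whitespace tokenization is an assumption, and the weighting used here (relative term frequency times ln(N / df)) is only one common TF-IDF variant. The function names `bag_of_words` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Toy Portuguese headlines, invented for illustration (not from the paper's dataset).
docs = [
    "governo anuncia novo plano de economia",
    "time vence final do campeonato",
    "economia cresce e governo comemora",
]

def bag_of_words(doc):
    """Raw term counts for one document (BoW), using whitespace tokenization."""
    return Counter(doc.split())

def tf_idf(corpus):
    """TF-IDF weights per document: (count / doc length) * ln(N / df)."""
    counts = [bag_of_words(d) for d in corpus]
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (n / total) * math.log(n_docs / df[t]) for t, n in c.items()}
        )
    return weights

bow = [bag_of_words(d) for d in docs]
weights = tf_idf(docs)

# Terms shared across documents ("governo") get a lower weight than
# terms that appear in only one document ("time").
print(weights[0]["governo"], weights[1]["time"])
```

Either representation yields a per-document vector that a classifier such as an SVM could consume; the embedding-based and neural approaches mentioned in the abstract are not covered by this sketch.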

Funder

Universidad San Ignacio de Loyola

Conselho Nacional de Desenvolvimento Científico e Tecnológico

Publisher

Public Library of Science (PLoS)
