Affiliations:
1. JSC Tilde IT, Jasinskio Str. 12, LT-01112 Vilnius, Lithuania
2. Department of Applied Informatics, Vytautas Magnus University, Universiteto Str. 10, Akademija, LT-53361 Kaunas, Lithuania
3. JSC Novian Pro, Gynėjų Str. 14, LT-01109 Vilnius, Lithuania
Abstract
This study addresses challenges in media monitoring by enhancing closed-set topic classification in multilingual contexts (where both training and testing occur in several languages) and crosslingual contexts (where training is in English and testing spans all languages). To this end, we used a dataset from the European Media Monitoring webpage comprising approximately 15,000 article titles across 18 topics in 58 different languages, spanning a period of nine months from May 2022 to March 2023. We conducted a comprehensive comparative analysis of nine approaches, covering a spectrum of embedding techniques (word, sentence, and contextual representations) and classifiers (trainable/fine-tunable, memory-based, and generative). Our findings reveal that the LaBSE+FFNN approach achieved the best performance, reaching macro-averaged F1-scores of 0.944 ± 0.015 and 0.946 ± 0.019 in the multilingual and crosslingual scenarios, respectively. Because LaBSE+FFNN performs similarly in both scenarios, machine translation into English is unnecessary. We also tackled the open-set topic classification problem by training a binary classifier capable of distinguishing between known and new topics with an average loss of ∼0.0017 ± 0.0002. Various feature types were investigated, reaffirming the robustness of LaBSE vectorization. The experiments demonstrate that new topics can be identified with per-topic accuracies above ∼0.796 and an average accuracy of ∼0.9. Both the closed-set and open-set topic classification modules, along with additional mechanisms for clustering new topics to organize and label them, are integrated into our media monitoring system, which is now in use by a client.