COHEWL: Classifying and Measuring Semantic Coherence of Short Texts with Language Models

Published: 2024-08-30

Author:

Osmar de Oliveira Braz Junior¹, Renato Fileto²

Affiliation:

1. State University of Santa Catarina (UDESC)

2. Federal University of Santa Catarina (UFSC)

Abstract

Traditional text coherence models cannot detect incoherence caused by word misuse in single-sentence documents, because they focus on sentence ordering and the semantic similarity of neighboring sentences. This work investigates methods to classify and measure the semantic consistency of words in very short documents. First, we fine-tuned BERT for two tasks: detecting short documents that contain an incoherent word, and distinguishing original documents from versions in which a word was automatically changed by the BERT Masked Language Model (MLM). We also used BERT embeddings to calculate coherence measures. Then we prompted generative Large Language Models (LLMs) to classify and measure semantic coherence. The BERT-based classifiers achieved between 80% and 87.50% accuracy in classifying semantic coherence, depending on the language. They performed even better at distinguishing original documents from those with a changed word. However, coherence measures calculated from BERT embeddings did not discriminate well between coherent and incoherent documents, nor between original documents and their respective versions with a word automatically changed. In contrast, LLaMA, GPT, and Gemini outperformed BERT in semantic coherence classification on our corpus of short questions about data structures, in Portuguese and in English. They also generated semantic coherence measures that discriminate coherent from incoherent documents better than the measures based on BERT embeddings.
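
The abstract does not say how the embedding-based coherence measures are defined, so the following is only an illustrative sketch of one plausible measure of this kind, not the authors' method. It scores each word by its mean cosine similarity to the other words in the document and takes the lowest score as the document's coherence. The function names and the tiny 3-dimensional vectors are hypothetical stand-ins; a real experiment would use contextual embeddings produced by BERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_coherence_scores(embeddings):
    """For each word vector, the mean cosine similarity to every other word vector."""
    scores = []
    for i, u in enumerate(embeddings):
        others = [cosine(u, v) for j, v in enumerate(embeddings) if j != i]
        # A one-word document has nothing to compare against; treat it as coherent.
        scores.append(sum(others) / len(others) if others else 1.0)
    return scores


def document_coherence(embeddings):
    """Weakest-link measure (an assumption): the lowest per-word score."""
    return min(word_coherence_scores(embeddings))


# Hypothetical toy vectors standing in for BERT word embeddings.
coherent_doc = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.8, 0.1, 0.1]]
# Same document with the last word replaced by a semantically distant one.
incoherent_doc = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]

if __name__ == "__main__":
    print("coherent:  ", round(document_coherence(coherent_doc), 3))
    print("incoherent:", round(document_coherence(incoherent_doc), 3))
```

On hand-built vectors like these the odd word stands out clearly. The abstract reports that, on real data, measures based on BERT embeddings did not separate coherent from incoherent documents well, so a sketch like this should be read as a baseline idea rather than a recommended measure.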

Publisher

Springer Science and Business Media LLC

References: 58 articles.

1. Aletras, Nikolaos and Stevenson, Mark (2013) Evaluating topic coherence using distributional semantics. Association for Computational Linguistics, Potsdam, Germany, https://aclanthology.org/W13-0102/, 13--22, Proceedings of the 10th international conference on computational semantics (IWCS 2013)

2. Barzilay, Regina and Lee, Lillian (2004) Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. Association for Computational Linguistics, Boston, Massachusetts, USA, https://aclanthology.org/N04-1015, 113--120, May 2-7, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

3. Barzilay, Regina and Lapata, Mirella (2008) Modeling local coherence: An entity-based approach. Computational Linguistics 34(1): 1--34, https://doi.org/10.1162/coli.2008.34.1.1, ISSN 0891-2017, MIT Press

4. Bao, Mengjiao and Li, Jianxin and Zhang, Jian and Peng, Hao and Liu, Xudong (2019) Learning Semantic Coherence for Machine Generated Spam Text Detection. IEEE, Budapest, Hungary, https://doi.org/10.1109/IJCNN.2019.8852340, 1--8, 2019 International Joint Conference on Neural Networks (IJCNN)

5. De Beaugrande, Robert-Alain and Dressler, Wolfgang U. (1981) Introduction to Text Linguistics. Longman, London