Comparing neural sentence encoders for topic segmentation across domains: not your typical text similarity task

Published: 2023-11-03 Volume: 9 Page: e1593
ISSN: 2376-5992
Container-title: PeerJ Computer Science
Language: en

Author:

Ghinassi Iacopo¹, Wang Lin¹, Newell Chris², Purver Matthew¹,³

Affiliation:

1. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, United Kingdom

2. BBC R&D, London, United Kingdom

3. Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Neural sentence encoders (NSE) are effective in many NLP tasks, including topic segmentation. However, no systematic comparison of their performance in topic segmentation has been performed. Here, we present such a comparison, using supervised and unsupervised segmentation models based on NSEs. We first compare results with baselines, showing that the use of NSEs often provides improvements, except in specific domains such as news shows. We then compare, over three different datasets, a range of existing NSEs and a new NSE based on an ad hoc pre-training strategy. We show that the general performance gains documented for NSEs in the existing literature do not always carry over to topic segmentation. Although Transformer-based encoders do improve over previous approaches, fine-tuning on sentence similarity tasks, or even on the same topic segmentation task we aim to solve, does not always yield better performance, as results vary across the methods used and the domains of application. We aim to explain this phenomenon, and the relatively poor performance of NSEs on news shows, by considering how well different NSEs encode the underlying lexical cohesion of same-topic segments; to do so, we introduce a new metric, ARP. The results of this study suggest that good topic segmentation does not always rely on good cohesion modelling by the segmenter, and that this depends on the kind of text being segmented. It is also evident that traditional sentence encoders fail to create topically cohesive clusters of segments when used on conversational data. Overall, this work advances our understanding of the use of NSEs in topic segmentation and of the general factors determining the success (or failure) of a topic segmentation system. The newly proposed metric can quantify the lexical cohesion of a multi-topic document under different sentence encoders and, as such, might have many uses in future research, some of which we suggest in our conclusions.

Funder

Slovenian Research Agency via research core funding for the programme Knowledge Technologies

UK EPSRC via the projects Sodestream and ARCIDUCA

Publisher

PeerJ

Subject

General Computer Science

Link

https://peerj.com/articles/cs-1593.pdf

Cited by 1 article.

1. Applications of Large Language Models in Pathology;Bioengineering;2024-03-31