Affiliation:
1. Carnegie Mellon University, USA
2. NEC Laboratories Europe, Germany
3. Center for Advanced Interdisciplinary Research, Ss. Cyril and Methodius University of Skopje, North Macedonia
Abstract
Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model (LLM) can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLM post-correction). We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.
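To make the "during clustering" stage concrete, the sketch below illustrates one common form of LLM-provided constraints: pairwise same-cluster queries merged via union-find. This is an illustrative toy, not the paper's implementation; `llm_same_cluster` is a hypothetical stand-in for an actual LLM call, and the keyword heuristic inside it exists only so the example runs offline.

```python
# Illustrative sketch (assumptions: llm_same_cluster stands in for a real LLM
# prompt such as "Do these two texts belong to the same cluster?"; the keyword
# table is a toy heuristic, not part of the paper's method).
from itertools import combinations

def llm_same_cluster(a: str, b: str) -> bool:
    # Stand-in oracle: in practice this would be an LLM API call.
    topics = {"nba": "sports", "goal": "sports",
              "senate": "politics", "vote": "politics"}
    def topic(t):
        return next((v for k, v in topics.items() if k in t.lower()), "other")
    return topic(a) == topic(b)

def cluster_with_constraints(texts, budget):
    """Merge texts under must-link constraints from a limited query budget."""
    parent = list(range(len(texts)))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    queried = 0
    for i, j in combinations(range(len(texts)), 2):
        if queried >= budget:  # the budget models query-efficiency
            break
        queried += 1
        if llm_same_cluster(texts[i], texts[j]):
            parent[find(i)] = find(j)  # must-link: merge the two groups

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

The `budget` parameter captures the cost/accuracy trade-off the abstract mentions: fewer queries are cheaper but leave more pairs unconstrained, so clusters stay more fragmented.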
Cited by 6 articles.