Double-pass clustering technique for multilingual document collections

Published:2011-05-09 Issue:3 Volume:37 Page:304-321
ISSN:0165-5515
Container-title:Journal of Information Science
language:en
Short-container-title:Journal of Information Science

Author:

Kishida Kazuaki¹

Affiliation:

1. Keio University, Japan

Abstract

It is often necessary to automatically categorize multilingual document sets, which contain documents written in a variety of languages, into topically homogeneous subsets, for example when applying an automatic summarization system to multilingual news articles. However, there have been few studies on multilingual document clustering to date. In particular, it is not known whether clustering techniques are effective for medium- or large-scale multilingual document sets. For scalability, techniques should be based on dictionary-based translation and a single- or double-pass clustering algorithm. This article reports on experiments applying multilingual document clustering to medium-scale sets of English, French, German and Italian documents (Reuters news articles). The results show that the double-pass algorithm has a positive effect when each document is translated. On the other hand, the cluster translation strategy, in which clusters obtained by applying a clustering algorithm to each language's document set are translated, has almost no effect. Also, translation disambiguation techniques improve the effectiveness of clustering, but only slightly.
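To make the single- and double-pass idea concrete, here is a minimal sketch of a generic leader-style clustering with a second reassignment pass. It is not taken from the article: the function names (`cosine`, `single_pass`, `double_pass`), the cosine similarity measure, the summed-vector centroids and the `threshold` parameter are all assumptions made for illustration. Documents are assumed to be already translated into one language and represented as term-weight dictionaries.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold):
    """First pass: scan documents once, joining the most similar
    existing cluster or opening a new one (hypothetical helper)."""
    centroids = []  # each centroid is the summed vector of its members
    labels = []
    for vec in docs:
        best, best_sim = -1, 0.0
        for i, c in enumerate(centroids):
            s = cosine(vec, c)
            if s > best_sim:
                best, best_sim = i, s
        if best >= 0 and best_sim >= threshold:
            centroids[best].update(vec)  # Counter.update adds weights
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels, centroids


def double_pass(docs, threshold):
    """Second pass: keep the first-pass centroids fixed and reassign
    every document to its most similar centroid (assumed variant)."""
    _, centroids = single_pass(docs, threshold)
    labels = []
    for vec in docs:
        sims = [cosine(vec, c) for c in centroids]
        labels.append(max(range(len(centroids)), key=sims.__getitem__))
    return labels


docs = [
    {"oil": 2, "price": 1},
    {"football": 3},
    {"oil": 1, "price": 2},
    {"football": 1, "goal": 1},
]
labels = double_pass(docs, threshold=0.3)
```

The second pass matters because assignments made early in the first pass are based on centroids that are still incomplete; rescanning against the final centroids gives each document a chance to move. Each pass reads every document once, which is the property that makes this family of algorithms attractive for larger collections.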

Publisher

SAGE Publications

Subject

Library and Information Sciences, Information Systems

Link

http://journals.sagepub.com/doi/pdf/10.1177/0165551511404867

Cited by 3 articles.

1. Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering;International Journal of Pattern Recognition and Artificial Intelligence;2015-02-27

2. Cross-language patent matching via an international patent classification-based concept bridge;Journal of Information Science;2013-07-08

3. Probability-based text clustering algorithm by alternately repeating two operations;Journal of Information Science;2013-01-29