Using Communities of Words Derived from Multilingual Word Vectors for Cross-Language Information Retrieval in Indian Languages

Published: 2019-01-08; Volume: 18; Issue: 1; Pages: 1-27
ISSN: 2375-4699
Container-title: ACM Transactions on Asian and Low-Resource Language Information Processing
Language: en
Short-container-title: ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Bhattacharya Paheli¹, Goyal Pawan¹, Sarkar Sudeshna¹

Affiliation:

1. Indian Institute of Technology Kharagpur, West Bengal, India

Abstract

We investigate the use of word embeddings for query translation to improve precision in cross-language information retrieval (CLIR). Word vectors represent words in a distributional space such that syntactically or semantically similar words are close to each other in this space. Multilingual word embeddings are constructed in such a way that similar words across languages have similar vector representations. We explore the effective use of bilingual and multilingual word embeddings learned from comparable corpora of Indic languages for the task of CLIR. We propose a clustering method based on the multilingual word vectors to group similar words across languages. For this, we construct a graph with words from multiple languages as nodes and with edges connecting words with similar vectors. We use the Louvain method for community detection to find communities in this graph. We show that choosing target language words as query translations from the clusters or communities containing the query terms improves CLIR. We also find that better-quality query translations are obtained when words from more languages are used for the clustering, even when the additional languages are neither the source nor the target languages. This is probably because having more similar words across multiple languages yields well-defined, dense subclusters that lead to precise query translations. In this article, we demonstrate the use of multilingual word embeddings and word clusters for CLIR involving Indic languages. We also make available a tool for obtaining related words and the visualizations of the multilingual word vectors for English, Hindi, Bengali, Marathi, Gujarati, and Tamil.
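
The pipeline described in the abstract (similarity graph over multilingual word vectors, community detection, translations picked from the query term's community) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors, the romanized words, the 0.8 similarity threshold, and all function names are hypothetical, and the community step implements only the local-moving phase of the Louvain method (the aggregation phase is omitted), which is enough for a toy graph.

```python
import math
from collections import defaultdict

# Toy multilingual vectors keyed by (language, word). All values are made up.
VECTORS = {
    ("en", "water"): (0.90, 0.10, 0.00),
    ("hi", "paani"): (0.85, 0.15, 0.05),
    ("bn", "jol"): (0.88, 0.05, 0.10),
    ("en", "book"): (0.05, 0.90, 0.10),
    ("hi", "kitaab"): (0.10, 0.85, 0.05),
    ("bn", "boi"): (0.00, 0.95, 0.10),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_graph(vectors, threshold):
    """Connect every pair of words whose cosine similarity meets the threshold."""
    nodes = sorted(vectors)
    adj = {node: {} for node in nodes}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim >= threshold:
                adj[a][b] = sim
                adj[b][a] = sim
    return adj


def local_moving(adj):
    """Local-moving phase of Louvain: move nodes while modularity improves."""
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    two_m = sum(degree.values())
    community = {n: i for i, n in enumerate(adj)}
    if two_m == 0:
        return community
    total = {community[n]: degree[n] for n in adj}  # summed degree per community
    improved = True
    while improved:
        improved = False
        for node in adj:
            current = community[node]
            total[current] -= degree[node]  # take the node out of its community
            links = defaultdict(float)  # edge weight from node to each community
            for nbr, weight in adj[node].items():
                links[community[nbr]] += weight
            best = current
            best_gain = links[current] - total[current] * degree[node] / two_m
            for comm, weight in links.items():
                gain = weight - total[comm] * degree[node] / two_m
                if gain > best_gain:
                    best, best_gain = comm, gain
            total[best] += degree[node]
            if best != current:
                community[node] = best
                improved = True
    return community


def translate(query, source_lang, target_lang, community):
    """Return target-language words that share a community with the query term."""
    comm = community.get((source_lang, query))
    if comm is None:
        return []
    return sorted(
        word
        for (lang, word), c in community.items()
        if lang == target_lang and c == comm
    )


graph = build_graph(VECTORS, threshold=0.8)
communities = local_moving(graph)
print(translate("water", "en", "hi", communities))
print(translate("book", "en", "bn", communities))
```

In this toy setup the Bengali words act as the "additional language" mentioned in the abstract: they are neither source nor target for an English-to-Hindi query, yet they add edges that tie each community together more densely.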

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3208358

References: 54 articles.

Cited by 9 articles.

1. Query Expansion Using Proposed Location-Based Algorithm for Hindi–English CLIR: Analyzing Three Test Collections; International Journal of Pattern Recognition and Artificial Intelligence; 2024-04

2. The Quality Evaluation Method of Sci-Tech English Translation for Intercultural Communication; Journal of Information & Knowledge Management; 2022-06-02

3. Research on Intelligent Retrieval Model of Multilingual Text Information in Corpus; Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering; 2022

4. Hatred and trolling detection transliteration framework using hierarchical LSTM in code-mixed social media text; Complex & Intelligent Systems; 2021-08-17

5. The construction of digital multimedia image information retrieval model based on visual communication; 2021 2nd International Conference on Artificial Intelligence and Information Systems; 2021-05-28