Abstract
Cross-lingual document retrieval, which uses a query in one language to retrieve relevant documents in another, has attracted strong research interest in recent decades. Most studies on this task start with cross-lingual comparisons at the word level and then represent documents via word embeddings, which captures too little structural information. In this work, cross-lingual comparison is performed at the document level through a cross-lingual semantic space. Our method, MDL (deep multilabel multilingual document learning), uses a six-layer fully connected network to project cross-lingual documents into a shared semantic space. Once the documents are transformed into embeddings in this space, the semantic distances between them can be calculated. The supervision signals are extracted automatically from the data and then used to construct the semantic space via a linear classifier. This avoids the ambiguity of manual labels and yields multilabel supervision signals instead of a single label. The multilabel supervision signals enrich the representation of the semantic space, which improves the discriminative ability of the embeddings. MDL is easy to extend to other fields since it does not depend on specific data. Furthermore, MDL is more efficient than models that train all languages jointly, since each language is trained individually. Experiments on Wikipedia data show that the proposed method outperforms state-of-the-art cross-lingual document retrieval methods.
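The retrieval flow described in the abstract (one six-layer fully connected network per language, embeddings in a shared space, ranking by semantic distance) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the input features, the activation function, the layer widths, or the distance measure, so the fixed-length input vectors, ReLU, the dimensions, cosine distance, and every function name below (`init_mlp`, `embed`, `cosine_distance`, `retrieve`) are assumptions made for illustration. The weights are random and untrained, so only the data flow is meaningful, not the ranking quality.

```python
import math
import random


def init_mlp(in_dim, hidden_dim, out_dim, n_layers=6, seed=0):
    """Create random (untrained) weights for a fully connected network.

    The six-layer depth follows the abstract; all sizes are assumed.
    """
    rng = random.Random(seed)
    dims = [in_dim] + [hidden_dim] * (n_layers - 1) + [out_dim]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        scale = 1.0 / math.sqrt(d_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(d_out)]
                   for _ in range(d_in)]
        bias = [0.0] * d_out
        layers.append((weights, bias))
    return layers


def embed(layers, x):
    """Project a document feature vector into the shared semantic space."""
    h = list(x)
    for i, (weights, bias) in enumerate(layers):
        h = [sum(h[r] * weights[r][c] for r in range(len(h))) + bias[c]
             for c in range(len(bias))]
        if i < len(layers) - 1:
            # ReLU on hidden layers (assumed; not stated in the abstract).
            h = [max(v, 0.0) for v in h]
    return h


def cosine_distance(a, b):
    """Cosine distance, used here as an assumed semantic distance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def retrieve(query_emb, doc_embs, k=1):
    """Return indices of the k documents closest to the query embedding."""
    ranked = sorted(range(len(doc_embs)),
                    key=lambda i: cosine_distance(query_emb, doc_embs[i]))
    return ranked[:k]


# One network per language, each of which would be trained separately.
source_net = init_mlp(in_dim=8, hidden_dim=16, out_dim=4, seed=1)
target_net = init_mlp(in_dim=8, hidden_dim=16, out_dim=4, seed=2)

# Hypothetical document feature vectors for the query and three candidates.
query_features = [0.1 * i for i in range(8)]
candidate_features = [[0.05 * (i + j) for i in range(8)] for j in range(3)]

query_emb = embed(source_net, query_features)
candidate_embs = [embed(target_net, f) for f in candidate_features]
top_hits = retrieve(query_emb, candidate_embs, k=2)
```

Keeping a separate network per language mirrors the abstract's point that each language is trained individually; in a trained system the two networks would be tied together only through the shared label space used as supervision.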
Subject
General Physics and Astronomy