Deep Multilabel Multilingual Document Learning for Cross-Lingual Document Retrieval

Authors

Feng Kai, Huang Lan, Xu Hao, Wang Kangping, Wei Wei, Zhang Rui

Abstract

Cross-lingual document retrieval, which takes a query in one language and retrieves relevant documents in another, has attracted strong research interest in recent decades. Most studies on this task start with cross-lingual comparisons at the word level and then represent documents via word embeddings, which captures too little structural information. In this work, the cross-lingual comparison is made at the document level through a cross-lingual semantic space. Our method, MDL (deep multilabel multilingual document learning), uses a six-layer fully connected network to project cross-lingual documents into a shared semantic space. Once the documents are transformed into embeddings in this space, the semantic distances between them can be calculated. The supervision signals are extracted automatically from the data and then used to construct the semantic space via a linear classifier. This avoids the ambiguity of manual labels and yields multilabel supervision signals instead of a single label. The multilabel supervision signals enrich the representation of the semantic space, which improves the discriminative ability of the embeddings. MDL is easy to extend to other fields because it does not depend on specific data. Furthermore, MDL is more efficient than models that train all languages jointly, because each language is trained individually. Experiments on Wikipedia data showed that the proposed method outperforms the state-of-the-art cross-lingual document retrieval methods.
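The pipeline described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the abstract states only that each language has a six-layer fully connected network, that a linear classifier supplies multilabel supervision, and that distances are computed in the shared space. The layer sizes, the tanh activation, the sigmoid outputs, the cosine distance, the shared label set, the random (untrained) weights, and all function names below are assumptions made for illustration.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x):
    """Affine map W x + b."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def make_encoder(n_in, hidden, n_embed, rng):
    """Six fully connected layers: five hidden layers plus the embedding layer."""
    sizes = [n_in] + [hidden] * 5 + [n_embed]
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def encode(encoder, x):
    """Project a document vector into the semantic space (tanh is assumed)."""
    for layer in encoder:
        x = [math.tanh(v) for v in dense(layer, x)]
    return x


def label_probabilities(classifier, embedding):
    """Linear classifier with one independent sigmoid per label (multilabel)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(classifier, embedding)]


def cosine_distance(a, b):
    """Assumed distance measure between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


rng = random.Random(0)
# One encoder per language; input sizes differ because vocabularies differ.
encoder_en = make_encoder(n_in=8, hidden=16, n_embed=4, rng=rng)
encoder_de = make_encoder(n_in=10, hidden=16, n_embed=4, rng=rng)
# Hypothetical label set of size 3, shared so that embeddings are comparable.
classifier = make_layer(4, 3, rng)

doc_en = [1.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0]             # toy term counts
doc_de = [0.0, 1.0, 0.0, 0.0, 3.0, 0.0, 1.0, 0.0, 0.0, 1.0]   # toy term counts

emb_en = encode(encoder_en, doc_en)
emb_de = encode(encoder_de, doc_de)
probs = label_probabilities(classifier, emb_en)
dist = cosine_distance(emb_en, emb_de)
```

In a trained system, each encoder would be fitted separately against the multilabel targets (for example with a per-label binary cross-entropy loss, which is also an assumption here), and retrieval would rank candidate documents by `dist` to the query embedding.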

Funder

Lan Huang

Publisher

MDPI AG

Subject

General Physics and Astronomy

