Toward Best Practices for Training Multilingual Dense Retrieval Models-Reference-Cited by-同舟云学术

Toward Best Practices for Training Multilingual Dense Retrieval Models

Published:2023-09-27 Issue:2 Volume:42 Page:1-33
ISSN:1046-8188
Container-title:ACM Transactions on Information Systems
language:en
Short-container-title:ACM Trans. Inf. Syst.

Author:

Zhang Xinyu¹^ORCID,Ogueji Kelechi¹^ORCID,Ma Xueguang¹^ORCID,Lin Jimmy¹^ORCID

Affiliation:

1. University of Waterloo, Canada

Abstract

Dense retrieval models using a transformer-based bi-encoder architecture have emerged as an active area of research. In this article, we focus on the task of monolingual retrieval in a variety of typologically diverse languages using such an architecture. Although recent work with multilingual transformers demonstrates that they exhibit strong cross-lingual generalization capabilities, there remain many open research questions, which we tackle here. Our study is organized as a “best practices” guide for training multilingual dense retrieval models, broken down into three main scenarios: when a multilingual transformer is available, but training data in the form of relevance judgments are not available in the language and domain of interest (“have model, no data”); when both models and training data are available (“have model and data”); and when training data are available but not models (“have data, no model”). In considering these scenarios, we gain a better understanding of the role of multi-stage fine-tuning, the strength of cross-lingual transfer under various conditions, the usefulness of out-of-language data, and the advantages of multilingual vs. monolingual transformers. Our recommendations offer a guide for practitioners building search applications, particularly for low-resource languages, and while our work leaves open a number of research questions, we provide a solid foundation for future work.

Funder

Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications,General Business, Management and Accounting,Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3613447

Reference67 articles.

1. Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, and Marta Villegas. 2021. Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4933–4946.

2. Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4623–4637.

3. Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021. XOR QA: Cross-lingual open-retrieval question answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 547–564.

4. Akari Asai, Xinyan Yu, Jungo Kasai, and Hanna Hajishirzi. 2021. One question answering model for many languages with cross-lingual dense passage retrieval. In Advances in Neural Information Processing Systems, Vol. 34. 7547–7560.

5. MS MARCO: A human generated MAchine reading comprehension dataset;Bajaj Payal;arXiv:1611.09268v3,2018