Affiliation:
1. University of Waterloo, Canada
Abstract
Dense retrieval models using a transformer-based bi-encoder architecture have emerged as an active area of research. In this article, we focus on the task of monolingual retrieval in a variety of typologically diverse languages using such an architecture. Although recent work with multilingual transformers demonstrates that they exhibit strong cross-lingual generalization capabilities, there remain many open research questions, which we tackle here. Our study is organized as a “best practices” guide for training multilingual dense retrieval models, broken down into three main scenarios: when a multilingual transformer is available, but training data in the form of relevance judgments are not available in the language and domain of interest (“have model, no data”); when both models and training data are available (“have model and data”); and when training data are available but not models (“have data, no model”). In considering these scenarios, we gain a better understanding of the role of multi-stage fine-tuning, the strength of cross-lingual transfer under various conditions, the usefulness of out-of-language data, and the advantages of multilingual vs. monolingual transformers. Our recommendations offer a guide for practitioners building search applications, particularly for low-resource languages, and while our work leaves open a number of research questions, we provide a solid foundation for future work.
Funder
Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications,General Business, Management and Accounting,Information Systems
Reference67 articles.
1. Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, and Marta Villegas. 2021. Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4933–4946.
2. Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4623–4637.
3. Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021. XOR QA: Cross-lingual open-retrieval question answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 547–564.
4. Akari Asai, Xinyan Yu, Jungo Kasai, and Hanna Hajishirzi. 2021. One question answering model for many languages with cross-lingual dense passage retrieval. In Advances in Neural Information Processing Systems, Vol. 34. 7547–7560.
5. MS MARCO: A human generated MAchine reading comprehension dataset;Bajaj Payal;arXiv:1611.09268v3,2018
Cited by
5 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. A Knowledge Graph Embedding Model for Answering Factoid Entity Questions;ACM Transactions on Information Systems;2024-07-15
2. ArabicaQA: A Comprehensive Dataset for Arabic Question Answering;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
3. Steering Large Language Models for Cross-lingual Information Retrieval;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
4. Query in Your Tongue: Reinforce Large Language Models with Retrievers for Cross-lingual Search Generative Experience;Proceedings of the ACM Web Conference 2024;2024-05-13
5. CIRAL at FIRE 2023: Cross-Lingual Information Retrieval for African Languages;Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation;2023-12-15