Extended ProMap datasets for product mapping
-
Published:2024-08-22
-
ISSN:1389-5753
-
Container-title:Electronic Commerce Research
-
language:en
-
Short-container-title:Electron Commer Res
Author:
Macková Kateřina, Pilát Martin
Abstract
Product mapping, or product matching, is the field of research dedicated to identifying which product listings (including names, descriptions, specifications, images, and other information) from different e-shops refer to the same product. It is an important data integration task, since it processes data that originate from different sources and have different structures. In our previous work, we created the basic ProMapEn and ProMapCz datasets for product mapping in English and Czech. The main advantage of the ProMap datasets over existing product mapping datasets is that they contain different types of non-matches, distinguished by the similarity of the two products. In this paper, we extend the previous two datasets into a completely new collection of datasets for generalized product mapping in the Czech and English languages. We publish these datasets freely for other researchers working on product mapping in e-commerce. The main contributions are the extension of the ProMap datasets with a new class of non-matching products, the introduction of new ProMapMulti datasets of product pairs from multiple English e-shops, and the introduction of ProMapTransl datasets, obtained by translating the Czech datasets to English and vice versa. Moreover, we provide a very detailed analysis of these datasets, with several experiments based on neural network techniques that compare different text preprocessing methods and similarity computation methods. We also compare the differences among several product categories and evaluate state-of-the-art product mapping methods on these datasets. In addition, we include generalized entity matching techniques and compare their behavior on the product mapping datasets, which fall within this area. Finally, we include an appendix with a number of other basic experiments, such as an analysis of feature importances.
Funder
Charles University (Univerzita Karlova v Praze)
Publisher
Springer Science and Business Media LLC