A Study of Contrastive Learning Algorithms for Sentence Representation Based on Simple Data Augmentation
Published: 2023-09-08
Volume: 13
Issue: 18
Page: 10120
ISSN: 2076-3417
Container-title: Applied Sciences
Short-container-title: Applied Sciences
Language: en
Author:
Liu Xiaodong 1, Gong Wenyin 1, Li Yuxin 1, Li Yanchi 1, Li Xiang 1
Affiliation:
1. School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430079, China
Abstract
In the era of deep learning, representational text-matching algorithms based on BERT and its variant models have become mainstream, but they are limited by the quality of the sentence vectors that the BERT model generates; the SimCSE algorithm, proposed in 2021, improved sentence-vector quality to a certain extent. This paper addresses a weakness of the SimCSE algorithm: the greater the difference in length between two sentences, the lower the probability that the model judges them similar. To that end, an EdaCSE algorithm is proposed that perturbs sentence length with a simple data-augmentation method while leaving sentence semantics unchanged. The perturbation adds meaningless English punctuation marks to the original sentence, so that the model no longer tends to recognise sentences of similar length as similar sentences. Experiments were conducted on five datasets using the BERT series of models, and they show that EdaCSE improves the average score over the five datasets by 1.67, 0.84, and 1.08, respectively.
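The augmentation described above can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the authors' implementation: the function name, the punctuation set, and the number of insertions are all hypothetical. The idea is that inserting semantically meaningless punctuation at token boundaries changes a sentence's length while preserving its words, yielding a positive pair of different lengths for contrastive training.

```python
import random

# Punctuation marks assumed to be semantically meaningless insertions
# (the exact set used in the paper is an assumption here).
PUNCT = [",", ".", ";", ":", "!", "?"]

def perturb_length(sentence, n_insertions=3, seed=None):
    """Return a copy of `sentence` with `n_insertions` punctuation marks
    inserted at random token boundaries, perturbing its length without
    changing its content words."""
    rng = random.Random(seed)
    tokens = sentence.split()
    for _ in range(n_insertions):
        pos = rng.randint(0, len(tokens))  # any gap, including ends
        tokens.insert(pos, rng.choice(PUNCT))
    return " ".join(tokens)
```

The perturbed sentence would then serve as the positive example paired with the original in the contrastive objective, so that sentence pairs of different lengths are explicitly trained to be similar.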
Funder
research on adaptive integrated evolutionary algorithm for multi-root solution of complex nonlinear equations
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science
References (21 articles)
1. Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv.
2. Gao, T., Yao, X., and Chen, D. (2021, January 7–11). SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Association for Computational Linguistics (ACL), Punta Cana, Dominican Republic.
3. Yan, Y., Li, R., Wang, S., Zhang, F., Wu, W., and Xu, W. (2021, January 1–6). ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event.
4. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020, January 13–18). A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Machine Learning, PMLR, Virtual.
5. Wu, X., Gao, C., Zang, L., Han, J., Wang, Z., and Hu, S. (2022, January 12–17). ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding. Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea.