Statistical evaluation of the information content of attributes for the task of searching for semantically close sentences

Published: 2020-01 Issue: 1 Pages: 8-17
ISSN: 2454-0714
Container-title: Программные системы и вычислительные методы
Language: en

Author:

Glazkova Anna Valer'evna

Abstract

The paper presents the results of evaluating the informativeness of quantitative and binary features for the task of finding semantically close sentences (paraphrases). Three types of features are considered: features built on vector representations of words (from the Word2Vec model), features based on extracting numbers and structured information, and features reflecting quantitative characteristics of the text. Two indicators of informativeness are used for binary features: the share of paraphrases among the examples that have the feature, and the share of paraphrases that have the feature. For quantitative features, estimates are obtained with the cumulative frequency method. The evaluation was carried out on the Russian paraphrase corpus. The feature set was tested as input for two machine learning models for detecting semantically close sentences: a support vector machine (SVM) and a recurrent neural network. The first model takes only the feature set as input; the second takes the text as sequences, with the feature set as an additional input. The models achieved 67.06% (F-measure) and 69.49% (accuracy) for the first, and 79.85% (F-measure) and 74.16% (accuracy) for the second. This result is comparable with the best systems presented at the 2017 paraphrase detection competition for the Russian language (second by F-measure, third by accuracy). The results can be used both in implementing models that search for semantically close fragments of natural-language text and in analyzing Russian-language paraphrases from the standpoint of computational linguistics.
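The two indicators for binary features can be illustrated with a minimal sketch. It is not the author's code: the function name `binary_feature_indicators` and the toy data are assumptions made for illustration, and the abstract does not specify how the shares were computed beyond their verbal description.

```python
def binary_feature_indicators(features, labels):
    """Return two shares for one binary feature (hypothetical helper).

    features: 0/1 values, 1 if the sentence pair has the feature
    labels:   0/1 values, 1 if the sentence pair is a paraphrase
    """
    with_feature = sum(1 for f in features if f)
    paraphrases = sum(1 for y in labels if y)
    both = sum(1 for f, y in zip(features, labels) if f and y)

    # Share of paraphrases among examples that have the feature.
    share_among_featured = both / with_feature if with_feature else 0.0
    # Share of all paraphrases that have the feature.
    share_of_paraphrases = both / paraphrases if paraphrases else 0.0
    return share_among_featured, share_of_paraphrases


# Toy data, invented for illustration only (not taken from the corpus).
toy_features = [1, 1, 0, 1, 0, 0, 0, 0]
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]

among_featured, of_paraphrases = binary_feature_indicators(toy_features, toy_labels)
print(f"paraphrases among examples with the feature: {among_featured:.2%}")
print(f"paraphrases that have the feature: {of_paraphrases:.2%}")
```

Under this reading, a feature is informative when the first share is well above the overall paraphrase rate of the corpus, and the second share shows how many paraphrases the feature actually covers.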

Publisher

Aurora Group, s.r.o
