Abstract
The paper presents the results of evaluating the informative value of quantitative and binary signs to solve the problem of finding semantically close sentences (paraphrases). Three types of signs are considered in the article: those built on vector representations of words (according to the Word2Vec model), based on the extraction of numbers and structured information and reflecting the quantitative characteristics of the text. As indicators of information content, the percentage of paraphrases among examples with a characteristic, and the percentage of paraphrases with a attribute (for binary characteristics), as well as estimates using the accumulated frequency method (for quantitative indicators) are used. The assessment was conducted on the Russian paraphrase corps. The set of features considered in the work was tested as input for two machine learning models for defining semantically close sentences: reference vector machines (SVMs) and a recurrent neural network model. The first model accepts only the considered set of signs as input parameters, the second - the text in the form of sequences and the set of signs as an additional input. The quality of the models was 67.06% (F-measure) and 69.49% (accuracy) and 79.85% (F-measure) and 74.16% (accuracy), respectively. The result obtained in the work is comparable with the best results of the systems presented in 2017 at the competition for the definition of paraphrase for the Russian language (the second result for the F-measure, the third result for accuracy). The results proposed in the work can be used both in the implementation of search models for semantically close fragments of texts in natural language, and for the analysis of Russian-language paraphrases from the point of view of computer linguistics.