Affiliation:
1. Ailamazyan Program Systems Institute of RAS
Abstract
The choice of search tools for hidden commonality in the data of a new nature requires stable and reproducible comparative assessments of the quality of abstract algorithms for the proximity of symbol strings. Conventional estimates based on artificially generated or manually labeled tests vary significantly, rather evaluating the method of this artificial generation with respect to similarity algorithms, and estimates based on user data cannot be accurately reproduced.
A simple, transparent, objective and reproducible numerical quality assessment of a string metric. Parallel texts of book translations in different languages are used. The quality of a measure is estimated by the percentage of errors in possible different tries of determining the translation of a given paragraph among two paragraphs of a book in another language, one of which is actually a translation. The stability of assessments is verified by independence from the choice of a book and a pair of languages.
The numerical experiment steadily ranked by quality algorithms for abstract character string comparisons and showed a strong dependence on the choice of normalization.
Publisher
Ailamazyan Program Systems Institute of Russian Academy of Sciences (PSI RAS)