Abstract
This paper presents an approach to compare and classify books written in the Polish language by comparing their lexis fields. Books can be classified by their features, such as literature type, literary genre, style, author, etc. Using a preassembled dictionary and Jaccard index, we managed to prove a compact hypothesis concerning similar books. Further analysis with the PAM clustering algorithm presented a lexical connection between books of the same type or author. Overall static behaviour of similarities of any particular field on one side and some anomalous tendencies in other cases suggest that recognition of other features is possible. The method presented in this article allows drawing conclusions regarding the connection between any arbitrary books based solely on their vocabulary.
Subject
Polymers and Plastics,General Environmental Science
Reference13 articles.
1. R. Singh, S. Singh, Text Similarity Measures in News Articles by Vector Space Model Using NLP, Journal of The Institution of Engineers (India): Series B 102 (2021) 329–338.
2. A. Huang, Similarity Measures for Text Document Clustering, Proceedings of the Sixth New Zealand Computer Science Research Student Conference 4 (2008) 49–56.
3. M. B. Magara, S. O. Ojo, T. Zuva, A Comparative Analysis of Text Similarity Measures and Algorithms in Research Paper Recommender Systems, 2018 Conference on Information Communications Technology and Society (2018) 1–5.
4. A. W. Qurashi, V. Holmes, A. P. Johnson, Document Processing: Methods for Semantic Text Similarity Analysis, In 2020 International Conference on INnovations in Intelligent SysTems and Applications (2020) 1–6.
5. W. H. Gomaa, A. A. Fahmy, A Survey of Text Similarity Approaches, International Journal of Computer Applications 68 (2013) 13–18.