Analyzing Indo-European Language Similarities Using Document Vectors

Published:2023-09-26 Issue:4 Volume:10 Page:76
ISSN:2227-9709
Container-title:Informatics
language:en
Short-container-title:Informatics

Author:

Schrader Samuel R.¹, Gultepe Eren¹

Affiliation:

1. Department of Computer Science, Southern Illinois University Edwardsville, Edwardsville, IL 62026, USA

Abstract

The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing 4 language families: Indo-Iranian, Slavic, Germanic, and Romance. This text corpus consists of a set of 532,092 Bible verses, with 24,186 identical verses translated into each language. The methods are (A) hierarchical clustering using distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods using a ground-truth tree and language families derived from said tree. All three achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). This shows that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend language similarity and other translation studies research.
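Method (A) can be illustrated with a minimal sketch. The abstract does not state the distance metric or linkage criterion, so cosine distance and average linkage are assumptions here, and the two-dimensional vectors and the names `toy_vectors`, `centroid`, `cosine_distance`, and `average_linkage` are hypothetical stand-ins for the trained document vectors and the authors' actual pipeline.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed metric; the abstract does not name one)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(names, dist):
    """Agglomerative clustering with average linkage (assumed criterion).

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is enough to reconstruct a tree.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Mean pairwise distance between members of the two clusters.
            d = sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical toy data: two "verse" vectors per language.
toy_vectors = {
    "es": [[1.0, 0.1], [0.9, 0.2]],
    "it": [[0.9, 0.1], [1.0, 0.3]],
    "de": [[0.1, 1.0], [0.2, 0.9]],
    "nl": [[0.2, 1.0], [0.1, 0.8]],
}

centroids = {lang: centroid(vecs) for lang, vecs in toy_vectors.items()}
dist = {
    a: {b: cosine_distance(centroids[a], centroids[b]) for b in centroids}
    for a in centroids
}
merges = average_linkage(list(centroids), dist)

for left, right, d in merges:
    print(left, "+", right, f"at distance {d:.4f}")
```

On this toy input the two closest pairs merge first and the resulting two groups join last, which mirrors how a tree would be read off the centroid distances. A real run would replace `toy_vectors` with the per-language document vectors.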

Publisher

MDPI AG

Subject

Computer Networks and Communications, Human-Computer Interaction, Communication

Link

https://www.mdpi.com/2227-9709/10/4/76/pdf
