Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Published: 2021-02 Volume: 9 Pages: 69-81
ISSN: 2307-387X
Container-title: Transactions of the Association for Computational Linguistics
Language: en

Author:

Luo Jiaming¹, Hartmann Frederik², Santus Enrico³, Barzilay Regina¹, Cao Yuan⁴

Affiliation:

1. CSAIL, MIT.

2. University of Konstanz.

3. Bayer.

4. Google Brain.

Abstract

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the favored position by the current scholarship.

Publisher

MIT Press - Journals

Link

https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00354

Cited by 6 articles.

1. EASA Expert Group: Science, Technology, Engineering, Mathematics in Arts and Culture (STEMAC);Proceedings of the European Academy of Sciences and Arts;2024-03-28

2. A Systematic Review of Computational Approaches to Deciphering Bronze Age Aegean and Cypriot Scripts;Computational Linguistics;2024

3. A Generative Model for the Mycenaean Linear B Script and Its Application in Infilling Text from Ancient Tablets;Journal on Computing and Cultural Heritage;2023-08-09

4. Machine Learning for Ancient Languages: A Survey;Computational Linguistics;2023

5. Restoring and attributing ancient texts using deep neural networks;Nature;2022-03-09