Affiliation:
1. Martin Luther University Halle-Wittenberg, Institute of Computer Science, Halle (Saale), Germany
Abstract
In this paper, we present a method for paraphrase extraction in Ancient Greek that can be applied to huge text corpora in interactive humanities applications. (A shorter version of this paper appeared in German in the final report of the Digital Plato project, which was funded by the Volkswagen Foundation from 2016 to 2019 [35], [28].) Since lexical databases and POS tagging are either unavailable or do not achieve sufficient accuracy for ancient languages, our approach is based on pure word embeddings and the word mover’s distance (WMD) [20]. We show how to adapt the WMD approach to paraphrase search such that the expensive WMD computation has to be carried out for only a small fraction of the text segments contained in the corpus. Formally, the time complexity is reduced from
$\mathcal{O}(N \cdot K^{3} \cdot \log K)$ to $\mathcal{O}(N + K^{3} \cdot \log K)$
, compared to the brute-force approach, which computes the WMD between each text segment of the corpus and the search query. Here, N is the length of the corpus and K the size of its vocabulary. The method, which searches not only for paraphrases of the same length as the search query but also for paraphrases of varying lengths, was evaluated on the Thesaurus Linguae Graecae® (TLG®) [25]. The TLG consists of about
$75 \cdot 10^{6}$
Greek words. We searched the whole TLG for paraphrases of given passages of Plato. The experimental results show that, with only very few exceptions, our method and the brute-force approach propose the same text passages in the TLG as possible paraphrases. The computation times of our method are in a range that allows its application in interactive systems and lets humanities scholars work productively and smoothly.
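The general prune-then-verify idea behind such a speedup can be illustrated with a small sketch. The following is a generic illustration, not the authors' exact algorithm: it uses the standard relaxed-WMD lower bound (each query word moves to its nearest segment word) to skip candidate segments that cannot beat the best WMD found so far, so the expensive exact computation runs only for the survivors. The toy embeddings, vocabulary, and segment sizes are hypothetical, chosen purely for illustration.

```python
import itertools
import math

# Toy 2-D word embeddings (hypothetical values, for illustration only).
EMB = {
    "wise": (0.9, 0.1), "man": (0.2, 0.8), "speaks": (0.5, 0.5),
    "sage": (0.85, 0.15), "person": (0.25, 0.75), "talks": (0.45, 0.55),
    "dog": (-0.9, -0.4), "barks": (-0.5, -0.6), "loudly": (-0.2, -0.9),
}

def dist(u, v):
    """Euclidean distance between the embeddings of two words."""
    return math.dist(EMB[u], EMB[v])

def wmd_equal_length(a, b):
    """Exact WMD for equal-length segments with uniform word weights.
    In this special case optimal transport reduces to a minimum-cost
    perfect matching, which we brute-force for tiny segments."""
    n = len(a)
    best = min(sum(dist(a[i], b[p[i]]) for i in range(n))
               for p in itertools.permutations(range(n)))
    return best / n

def relaxed_lower_bound(a, b):
    """Relaxed WMD: each word in a moves to its nearest word in b.
    This is a cheap lower bound on the exact WMD between a and b."""
    return sum(min(dist(w, v) for v in b) for w in a) / len(a)

def search(query, segments):
    """Prune candidates with the cheap bound; verify survivors exactly."""
    best_seg, best_d = None, float("inf")
    for seg in segments:
        if relaxed_lower_bound(query, seg) >= best_d:
            continue  # cannot beat the current best: skip the exact WMD
        d = wmd_equal_length(query, seg)
        if d < best_d:
            best_seg, best_d = seg, d
    return best_seg, best_d
```

Because the relaxed bound never exceeds the exact WMD, the pruned search returns the same best segment as computing the exact WMD for every candidate, which mirrors the experimental finding that the fast method and the brute-force approach almost always agree.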
References (43 articles)
1. B. Agarwal, H. Ramampuaro, H. Langseth, and M. Ruocco. A Deep Network Model for Paraphrase Detection in Short Text Messages. In: Information Processing & Management, vol. 54, issue 6, pages 922–937, 2018.
2. I. Androutsopoulos and P. Malakasiotis. A Survey of Paraphrasing and Textual Entailment Methods. In: Journal of Artificial Intelligence Research, Vol. 38, pages 135–187, 2010.
3. K. Atasu, T. Parnell, C. Dünner, M. Sifalakis, H. Pozidis, V. Vasileiadis, M. Vlachos, C. Berrospi, and A. Labbi. Linear-Complexity Relaxed Word Mover’s Distance with GPU Acceleration. In: IEEE International Conference on Big Data (Big Data 2017), Boston, MA, USA, pages 889–896, December 11–14, 2017.
4. Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In: Journal of Machine Learning Research, Vol. 3, pages 1137–1155, 2003.
5. Y. Bizzoni, R. Del Gratta, F. Boschetti, and M. Reboul. Enhancing the Accuracy of Ancient Greek WordNet by Multilingual Distributional Semantics. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1140–1147, 2014.
Cited by: 1 article