Toward a shallow discourse parser for Turkish

Published: 2023-08-11
Page: 1-26
ISSN: 1351-3249
Container-title: Natural Language Engineering
Language: en
Short-container-title: Nat. Lang. Eng.

Author:

Kutlu Ferhat, Zeyrek Deniz, Kurfalı Murathan

Abstract

One of the most interesting aspects of natural language is how texts cohere, which involves the pragmatic or semantic relations that hold between clauses (addition, cause-effect, conditional, similarity), referred to as discourse relations. Identifying and classifying discourse relations is a pressing challenge for tasks such as text summarization, dialogue systems, and machine translation, which need information above the clause level. Despite the recent interest in discourse relations in well-studied languages such as English, data and experiments are still needed for typologically different and less-resourced languages. We report the most comprehensive investigation of shallow discourse parsing in Turkish, focusing on two main sub-tasks: identification of discourse relation realization types and sense classification of explicit and implicit relations. Our approach fine-tunes a pre-trained language model (BERT) as an encoder and classifies the encoded data with neural network-based classifiers. We first identify the discourse relation realization type that holds in a given text, if any. We then classify the senses of the identified explicit and implicit relations. In addition to in-domain experiments on a held-out test set from the Turkish Discourse Bank (TDB 1.2), we report the out-of-domain performance of our models on the Turkish part of the TED Multilingual Discourse Bank in order to evaluate their generalization abilities. Finally, we explore the effect of multilingual data aggregation on the classification of relation realization type through a cross-lingual experiment. The results suggest that our models perform relatively well despite the limited size of the TDB 1.2 and that there are language-specific aspects of detecting the types of discourse relation realization.
We believe that the findings are important both in providing insights into the performance of modern language models in a typologically different language and in the low-resource scenario, given that the TDB 1.2 is one-twentieth the size of the Penn Discourse TreeBank in terms of the total number of relations.
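
The two-step flow described in the abstract (realization type first, then sense for explicit and implicit relations only) could be sketched roughly as follows. This is a minimal illustration of the control flow, not the authors' implementation: the names `parse_span`, `ParseResult`, and the toy classifiers are hypothetical, the label strings are assumed PDTB-style placeholders, and the keyword rules merely stand in for the fine-tuned BERT classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Assumption: PDTB-style label names; the abstract does not enumerate them.
SENSE_BEARING_TYPES = {"Explicit", "Implicit"}


@dataclass
class ParseResult:
    realization_type: str
    sense: Optional[str]  # None when no sense classification is performed


def parse_span(
    text: str,
    type_classifier: Callable[[str], str],
    sense_classifiers: Dict[str, Callable[[str], str]],
) -> ParseResult:
    """Step 1: predict the realization type. Step 2: predict a sense,
    but only for explicit and implicit relations."""
    rel_type = type_classifier(text)
    if rel_type not in SENSE_BEARING_TYPES:
        return ParseResult(rel_type, None)
    return ParseResult(rel_type, sense_classifiers[rel_type](text))


# Hypothetical stand-ins for the neural classifiers (keyword rules only).
def toy_type_classifier(text: str) -> str:
    if "because" in text.lower():
        return "Explicit"   # an overt connective is present
    if ". " in text:
        return "Implicit"   # two adjacent sentences, no connective
    return "NoRel"


toy_sense_classifiers = {
    "Explicit": lambda text: "Contingency",  # placeholder sense label
    "Implicit": lambda text: "Expansion",    # placeholder sense label
}

print(parse_span("She stayed home because it rained.",
                 toy_type_classifier, toy_sense_classifiers))
```

Keeping one sense classifier per realization type mirrors the abstract's separate treatment of explicit and implicit relations; in a real system each callable would wrap a fine-tuned model.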

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software
