New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies

Published:2017-10-06 Issue:1 Volume:24 Page:91-122
ISSN:1351-3249
Container-title:Natural Language Engineering
language:en
Short-container-title:Nat. Lang. Eng.

Author:

GARCIA MARCOS, GÓMEZ-RODRÍGUEZ CARLOS, ALONSO MIGUEL A.

Abstract

This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, analyzing its performance when compared to the use of manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks or the adaptation of resources to the target language, among others. The results of these evaluations show that the direct application of a parser from one Romance language to another reaches similar labeled attachment score (LAS) values to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can noticeably increase by performing a focused selection of the source treebanks. Furthermore, the removal of the words in the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned with the performed experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations in this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than using more than 16,000 (for LAS results) and more than 20,000 (UAS) manually labeled tokens of Galician.

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence,Linguistics and Language,Language and Linguistics,Software

References: 66 articles.

1. Synthetic Treebanking for Cross-Lingual Dependency Parsing

2. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

Cited by 6 articles.

1. Training and evaluation of vector models for Galician;Language Resources and Evaluation;2024-06-04

2. A Cross Language Transfer Learning Algorithm for French Corpus Based on Knowledge Distillation;2024 International Conference on Electrical Drives, Power Electronics & Engineering (EDPEE);2024-02-27

3. Bertinho: Galician BERT Representations;Procesamiento del Lenguaje Natural;2021

4. A Database and Visualization of the Similarity of Contemporary Lexicons;Text, Speech, and Dialogue;2021

5. Coarse-Grained vs. Fine-Grained Lithuanian Dependency Parsing;Intelligent Algorithms in Software Engineering;2020