Improved feature decay algorithms for statistical machine translation

Author:

Poncelas Alberto, Maillette de Buy Wenniger Gideon, Way Andy

Abstract

In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so that they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, we first create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements in translation quality by extending FDA using information from the parallel corpus that is generally ignored.
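The selection idea described in the abstract (score training sentences by the test-set n-grams they contain, then devalue n-grams already covered) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ngrams` and `fda_select`, the initial feature value of 1.0, the multiplicative decay factor of 0.5, the length normalization, and the tie-breaking rule are all assumptions made for illustration.

```python
def ngrams(tokens, max_n=3):
    """Return all n-grams of the token list for n = 1..max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fda_select(test_sentences, train_sentences, k, max_n=3, decay=0.5):
    """Greedily pick up to k training sentences (returned as indices).

    Assumed scheme: every n-gram seen in the test set starts with value 1.0;
    a sentence's score is the sum of the values of its distinct n-grams
    divided by its length; after a sentence is picked, the values of its
    n-grams are multiplied by `decay`.
    """
    # Features are the n-grams of the test set.
    value = {}
    for sentence in test_sentences:
        for gram in ngrams(sentence.split(), max_n):
            value[gram] = 1.0

    candidates = {i: s.split() for i, s in enumerate(train_sentences)}
    selected = []

    def score(tokens):
        feats = set(ngrams(tokens, max_n))
        return sum(value.get(g, 0.0) for g in feats) / max(len(tokens), 1)

    while candidates and len(selected) < k:
        # Highest score wins; ties go to the lowest index (an assumption).
        best = max(candidates, key=lambda i: (score(candidates[i]), -i))
        # Devalue the features covered by the chosen sentence.
        for gram in set(ngrams(candidates[best], max_n)):
            if gram in value:
                value[gram] *= decay
        selected.append(best)
        del candidates[best]
    return selected


if __name__ == "__main__":
    test = ["the cat sat", "a dog ran"]
    train = ["the cat sat", "the cat sat", "a dog ran"]
    print(fda_select(test, train, k=2))
```

In this toy run the duplicate training sentence is passed over on the second pick because its n-grams have already been devalued, which is the diversity effect the abstract refers to. A practical implementation would avoid rescoring every candidate on each iteration, for example by using a priority queue.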

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software


Cited by 4 articles.

1. Cellular automata-based MapReduce design: Migrating a big data processing model from Industry 4.0 to Industry 5.0;e-Prime - Advances in Electrical Engineering, Electronics and Energy;2024-06

2. Research on the Optimal Selection Method of Fuzzy Semantics in English Long Sentence Machine Translation;2022 6th Asian Conference on Artificial Intelligence Technology (ACAIT);2022-12-09

3. English-Chinese Machine Translation Based on Transfer Learning and Chinese-English Corpus;Computational Intelligence and Neuroscience;2022-09-27

4. Design of English Translation Mobile Information System Based on Recurrent Neural Network;Mobile Information Systems;2022-08-10
