Design of New Word Retrieval Algorithm for Chinese-English Bilingual Parallel Corpus

Author:

Zhang Liting¹

Affiliation:

1. Faculty of Humanities, Gansu Agricultural University, Lanzhou 730070, China

Abstract

Natural language processing is an important research direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Machine translation, a branch of natural language processing research, relies on large-scale English-Chinese bilingual data. Because sentence-aligned English-Chinese corpora that contain unknown words are relatively poor, machine translation output tends to be unprofessional and uneven in quality, which is the problem this paper addresses. The purpose of this paper is to design and implement a length-based system for sentence alignment between English and Chinese bilingual texts. The research content is divided into the following parts. First, an evaluation function for bilingual sentence alignment is designed; on this basis, a length-based bilingual sentence alignment algorithm and a search algorithm for the optimal sentence-pair sequence are designed. China National Knowledge Infrastructure (CNKI) is selected as the candidate English-Chinese bilingual website, and its English-Chinese bilingual web pages are downloaded. After the downloaded pages are analyzed, nontext content such as page tags is removed and the bilingual text is stored, establishing a segment-aligned English-Chinese bilingual corpus that retains the English-Chinese bilingual keywords found in the web pages. Second, a dictionary is extracted from the StarDict software; its original format is analyzed and converted into a custom dictionary format that is more convenient for the bilingual sentence alignment system and that makes it easier to expand the number of dictionaries and increase the professionalism of the vocabulary. Third, the stems of English words are extracted from the established corpus to simplify English word processing, reduce the noise caused by changes in part of speech, and improve operating efficiency. A length-based bilingual sentence alignment system is then implemented. Finally, the system parameters are adjusted in comparative experiments to test the system's performance.
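The abstract does not give the evaluation function or the search procedure, so the following is only a minimal sketch of what a length-based aligner with an optimal sentence-pair sequence search could look like, not the paper's implementation. It assumes sentence length is measured in characters, that the expected target length is a fixed multiple (`ratio`) of the source length, and that only a few alignment types are allowed. The names `BEADS`, `length_cost`, and `align`, the penalty values, and the default `ratio` are all hypothetical choices made for illustration.

```python
import math

# Allowed alignment types as (source sentences, target sentences),
# each with an illustrative penalty (assumed values, not from the paper).
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len, tgt_len, ratio):
    """Penalise target lengths that deviate from ratio * source length."""
    expected = ratio * src_len
    return abs(tgt_len - expected) / math.sqrt(expected + 1.0)


def align(src, tgt, ratio=1.6):
    """Return the minimum-cost sequence of (source_indices, target_indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Dynamic programming: cost[i][j] is the cheapest way to align the
    # first i source sentences with the first j target sentences.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len, ratio)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Backtrack from the end to recover the optimal sentence-pair sequence.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

Calling `align(chinese_sentences, english_sentences)` returns a list of index groups, for example a pair of one source index and one target index for a one-to-one alignment. In practice `ratio` and the penalties would be the kind of system parameters tuned in comparative experiments.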

Funder

Gansu Agricultural University

Publisher

Hindawi Limited

Subject

General Engineering, General Mathematics
