PARALLEL CORPUS OF THE KAZAKH AND RUSSIAN LANGUAGES: DEVELOPMENT, OPERATION AND PROBLEMS

Authors:

Ashimbaeva N. M. (1), Bisengali A. Z. (1), Kulmanov S. K. (1), Ayazbaev G. M. (1), Nurlan M. (1)

Affiliation:

1. A. Baitursynuly Institute of Linguistics

Abstract

The paper gives a brief overview of the history of linguistic corpora and describes their classification by various criteria, as well as the types of parallel subcorpora. The original Kazakh text of M. Auezov's epic novel «Abai Zholy» and its Russian translation by A. Kim were manually aligned at the paragraph (sentence) level in a parallel subcorpus that is being developed as part of the national corpus of the Kazakh language. The parallel subcorpus was developed with Microsoft Office Excel, Notepad++, Python, Django and MySQL. Its software architecture and order of operation can be summarized as follows: 1) texts in the two languages were collected in Excel and aligned manually at the paragraph (sentence) level; 2) the aligned texts were loaded directly from the Excel file into the MySQL database management system; 3) the loaded texts were sorted in the Notepad++ text editor and their statistics were obtained; 4) the Django web server was used to publish the sorted texts on the Internet and to serve user requests; 5) the Processing.py program, written in Python and equipped with a search function, was used to connect the Django web server to the MySQL database management system; 6) the software architecture of the parallel subcorpus was built on the client-server and MVC (Model-View-Controller) patterns. The parallel subcorpus consists of a database of aligned texts, markup, meta-markup and a search engine. The meta-markup, that is, the information about each text entered into the subcorpus, includes the following parameters: author, translator, title of the work, title of the translation, publication date of the work, translation period, original language and translation language. The search engine lets users find the desired item by the following parameters: word, phrase, sentence and capital letters (in Kazakh and Russian).
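The storage and search steps described above can be illustrated with a minimal sketch. It is not the authors' Processing.py, whose code is not given here. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in for MySQL, and the table names, column names, the functions `build_db` and `search`, and the two sample sentence pairs are all hypothetical and are not taken from the corpus.

```python
import sqlite3

# Invented sample data for illustration only; not text from the corpus.
SAMPLE_META = {
    "author": "M. Auezov",
    "translator": "A. Kim",
    "title": "Abai Zholy",
    "source_lang": "kk",
    "target_lang": "ru",
}
SAMPLE_PAIRS = [
    ("Абай ауылға келді.", "Абай приехал в аул."),
    ("Күн батты.", "Солнце село."),
]


def build_db(pairs, meta):
    """Create an in-memory database holding one work and its aligned segments."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE work (id INTEGER PRIMARY KEY, author TEXT, translator TEXT,"
        " title TEXT, source_lang TEXT, target_lang TEXT)"
    )
    conn.execute(
        "CREATE TABLE segment (id INTEGER PRIMARY KEY,"
        " work_id INTEGER REFERENCES work(id), kk TEXT NOT NULL, ru TEXT NOT NULL)"
    )
    cur = conn.execute(
        "INSERT INTO work (author, translator, title, source_lang, target_lang)"
        " VALUES (:author, :translator, :title, :source_lang, :target_lang)",
        meta,
    )
    work_id = cur.lastrowid
    # Each row keeps the Kazakh segment next to its Russian counterpart.
    conn.executemany(
        "INSERT INTO segment (work_id, kk, ru) VALUES (?, ?, ?)",
        [(work_id, kk, ru) for kk, ru in pairs],
    )
    conn.commit()
    return conn


def search(conn, query, lang="kk", match_case=False):
    """Return aligned (kk, ru) pairs whose `lang` side contains `query`."""
    if lang not in ("kk", "ru"):
        raise ValueError("lang must be 'kk' or 'ru'")
    needle = query if match_case else query.casefold()
    hits = []
    for kk, ru in conn.execute("SELECT kk, ru FROM segment ORDER BY id"):
        haystack = kk if lang == "kk" else ru
        if not match_case:
            haystack = haystack.casefold()
        if needle in haystack:
            hits.append((kk, ru))
    return hits


if __name__ == "__main__":
    conn = build_db(SAMPLE_PAIRS, SAMPLE_META)
    for kk, ru in search(conn, "абай", lang="kk"):
        print(kk, "|", ru)
```

The matching is done in Python rather than with SQL `LIKE` because SQLite's built-in `LIKE` is case-insensitive only for ASCII characters, which would not cover Cyrillic text. The `match_case` flag is one possible reading of the "capital letters" search parameter.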
The paper describes the Kazakh and Russian interface of the parallel subcorpus and the results page shown after a search by one of the search parameters. For the text in both languages, the total number of words, the number of unique words and the number of sentences were determined, along with the counts and percentage shares of the ten most frequently used words. In addition, aligning the original Kazakh text of the epic novel with its Russian translation at the paragraph (sentence) level revealed three kinds of correspondence between paragraphs (sentences): 1) they are approximately equivalent in structure, that is, in the number of words used; 2) they approximately coincide in content; 3) they coincide in neither structure nor content: some paragraphs (sentences) of the Kazakh original are translated into Russian incorrectly, superficially or in abridged form, so that only their approximate meaning is conveyed.
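The word statistics mentioned above (total words, unique words, sentences, and the ten most frequent words with their percentage shares) can be computed along the following lines. This is a sketch under stated assumptions, not the procedure the authors used: the function name `corpus_stats` is hypothetical, a word is taken to be any run of Unicode letters, case is ignored, and sentences are split naively on terminal punctuation.

```python
import re
from collections import Counter

# A run of Unicode letters (no digits, no underscore); this covers Kazakh
# and Russian Cyrillic. Hyphenated words are split into their parts.
WORD_RE = re.compile(r"[^\W\d_]+")
# Naive sentence boundary: one or more terminal punctuation marks.
SENT_RE = re.compile(r"[.!?…]+")


def corpus_stats(text, top_n=10):
    """Return token, type and sentence counts plus the top_n most frequent words."""
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    counts = Counter(tokens)
    total = len(tokens)
    top = [
        (word, n, round(100 * n / total, 2))  # (word, count, share in percent)
        for word, n in counts.most_common(top_n)
    ]
    sentences = [s for s in SENT_RE.split(text) if s.strip()]
    return {
        "total_words": total,
        "unique_words": len(counts),
        "sentences": len(sentences),
        "top": top,
    }


if __name__ == "__main__":
    # Invented sample sentence pair sides; not text from the corpus.
    print(corpus_stats("Бала келді. Бала кетті!"))
```

Running the same function on each language side separately gives figures that can be compared between the original and the translation.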

Publisher

A. Baitursynuly Institute of Linguistics

