PARALLEL CORPUS OF THE KAZAKH AND RUSSIAN LANGUAGES: DEVELOPMENT, OPERATION AND PROBLEMS

Published: 2023-06-30
Issue: 2
Pages: 49-61
ISSN: 2709-135X
Journal: Tiltanym

Author:

Ashimbaeva N. M.¹, Bisengali A. Z.¹, Kulmanov S. K.¹, Ayazbaev G. M.¹, Nurlan M.¹

Affiliation:

1. A. Baitursynuly Institute of Linguistics

Abstract

The paper gives a brief overview of the history of linguistic corpora and describes their classification by various criteria, including the types of parallel subcorpora. The original Kazakh text of M. Auezov's epic novel «Abai Zholy» and its Russian translation by A. Kim were manually aligned at the paragraph (sentence) level in a parallel subcorpus that is being developed as part of the national corpus of the Kazakh language.

The parallel subcorpus was developed with Microsoft Office Excel, Notepad++, Python, Django, and MySQL. Its software architecture and order of operation can be represented as follows: 1) texts in the two languages were collected in Excel and aligned manually at the paragraph (sentence) level; 2) the aligned texts were loaded directly from the Excel file into the MySQL database management system; 3) the loaded texts were sorted with the Notepad++ text editor, and their statistics were obtained; 4) the Django web server was used to publish the sorted texts on the Internet and to serve user requests; 5) the Processing.py program, written in Python and equipped with a search function, was used to connect the Django web server to the MySQL database management system; 6) the software architecture of the parallel subcorpus was built on the client-server and MVC (Model-View-Controller) models.

The parallel subcorpus consists of a database of aligned texts, markup, metamarkup, and a search engine. The metamarkup, that is, the information about each text entered into the subcorpus, includes the following parameters: author, translator, work title, translation title, publication date of the work, translation period, original language, translation language. The search engine allows users to find the desired item by the following parameters: word, phrase, sentence, and capital letters (in Kazakh and Russian).
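The record structure and search described above can be illustrated with a minimal Python sketch. It is not the authors' Processing.py or their MySQL schema, neither of which is shown in the abstract. The class names `TextMeta` and `AlignedPair`, the `search` function, the reading of the "capital letters" parameter as case-sensitive matching, and the sample sentences (which are invented and not taken from the novel) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    # The eight metamarkup parameters listed in the abstract.
    author: str
    translator: str
    work_title: str
    translation_title: str
    publication_date: str
    translation_period: str
    original_language: str
    translation_language: str


@dataclass(frozen=True)
class AlignedPair:
    # One manually aligned paragraph (sentence) pair; hypothetical layout.
    pair_id: int
    kazakh: str
    russian: str
    meta: TextMeta


def search(pairs, query, language="kk", match_case=False):
    """Return the aligned pairs whose text in `language` contains `query`.

    Plain substring matching is used, so the same function covers word,
    phrase, and sentence queries. `match_case=True` is an assumed reading
    of the "capital letters" search parameter.
    """
    hits = []
    for pair in pairs:
        text = pair.kazakh if language == "kk" else pair.russian
        if match_case:
            found = query in text
        else:
            found = query.casefold() in text.casefold()
        if found:
            hits.append(pair)
    return hits


# Metadata values not stated in the abstract are left as "unknown".
meta = TextMeta(
    author="M. Auezov",
    translator="A. Kim",
    work_title="Abai Zholy",
    translation_title="unknown",
    publication_date="unknown",
    translation_period="unknown",
    original_language="Kazakh",
    translation_language="Russian",
)

# Invented placeholder sentences, not text from the corpus.
pairs = [
    AlignedPair(1, "Бала ауылға келді.", "Мальчик приехал в аул.", meta),
    AlignedPair(2, "Күн батты.", "Солнце село.", meta),
]

for hit in search(pairs, "бала", language="kk"):
    print(hit.pair_id, hit.kazakh, "|", hit.russian)
```

Because every hit carries both sides of the aligned pair, a results page can show the Kazakh sentence next to its Russian counterpart without a second lookup.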
The paper describes the interface of the parallel subcorpus in Kazakh and Russian and the results interface shown after searching for the desired word through one of the search parameters. The total and unique word counts of the text in the two languages, the number of sentences, and the counts and percentages of the ten most frequently used words in both languages were determined.

In addition, in the process of aligning the original Kazakh text of the epic novel with its Russian translation at the paragraph (sentence) level, the following features were identified: 1) in terms of structure, aligned paragraphs (sentences) contain approximately the same number of words; 2) in terms of content, they approximately coincide; 3) some do not coincide in structure or content: certain paragraphs (sentences) of the Kazakh original are translated into Russian incorrectly, superficially, or in abridged form, so that only their approximate meaning is conveyed.
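The statistics mentioned above (total words, unique words, sentence count, and the ten most frequent words with their counts and percentages) could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure, which relied on Notepad++ and is not detailed in the abstract. The function name `corpus_stats`, the `\w+` tokenizer (which splits hyphenated words and counts digits as words), and case folding before counting are illustrative choices.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of Unicode word characters (covers Cyrillic,
# including Kazakh-specific letters such as ә, ғ, қ).
WORD_RE = re.compile(r"\w+")


def corpus_stats(sentences, top_n=10):
    """Compute basic frequency statistics for one side of the subcorpus."""
    tokens = [tok.casefold() for s in sentences for tok in WORD_RE.findall(s)]
    counts = Counter(tokens)
    total = len(tokens)
    top = [
        (word, n, round(100 * n / total, 2))  # (word, count, percent of all tokens)
        for word, n in counts.most_common(top_n)
    ]
    return {
        "sentences": len(sentences),
        "total_words": total,
        "unique_words": len(counts),
        "top": top,
    }


# Invented placeholder sentences, not text from the corpus.
sample = ["Бала келді.", "Бала кетті."]
stats = corpus_stats(sample)
print(stats["total_words"], stats["unique_words"], stats["top"][0])
```

Running the same function separately on the Kazakh and Russian columns gives comparable figures for the two languages.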

Publisher

A. Baitursynuly Institute of Linguistics

References (34 articles)

1. Svartvik J., Quirk R. (1980) A Corpus of English Conversation. – Lund: Gleerup, 1980. – 284 p. (in English)

2. Francis W. (2022) Brown Corpus Manual: Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English for Use with Digital Computers. [Electron. resource] – URL: http://icame.uib.no/brown/bcm.html (date of review – 01.02.2022). (in English)

3. Hundt M. (2022) Manual of Information to Accompany the Freiburg-Brown Corpus of American English (FROWN). [Electron. resource] – URL: http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM (date of review – 01.02.2022). (in English)

4. Leech G., Smith N. (2005) Extending the possibilities of corpus-based research on English in the twentieth century: A prequel to LOB and FLOB. ICAME Journal. – 2005. № 29. – P. 83-98. (in English)

5. Zhubanov A., Zhanabekova A. (2017) Korpustyq lingvistica [Corpus Linguistics]. – Almaty, 2017. – 318 p. (in Kazakh)

Cited by 1 article.

1. KAZAKH DIGITAL TERMINOLOGY: EXPERIENCE IN DEVELOPING A TERMINOLOGICAL SUBCORPUS (metatext and terminology markup parameters). Tiltanym, 2024-04-30.