A Syllable-Based Technique for Uyghur Text Compression

Published:2020-03-23 Issue:3 Volume:11 Page:172
ISSN:2078-2489
Container-title:Information
language:en

Author:

Abliz Wayit, Wu Hao, Maimaiti Maihemuti, Wushouer Jiamila, Abiderexiti Kahaerjiang, Yibulayin Tuergen, Wumaier Aishan

Abstract

To improve the utilization of text storage resources and the efficiency of data transmission, we propose two syllable-based compression coding schemes for Uyghur text. First, based on syllable coverage statistics from the corpus text, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols, such as punctuation marks and ASCII characters, to them. Second, to enable the coding schemes to process Uyghur text mixed with symbols from other languages, we introduced a flag code in the compression process to distinguish Unicode-encoded characters that were not in the code table. Experiments showed that the 12-bit coding scheme achieved an average compression ratio of 0.3 on Uyghur texts smaller than 4 KB and that the 16-bit coding scheme achieved an average compression ratio of 0.5 on texts smaller than 2 KB. Both schemes outperformed GZip, BZip2, and the LZW algorithm on short texts and can be applied effectively to compressing short Uyghur text for storage and applications.
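
The idea of a fixed-width code table with a flag code can be illustrated by a minimal Python sketch. It is not the authors' implementation: the abstract gives no details of syllable segmentation, table contents, or flag layout, so everything below is an assumption. The sketch assumes that the input is already split into syllable tokens, that the all-ones 12-bit value serves as the flag, and that each out-of-table character follows the flag as a single 16-bit code unit. The names build_table, encode, and decode are hypothetical, and the table entries are toy Latin placeholders rather than real corpus syllables.

```python
from typing import Dict, List

CODE_BITS = 12                 # code width, matching the 12-bit scheme in the abstract
FLAG = (1 << CODE_BITS) - 1    # assumption: the all-ones code is reserved as the flag
ESCAPE_BITS = 16               # assumption: an escaped character is one 16-bit code unit


def build_table(units: List[str]) -> Dict[str, int]:
    """Assign consecutive codes to table units, keeping FLAG unused."""
    if len(units) > FLAG:
        raise ValueError("too many units for the chosen code width")
    return {unit: code for code, unit in enumerate(units)}


def encode(tokens: List[str], table: Dict[str, int]) -> bytes:
    """Pack tokens into fixed-width codes; escape unknown characters with FLAG."""
    bits = []
    for token in tokens:
        if token in table:
            bits.append(format(table[token], f"0{CODE_BITS}b"))
            continue
        # Token is not in the table: emit FLAG plus the raw code point per character.
        for ch in token:
            if ord(ch) > 0xFFFF:
                raise ValueError("this sketch only escapes BMP characters")
            bits.append(format(FLAG, f"0{CODE_BITS}b"))
            bits.append(format(ord(ch), f"0{ESCAPE_BITS}b"))
    bitstring = "".join(bits)
    bitstring += "0" * (-len(bitstring) % 8)   # zero-pad to a whole number of bytes
    return bytes(int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8))


def decode(data: bytes, table: Dict[str, int]) -> str:
    """Invert encode(); trailing bits shorter than one code are padding."""
    inverse = {code: unit for unit, code in table.items()}
    bits = "".join(format(byte, "08b") for byte in data)
    out = []
    pos = 0
    while pos + CODE_BITS <= len(bits):
        code = int(bits[pos:pos + CODE_BITS], 2)
        pos += CODE_BITS
        if code == FLAG:
            out.append(chr(int(bits[pos:pos + ESCAPE_BITS], 2)))
            pos += ESCAPE_BITS
        else:
            out.append(inverse[code])
    return "".join(out)


if __name__ == "__main__":
    # Toy table and pre-segmented input; "é" is deliberately left out of the table.
    table = build_table(["kit", "ap", "ba", "la", " "])
    tokens = ["kit", "ap", " ", "ba", "la", "é"]
    packed = encode(tokens, table)
    text = "".join(tokens)
    assert decode(packed, table) == text
    # Ratio taken here as compressed size over UTF-8 size (an assumed definition).
    print(len(packed) / len(text.encode("utf-8")))
```

In this sketch every in-table unit costs 12 bits regardless of its length, while an out-of-table character costs 28 bits (flag plus code unit), so the scheme only pays off when the table covers most of the text. Because no dictionary has to be learned from the input, there is no per-message start-up overhead, which is one plausible reason a fixed table can do well on short texts.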

Funder

National Natural Science Foundation of China

National Key Research and Development Project of China

Publisher

MDPI AG

Subject

Information Systems

Link

https://www.mdpi.com/2078-2489/11/3/172/pdf
