Affiliation:
1. Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
2. Department of Computer Science and Application, Indian Institute of Science Education and Research Kolkata, Mohanpur, India
Abstract
A large fraction of the textual data available today contains various types of “noise,” such as OCR noise in digitized documents and noise due to the informal writing style of users on microblogging sites. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that does not need any training data or human intervention. The proposed algorithm is applicable to text in different languages and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization through the proposed algorithm enables better retrieval and stance detection than normalization with several baseline methods.
Funder
Building Healthcare Informatics Systems Utilising Web Data
Department of Science & Technology, Government of India
NVIDIA Corporation
Titan Xp GPU
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems and Management, Information Systems
Cited by
5 articles.
1. Robust Cyberbullying Detection in Diverse Textual Noise;Lecture Notes in Computer Science;2024
2. Lexical Normalization Using Generative Transformer Model (LN-GTM);International Journal of Computational Intelligence Systems;2023-11-14
3. An end-to-end pipeline for historical censuses processing;International Journal on Document Analysis and Recognition (IJDAR);2023-03-17
4. Research on the application of MOSL information retrieval method in educational resource management;International Journal of Knowledge-Based Development;2023
5. Stance Detection using Two Popular Benchmarks: A Survey;2022 2nd International Conference on Emerging Smart Technologies and Applications (eSmarTA);2022-10-25