Comparison of text preprocessing methods

Published:2022-06-13 Page:1-45
ISSN:1351-3249
Container-title:Natural Language Engineering
language:en
Short-container-title:Nat. Lang. Eng.

Author:

Chai Christine P.

Abstract

Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.
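
As a rough illustration of how several of the methods named in the abstract can be chained, the following is a minimal sketch using only the Python standard library. It is not taken from the article: the function names, the regular expressions, and the tiny stopword list are all hypothetical choices made for this example. Stemming, lemmatization, and multiword-expression detection are left out because they usually rely on external linguistic resources.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "for", "in"}


def strip_formatting(text):
    """Remove HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase the text and extract word tokens.

    Punctuation is dropped, except for an apostrophe inside a word
    (so "don't" stays a single token).
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text, n=2):
    """Run the steps in order and return (tokens, n-grams)."""
    tokens = remove_stopwords(tokenize(strip_formatting(text)))
    return tokens, ngrams(tokens, n)


if __name__ == "__main__":
    tokens, bigrams = preprocess("<p>The corpus is ready for modeling.</p>")
    print(tokens)
    print(bigrams)
```

The order of the steps is itself a design choice. Here stopwords are removed before n-gramming, so a bigram can join words that were not adjacent in the original text; an application that depends on exact phrases may prefer the opposite order.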

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software

Reference: 475 articles.

1. Polignano, M., Basile, P., De Gemmis, M., Semeraro, G. and Basile, V. (2019). AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on Tweets. In 6th Italian Conference on Computational Linguistics, CLiC-it 2019, vol. 2481. CEUR Workshop Proceedings, pp. 1–6.

2. Clough, P. (2001). A Perl program for sentence splitting using rules. Technical report, University of Sheffield, Sheffield, United Kingdom.

3. RelEx--Relation extraction using dependency parse trees

4. Stopword Graphs and Authorship Attribution in Text Corpora

5. Getting the ‘Correct’ Answer from Survey Responses: A Simple Application of the EM Algorithm

Cited by 7 articles.

1. Harvesting Natural Disaster Reports from Social Media with 1D Convolutional Neural Network and Long Short-Term Memory;2023 Eighth International Conference on Informatics and Computing (ICIC);2023-12-08

2. Financial sentiment analysis: Classic methods vs. deep learning models;Intelligent Decision Technologies;2023-11-20

3. Intelligent Web Service System for Detecting Cyberbullying on Twitter Based on Support Vector Machine and Random Forest Algorithms;2023 International Conference on Converging Technology in Electrical and Information Engineering (ICCTEIE);2023-10-25

4. Authorship Attribution on Short Texts in the Slovenian Language;Applied Sciences;2023-10-04

5. Public sentiment toward renewable energy in Morocco: opinion mining using a rule-based approach;Social Network Analysis and Mining;2023-09-25