NLP ‘RECIPES’ FOR TEXT CORPORA: APPROACHES TO COMPUTING THE PROBABILITY OF A SEQUENCE OF TOKENS

Published:2020 Issue:15 Volume:2 Page:6-13
ISSN:2412-2491
Container-title:Studia Philologica
language:uk
Short-container-title:Studia Philologica

Author:

Porwoł Monika¹

Affiliation:

1. State University of Applied Sciences in Racibórz, Institute of Modern Language Studies

Abstract

Research into hybrid architectures for Natural Language Processing (NLP) requires reconciling several intellectual traditions, including computer science, formal linguistics, logic, digital humanities, and ethics. NLP, a subfield of computer science and artificial intelligence, is concerned with interactions between computers and human (natural) languages. It applies machine learning algorithms to text (and speech) in order to build systems for tasks such as machine translation (converting text in a source language into text in a target language), document summarization (condensing long texts into short ones), named entity recognition, and predictive typing. NLP has undoubtedly become embedded in our daily lives: automatic Machine Translation (MT) is omnipresent in social media and on the World Wide Web, virtual assistants (Siri, Cortana, Alexa, and so on) can recognize natural speech, and e-mail services use detection systems to filter out spam messages. The purpose of this paper, however, is to outline linguistic and NLP approaches to text processing. To that end, the bag-of-n-grams concept is discussed as a way to extract more detail about textual data from strings of grouped words. The n-gram language model presented in this paper, which assigns probabilities to sequences of words in text corpora, is based on findings compiled in Sketch Engine as well as on samples of language data processed with the NLTK library for Python. Why would one want to compute the probability of a word sequence? In many systems the goal is to generate text that is more fluent, so a component is required that computes the probability of the output text. The idea is to collect information on how frequently n-grams occur in a large text corpus and to use it to predict the next word.
Counting occurrences also has certain drawbacks; for instance, there are sometimes problems with sparsity or storage. Nonetheless, the language models and specific computing ‘recipes’ described in this paper can be used in many applications, such as machine translation, summarization, and even dialogue systems. Lastly, it should be pointed out that this paper is part of ongoing work tentatively termed LADDER (Linguistic Analysis of Data in the Digital Era of Research), which touches upon the process of datacization[1] that might help to create an intelligent system of interdisciplinary information.
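
The counting idea in the abstract can be illustrated with a minimal bigram sketch. It is not the author's implementation: the paper works with Sketch Engine and NLTK, whereas this example uses only the Python standard library. The toy corpus and all function names are hypothetical. A sentence probability is approximated as the product of conditional probabilities P(word | previous word), each estimated as count(previous, word) / count(previous). Add-k smoothing is included as one common remedy for the sparsity problem; the abstract does not say which remedy, if any, the paper adopts.

```python
from collections import Counter
from math import prod  # requires Python 3.8+

# Hypothetical toy corpus, already tokenized; for illustration only.
CORPUS = [
    ["i", "like", "green", "tea"],
    ["i", "like", "black", "coffee"],
    ["you", "like", "green", "apples"],
]

BOS, EOS = "<s>", "</s>"  # sentence-boundary markers


def pad(tokens):
    """Wrap a token list in begin/end-of-sentence markers."""
    return [BOS] + list(tokens) + [EOS]


def train(corpus):
    """Count bigrams, their left contexts, and the set of predictable tokens."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        padded = pad(sentence)
        context_counts.update(padded[:-1])       # every token that has a successor
        bigram_counts.update(zip(padded, padded[1:]))
        vocab.update(padded[1:])                 # tokens that can be predicted (includes EOS)
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, word, context_counts, bigram_counts, vocab_size, k=0.0):
    """P(word | prev); k=0 is the plain relative frequency, k>0 is add-k smoothing."""
    denominator = context_counts[prev] + k * vocab_size
    if denominator == 0:
        return 0.0  # unseen context and no smoothing
    return (bigram_counts[(prev, word)] + k) / denominator


def sequence_prob(tokens, context_counts, bigram_counts, vocab_size, k=0.0):
    """Probability of a whole sentence under the bigram (first-order Markov) assumption."""
    padded = pad(tokens)
    return prod(
        bigram_prob(p, w, context_counts, bigram_counts, vocab_size, k)
        for p, w in zip(padded, padded[1:])
    )


context_counts, bigram_counts, vocab = train(CORPUS)
vocab_size = len(vocab)

seen = ["i", "like", "green", "tea"]
unseen = ["you", "like", "black", "tea"]  # contains the unseen bigram ("black", "tea")

print(sequence_prob(seen, context_counts, bigram_counts, vocab_size))
print(sequence_prob(unseen, context_counts, bigram_counts, vocab_size))
print(sequence_prob(unseen, context_counts, bigram_counts, vocab_size, k=1.0))
```

On this toy corpus the seen sentence receives probability 2/9, while the second sentence receives exactly zero without smoothing because one of its bigrams never occurs in the corpus. This is the sparsity drawback in miniature, and setting k=1.0 gives the same sentence a small nonzero probability.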

Publisher

Borys Grinchenko Kyiv University
