NLP ‘RECIPES’ FOR TEXT CORPORA: APPROACHES TO COMPUTING THE PROBABILITY OF A SEQUENCE OF TOKENS

Author:

Porwoł Monika1

Affiliation:

1. State University of Applied Sciences in Racibórz, Institute of Modern Language Studies

Abstract

Research on hybrid architectures for Natural Language Processing (NLP) requires navigating the complexity of several intellectual traditions, including computer science, formal linguistics, logic, digital humanities, and ethics. NLP, a subfield of computer science and artificial intelligence, is concerned with interactions between computers and human (natural) languages. It applies machine learning algorithms to text (and speech) in order to build systems for tasks such as machine translation (converting text in a source language into text in a target language), document summarization (condensing long texts into short ones), named entity recognition, and predictive typing. NLP has undoubtedly become embedded in our daily lives: automatic Machine Translation (MT) is omnipresent in social media and on the World Wide Web, virtual assistants (Siri, Cortana, Alexa, and so on) can recognize a natural voice, and e-mail services use detection systems to filter out spam messages. The purpose of this paper is to outline linguistic and NLP methods of textual processing. To that end, the bag-of-n-grams concept is discussed as an approach to extracting more detail about textual data from a string of grouped words. The n-gram language model presented in this paper, which assigns probabilities to sequences of words in text corpora, is based on findings compiled in Sketch Engine as well as samples of language data processed by means of the NLTK library for Python. Why would one want to compute the probability of a word sequence? In many task-oriented systems the goal is to generate more fluent text, which requires a component that computes the probability of the output text. The idea is to collect information on how frequently n-grams occur in a large text corpus and to use it to predict the next word.
Counting occurrences also has certain drawbacks, for instance problems with sparsity or storage. Nonetheless, the language models and specific computing ‘recipes’ described in this paper can be used in many applications, such as machine translation, summarization, and even dialogue systems. Lastly, this paper is part of ongoing work tentatively termed LADDER (Linguistic Analysis of Data in the Digital Era of Research), which touches upon the process of datacization[1] that might help to create an intelligent system of interdisciplinary information.
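
The counting idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's own code: it assumes a bigram model (n = 2), a three-sentence toy corpus invented for illustration, and add-k smoothing as one common way to soften the sparsity problem. The function names and the boundary markers `<s>` and `</s>` are likewise assumptions, and only the Python standard library is used, although NLTK offers comparable utilities.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count unigrams and bigrams over tokenized sentences.

    Each sentence is padded with <s> and </s> (assumed boundary markers).
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams, k=0.0):
    """Estimate P(word | prev) from counts.

    k = 0 gives the plain relative-frequency estimate
    count(prev, word) / count(prev); k > 0 applies add-k smoothing.
    """
    denom = unigrams[prev] + k * len(unigrams)
    if denom == 0:
        return 0.0
    return (bigrams[(prev, word)] + k) / denom


def sequence_logprob(tokens, unigrams, bigrams, k=0.0):
    """Log-probability of a token sequence as a sum of bigram log-probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        p = bigram_prob(prev, word, unigrams, bigrams, k)
        if p == 0.0:
            return float("-inf")  # an unseen bigram zeroes the whole product
        total += math.log(p)
    return total


def predict_next(prev, unigrams, bigrams, k=0.0):
    """Return the vocabulary item with the highest estimated P(word | prev)."""
    return max(unigrams, key=lambda w: bigram_prob(prev, w, unigrams, bigrams, k))


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ["i", "like", "tea"],
    ["i", "like", "coffee"],
    ["you", "like", "tea"],
]
unigrams, bigrams = train_bigram_counts(corpus)

seen = sequence_logprob(["i", "like", "tea"], unigrams, bigrams)
unseen = sequence_logprob(["tea", "like", "i"], unigrams, bigrams)
smoothed = sequence_logprob(["tea", "like", "i"], unigrams, bigrams, k=1.0)
next_word = predict_next("like", unigrams, bigrams)
```

On this toy corpus the unsmoothed estimate gives the word order "tea like i" a probability of zero simply because its bigrams never occur, which is the sparsity drawback in miniature. Setting k = 1 gives it a small non-zero probability instead, at the cost of taking some probability mass away from observed bigrams.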

Publisher

Borys Grinchenko Kyiv University

References (20 articles)

1. Abend, O. & Rappoport, A. ‘The State of the Art in Semantic Representation’. Proceedings of the Association for Computational Linguistics (ACL). [Available online]: https://www.aclweb.org/anthology/P17-1008.pdf

2. Ahmed, B., Cha, S. H. & Tappert, C. (2004). ‘Language Identification from Text Using N-gram Based Cumulative Frequency Addition’. Proceedings of Student/Faculty Research Day, CSIS, Pace University.

3. Akmajian, A., Demers, R. A., Farmer, A. K. & Harnish, R. M. (1997). Linguistics: An Introduction to Language and Communication. 4th ed., MIT Press, Cambridge, MA.

4. Briscoe, T. (2013). ‘Introduction to Linguistics for Natural Language Processing’. [Available online]: https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf

5. Brown, R. D. (2012). ‘Finding and Identifying Text in 900+ Languages’. Digital Investigation, 9, pp. 34–43. [Available online]: https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf
