Effect of dimension size and window size on word embedding in classification tasks

Published: 2024-07-08

Author:

Držík Dávid¹, Kapusta Jozef¹

Affiliation:

1. Constantine the Philosopher University in Nitra

Abstract

In natural language processing, several approaches transform text into multi-dimensional word vectors, such as TF-IDF (term frequency-inverse document frequency), Word2Vec, and GloVe (Global Vectors), all of which are still widely used. In Word2Vec and GloVe models, the meaning of a word is represented by its context. Syntactic or semantic relationships between words are preserved, and the vector distances between words correspond to how humans perceive the relationships between them. Word2Vec and GloVe generate a vector for each word, which can then be used in further processing. Unlike with GPT, ELMo, or BERT, further text processing does not require a model trained on a corpus. It is important to know how to set the context window size and the dimension size for Word2Vec and GloVe models, because an improper combination of these parameters can lead to low-quality word vectors. In this article, we experimented with these parameters. The results show that the appropriate window size must be chosen according to the embedding method used. As for dimension size, our results indicate that dimensions smaller than 50 are not suitable, while dimensions larger than 150 did not significantly improve the results.
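To make the two parameters in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes a fixed symmetric context window (real Word2Vec and GloVe implementations weight or sample the window differently), and the names `context_pairs` and `embedding_parameters` are hypothetical helpers introduced only for illustration. It shows how the window size changes the number of (center, context) pairs seen during training, and how the dimension size scales the embedding matrix.

```python
from typing import Iterator, List, Tuple


def context_pairs(tokens: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for a fixed symmetric window.

    Simplifying assumption: every word within `window` positions of the
    center counts equally as context.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


def embedding_parameters(vocab_size: int, dimension: int) -> int:
    """Number of values in one vocab_size x dimension embedding matrix."""
    return vocab_size * dimension


sentence = "the cat sat on the mat".split()
for window in (1, 2, 5):
    pairs = list(context_pairs(sentence, window))
    print(f"window={window}: {len(pairs)} training pairs")

for dimension in (50, 150):
    size = embedding_parameters(vocab_size=10_000, dimension=dimension)
    print(f"dimension={dimension}: {size} embedding values")
```

A larger window yields more pairs per sentence and pulls in more distant words, while a larger dimension increases the size of the embedding matrix linearly. The vocabulary size of 10,000 is an arbitrary illustrative value, not a figure from the article.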

Publisher

Springer Science and Business Media LLC
