LDA filter: A Latent Dirichlet Allocation preprocess method for Weka

Published: 2020-11-09 Issue: 11 Volume: 15 Page: e0241701
ISSN: 1932-6203
Container-title: PLOS ONE
Language: en
Short-container-title: PLoS ONE

Authors:

Celard P., Vieira A. Seara, Iglesias E. L., Borrajo L.

Abstract

This work presents an alternative method for representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms compared with common text representations. LDA assumes that each document deals with a set of predefined topics, each of which is a distribution over the entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as a new filter that extends the Weka software. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on different document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to BoW while greatly improving classification processing times.
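The idea of representing a document by its per-topic probabilities, rather than by per-word counts, can be illustrated with a minimal sketch. This is not the authors' Weka filter: it is a small collapsed Gibbs sampler for LDA written in plain Python, and the function names (`bow_vector`, `lda_topic_vectors`), the hyperparameters (`alpha`, `beta`, `iters`), and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import Counter


def bow_vector(tokens, vocab):
    """Bag of Words baseline: one count per vocabulary term."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def lda_topic_vectors(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimised).

    Returns the vocabulary and one topic-probability vector per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Assign every token to a random topic to start.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed, normalised document-topic proportions: the new features.
    vectors = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        vectors.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return vocab, vectors


# Hypothetical toy corpus, already tokenised.
docs = [
    "gene protein cell gene dna".split(),
    "protein dna cell sequence gene".split(),
    "market stock price trade market".split(),
    "stock trade price shares market".split(),
]
vocab, vectors = lda_topic_vectors(docs, num_topics=2)
for doc, vec in zip(docs, vectors):
    print(len(bow_vector(doc, vocab)), "BoW features ->", [round(p, 3) for p in vec])
```

In this sketch each document shrinks from one feature per vocabulary term to one feature per topic, which is the kind of dimensionality reduction that would plausibly account for shorter classification times; the actual filter and its parameters are described in the paper itself.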

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

References: 28 articles.

1. Frank E. Data mining in bioinformatics using Weka. Bioinformatics; 2004.

2. Nikolaos T, George T. Document classification system based on HMM word map. In: Proceedings of the 5th international conference on Soft computing as transdisciplinary science and technology. CSTST’08. New York, NY, USA: ACM; 2008. p. 7–12.

3. Blei DM. Probabilistic topic models. Communications of the ACM; 2012.

Cited by 12 articles.

1. Modified LDA vector and feedback analysis for short query Information Retrieval systems; Logic Journal of the IGPL; 2024-05-04

2. Transforming Education Policy: Evaluating UAQTE Program Implementation Through LDA, BoW and TF-IDF Techniques; 2024 26th International Conference on Advanced Communications Technology (ICACT); 2024-02-04

3. Personality adjectives in Serbian Tweets: An opening; Primenjena psihologija; 2023-12-28

4. Publication Dynamics on Social Media During the Orpea Nursing Homes Scandal: A Twitter Analysis; Caring is Sharing – Exploiting the Value in Data for Health and Innovation; 2023-05-18

5. Changes in Food Security, Healthfulness, and Access During the Coronavirus Disease 2019 Pandemic: Results From a National United States Survey; Current Developments in Nutrition; 2023-03