Authors
Rodrigues João Pedro, Paraiso Emerson
Abstract
This work analyzes the technical feasibility of working with audio transcriptions from YouTube and presents a method covering data acquisition, pre-processing, and post-processing for this type of data. Topics are modeled with the Latent Dirichlet Allocation (LDA) algorithm. An approach is also presented for dynamically determining the ideal number of topics in a given corpus. The experiments used a database of 250 audio transcriptions and obtained a model with coherence of around 40%.
Publisher
Sociedade Brasileira de Computação
Cited by
1 article.
1. Identifying Sponsored Content in YouTube using Information Extraction; 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC); 2021-10-17