Author:
Felipe Viegas, Leonardo Rocha, Marcos André Gonçalves
Abstract
The ability to represent data in meaningful and tractable ways is crucial for Natural Language Processing (NLP) applications. This Ph.D. dissertation focused on proposing, designing, and evaluating a novel textual document representation that exploits the "best of both worlds": efficient and effective frequentist information (TF-IDF representations) combined with semantic information derived from word embedding representations. In more detail, our proposal, called CluWords, groups syntactically and semantically related words into clusters and applies domain-specific, application-oriented filtering and weighting schemes over them to build powerful document representations especially tuned for the task at hand. We apply the novel CluWords concept to four NLP applications: topic modeling, hierarchical topic modeling, sentiment lexicon building, and sentiment analysis. Some of the novel contributions of this dissertation include: (i) the introduction of a new data representation composed of three general steps (clustering, filtering, and weighting), specially designed to overcome task-specific challenges related to noise and lack of information; (ii) the design of CluWords components capable of improving the effectiveness of topic modeling, hierarchical topic modeling, and sentiment analysis applications; and (iii) the proposal of two new topic quality metrics to assess the topical quality of hierarchical structures. Our extensive experimentation demonstrates that CluWords produce state-of-the-art results in topic modeling and hierarchical topic modeling. For sentiment analysis, our experiments show that CluWords' filtering and weighting can mitigate semantic noise, surpassing powerful Transformer architectures on the task. All code and datasets produced in this dissertation are available for replication. Our results were published in some of the most important conferences and journals of the field, as detailed in this document. Our work was supported by two Google Research Awards.
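To make the three-step pipeline (clustering, filtering, weighting) concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the dissertation's implementation: the toy corpus, the random placeholder embeddings (standing in for pretrained word2vec/fastText-style vectors), the similarity threshold `alpha`, and the IDF-style reweighting are all simplifications introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; random vectors stand in for real pretrained embeddings
# (both are assumptions for illustration only).
docs = [
    "the movie was great and the acting was great",
    "terrible plot and terrible acting",
    "great plot great movie",
]
rng = np.random.default_rng(42)

# Frequentist side: plain term-frequency counts per document.
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs).toarray().astype(float)
vocab = vectorizer.get_feature_names_out()
emb = rng.normal(size=(len(vocab), 50))  # placeholder embeddings

# Step 1 (clustering): a word's "CluWord" is the set of vocabulary words
# whose embedding cosine similarity to it is high.
# Step 2 (filtering): similarities below a threshold are zeroed out.
alpha = 0.4  # hypothetical threshold; task-dependent in the dissertation
sim = cosine_similarity(emb)
sim[sim < alpha] = 0.0

# Step 3 (weighting): spread raw counts over the filtered similarity
# matrix (semantic smoothing), then apply an IDF-style reweighting.
cluword_tf = tf @ sim
df = np.count_nonzero(cluword_tf, axis=0)   # document frequency per CluWord
idf = np.log(len(docs) / np.maximum(df, 1))
cluword_tfidf = cluword_tf * idf            # final document representation

print(cluword_tfidf.shape)  # (n_docs, n_vocab): one CluWord per vocab term
```

The resulting matrix can be fed to a topic model or a sentiment classifier in place of a plain TF-IDF matrix; the key design point is that a document now receives weight for words it never contains, provided they are embedding-similar to words it does.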
Publisher
Sociedade Brasileira de Computação - SBC