Author:
Kumar S., Kumar M. Anand, Soman K.P.
Abstract
This paper addresses the problem of part-of-speech (POS) tagging for Malayalam tweets. The conversational style of posts and tweets in social media makes it difficult to apply a general POS tagset to such text. For this work, a tagset containing 17 coarse tags was designed, and 9915 tweets were tagged manually for experimentation and evaluation. The tagged data were evaluated using sequential deep learning methods such as recurrent neural network (RNN), gated recurrent unit (GRU), long short-term memory (LSTM), and bidirectional LSTM (BLSTM). The models were trained on the tagged tweets at both word level and character level, and the experiments were evaluated using precision, recall, f1-measure, and accuracy. At word level, the GRU-based model gave the highest f1-measure, 0.9254; at character level, the BLSTM-based model gave the highest f1-measure, 0.8739. To choose a suitable number of hidden states, the number was varied over 4, 16, 32, and 64, and a model was trained for each setting. Increasing the number of hidden states improved the tagger model. This is an initial effort at POS tagging of Malayalam Twitter data using deep learning sequential models.
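The evaluation measures named in the abstract can be illustrated with a short sketch. The abstract does not say how the per-tag scores were averaged, so macro-averaging over tags is an assumption here, and the function name `tagging_scores` and the tag labels in the usage note are hypothetical rather than taken from the paper's tagset.

```python
from collections import Counter


def tagging_scores(gold, predicted):
    """Token-level accuracy plus macro-averaged precision, recall, and F1.

    Assumption: scores are macro-averaged over every tag that appears in
    either sequence; the paper's exact averaging scheme is not stated.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # correct prediction for tag g
        else:
            fp[p] += 1      # tag p was predicted where it does not belong
            fn[g] += 1      # gold tag g was missed

    tags = sorted(set(gold) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for t in tags:
        # A tag with no predictions (or no gold instances) scores 0.0.
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(tags)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

For example, calling `tagging_scores(["NOUN", "VERB", "NOUN", "ADJ"], ["NOUN", "NOUN", "NOUN", "ADJ"])` scores a four-token sequence in which one verb was mistagged as a noun. Macro-averaging weights rare tags as heavily as frequent ones, which matters for a 17-tag set with skewed tag frequencies; a micro-averaged or frequency-weighted variant would give different numbers.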
Subject
Artificial Intelligence, Information Systems, Software
Cited by
30 articles.