Author:
Younes Jihene,Achour Hadhemi,Souissi Emna,Ferchichi Ahmed
Abstract
Language identification is an important task in natural language processing that consists in determining the language of a given text. It has increasingly picked the interest of researchers for the past few years, especially for code-switching informal textual content. In this paper, we focus on the identification of the Romanized user-generated Tunisian dialect on the social web. We segment and annotate a corpus extracted from social media and propose a deep learning approach for the identification task. We use a Bidirectional Long Short-Term Memory neural network with Conditional Random Fields decoding (BLSTM-CRF). For word embeddings, we combine word-character BLSTM vector representation and Fast Text embeddings that takes into consideration character n-gram features. The overall accuracy obtained is 98.65%.
Cited by
4 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献