Author:
Sadeghi Reyhaneh,Akbari Ahmad,Jaziriyan Mohammad Mehdi
Abstract
Twitter is a rich resource for analyzing social media content, and extracting the age groups of its users can benefit recommender systems, marketing, and advertising. Age detection is one aspect of inferring users' demographic information. In this study, we present a large-scale corpus of Arabic Twitter users comprising 181k user profiles across six age groups: −18, 18–24, 25–34, 35–49, 50–64, and +65. The corpus was created using four methods: (1) collecting publicly available birthday announcement tweets through the Twitter Search application programming interface, (2) augmenting data, (3) fetching verified accounts, and (4) manual annotation. To find the age detection model with the highest accuracy and efficiency on this corpus, we test several configurations, including the number of tweets per user, regression vs. classification, the use of user and tweet metadata, and an LSTM+CNN model vs. BERT. The presented methodology is based on language and metadata features; the final model is fine-tuned with BERT on 70k users and evaluated on 8200 manually annotated users. Compared with the LSTM+CNN model and a similar BERT-based model, our best model improves F1-score by up to 9% and accuracy by 5%, respectively. The model achieves a macro-averaged F1-score of 44 on the six age groups and an F1-score of 58 on three age groups: −25, 25–34, and +35. The proposed data is available at www.github.com/exaco/ExaAUAC.
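As an illustration of the label scheme and the metric named in the abstract, the following minimal sketch maps an age to the six-group and three-group labels and computes a macro-averaged F1-score. It is not the authors' code: the function names `age_group` and `macro_f1` are hypothetical, and the boundary handling (for example, that age 18 falls in 18–24 and age 65 in +65) is an assumption read off the group labels.

```python
from bisect import bisect_right

# Lower bounds of every group after the first (assumed from the labels).
SIX_BOUNDS = [18, 25, 35, 50, 65]
SIX_LABELS = ["-18", "18-24", "25-34", "35-49", "50-64", "+65"]
THREE_BOUNDS = [25, 35]
THREE_LABELS = ["-25", "25-34", "+35"]


def age_group(age, bounds, labels):
    """Return the label of the group whose range contains `age`."""
    return labels[bisect_right(bounds, age)]


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over `labels`."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    ages = [16, 22, 30, 41, 58, 70]
    print([age_group(a, SIX_BOUNDS, SIX_LABELS) for a in ages])
    print([age_group(a, THREE_BOUNDS, THREE_LABELS) for a in ages])
```

Because macro averaging weights every class equally, a model that does poorly on a sparsely populated group is penalized as much as one that does poorly on a large group, which is one reason the six-group score can sit well below the three-group score.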
Publisher
Springer Science and Business Media LLC