Using Twitter to collect a multi-dialectal corpus of Albanian using advanced geotagging and dialect modeling

Author:

Canhasi Ercan, Shijaku Rexhep

Abstract

In this study, we present the acquisition and categorization of a geographically informed, multi-dialectal Albanian National Corpus derived from Twitter data. We consider the primary dialects of three distinct regions: Albania, Kosovo, and North Macedonia. The assembled, publicly available dataset comprises anonymized user information, user-generated tweets, auxiliary tweet-related data, and dialect-category annotations. Using a highly automated scraping approach, we initially identified over 1,000 Twitter users with discernible locations who actively use at least one of the targeted Albanian dialects. Subsequent data extraction phases augmented this preliminary dataset with an additional 1,500 Twitter users. The study also explores the application of advanced geotagging techniques to expedite corpus generation. Alongside experiments with diverse classification methods, we conducted comprehensive feature engineering and feature selection investigations. A subjective assessment with human annotators shows that humans achieve significantly lower accuracy than machine learning (ML) models. Our findings indicate that ML algorithms can accurately differentiate Albanian dialects, even when analyzing individual tweets. A close examination of the most salient attributes of the top-performing algorithms provides insight into how these models reach their decisions. Remarkably, our investigation revealed numerous dialectal patterns that, although familiar to human annotators, have not been widely acknowledged within the broader scientific community.

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

