Authors:
Tran Van Trung, Le Quang Dao, Pham Bao Son, Luu Viet Hung, Bui Quang Hung
Abstract
Points of interest (POIs) represent geographic locations grouped into categories (e.g., tourist attractions, amenities, or shops) and play a prominent role in many location-based applications. However, most POI category labels are crowd-sourced by the community and are therefore often of low quality. In this paper, we introduce the first annotated dataset for the POI category classification task in Vietnamese. A total of 750,000 POIs were collected from WeMap, a Vietnamese digital map. Because large-scale hand-labeling is inherently time-consuming and labor-intensive, we propose a new approach based on weak labeling. The resulting dataset covers 15 categories, with 275,000 weakly labeled POIs for training and 30,000 gold-standard POIs for testing, making it larger than any existing Vietnamese POI dataset. We conduct POI category classification experiments on our dataset using a strong baseline (fine-tuning a BERT-based model) and find that our approach is efficient and applicable at large scale. The proposed baseline achieves an F1 score of 90% on the test set and significantly improves the accuracy of WeMap POI data, by 37 percentage points (from 56% to 93%).
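The abstract does not describe how the weak labels are produced. As a rough illustration of the general idea only, the sketch below assigns a category to a POI name by letting keyword rules vote and abstaining when no rule fires or the vote is tied. The keyword lists, category names, and the function `weak_label` are hypothetical and are not taken from the paper or from WeMap.

```python
import unicodedata
from collections import Counter

# Hypothetical keyword rules (assumption): each category votes when one of
# its keywords appears in the POI name. Not the paper's actual rules.
KEYWORD_RULES = {
    "restaurant": ("nhà hàng", "quán ăn"),
    "cafe": ("cà phê", "coffee"),
    "hotel": ("khách sạn", "hotel"),
}


def _normalize(text):
    """Lowercase and apply NFC so accented Vietnamese strings compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


def weak_label(name):
    """Return a weak category label for a POI name, or None to abstain."""
    normalized = _normalize(name)
    votes = Counter(
        category
        for category, keywords in KEYWORD_RULES.items()
        if any(_normalize(keyword) in normalized for keyword in keywords)
    )
    if not votes:
        return None  # no rule fired
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # conflicting rules: abstain rather than guess
    return ranked[0][0]
```

POIs on which the rules abstain would simply be left out of the weakly labeled training set in a setup like this; a classifier trained on the rest can then be evaluated on hand-labeled data.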