Recognizing Indonesian Acronym and Expansion Pairs with Supervised Learning and MapReduce

Published: 2020-04-15 Issue: 4 Volume: 11 Page: 210
ISSN: 2078-2489
Container-title: Information
Language: en
Short-container-title: Information

Author:

Abidin Taufik Fuadi, Mahazir Amir, Subianto Muhammad, Munadi Khairul, Ferdhiana Ridha

Abstract

Over the past decades, the intelligent identification of acronym and expansion pairs in large corpora has attracted considerable research attention, particularly in text mining, entity extraction, and information retrieval. Herein, we present an improved approach for accurately recognizing acronym and expansion pairs in a large Indonesian corpus. In general, an acronym is either a combination of uppercase letters or a sequence of speech sounds (syllables). The proposed approach consists of four steps: (1) acronym candidate identification; (2) acronym and expansion pair collection; (3) feature generation; and (4) acronym and expansion pair recognition using supervised learning techniques. We introduce eight numerical features and evaluate how effectively they represent acronym and expansion pairs in terms of precision, recall, and F-measure. We also compare the k-nearest neighbors (K-NN), support vector machine (SVM), and bidirectional encoder representations from transformers (BERT) algorithms on the classification of acronym and expansion pairs. The experimental results indicate that the SVM polynomial model with eight features achieves the highest accuracy (97.93%), surpassing the SVM polynomial model with five features (90.45%), the K-NN algorithm with k = 3 and eight features (96.82%), the K-NN algorithm with k = 3 and five features (95.66%), the BERT-Base model (81.64%), and the BERT-Base Multilingual Cased model (88.10%). Moreover, we analyze the performance of Hadoop with various numbers of data nodes when identifying acronym and expansion pairs and generating their feature vectors. The results reveal that a Hadoop cluster with more data nodes is faster than one with fewer data nodes when processing from ten million to one hundred million acronym and expansion pairs.
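The first steps of the pipeline and its evaluation metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the candidate rules or the eight features, so the parenthesis-based pattern, the `collect_pairs` helper, and the `initials_match` feature below are all hypothetical and cover only uppercase-letter acronyms, not syllable-based ones. The `precision_recall_f1` function uses the standard definitions of precision, recall, and F-measure.

```python
import re

# Assumption: an uppercase-letter acronym is 2-6 capital letters in parentheses,
# directly preceded by its candidate expansion. The paper's rules may differ.
CANDIDATE = re.compile(r"\(([A-Z]{2,6})\)")


def collect_pairs(text):
    """Return (acronym, candidate_expansion) pairs found in text."""
    pairs = []
    for match in CANDIDATE.finditer(text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        # Take as many preceding words as the acronym has letters.
        expansion = " ".join(words[-len(acronym):])
        pairs.append((acronym, expansion))
    return pairs


def initials_match(acronym, expansion):
    """Hypothetical feature: 1.0 if the word initials spell the acronym."""
    initials = "".join(word[0].upper() for word in expansion.split())
    return 1.0 if initials == acronym else 0.0


def precision_recall_f1(gold, predicted):
    """Standard precision, recall, and F-measure for binary labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


sentence = "Kasus itu ditangani Komisi Pemberantasan Korupsi (KPK) sejak lama."
pairs = collect_pairs(sentence)
features = [initials_match(acronym, expansion) for acronym, expansion in pairs]
```

In a full pipeline, each pair would be mapped to a vector of numerical features and passed to a classifier such as K-NN or an SVM with a polynomial kernel; the single feature here only shows the shape of that step.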

Funder

Kementerian Riset Teknologi Dan Pendidikan Tinggi Republik Indonesia

Publisher

MDPI AG

Subject

Information Systems

Link

https://www.mdpi.com/2078-2489/11/4/210/pdf

Cited by 3 articles.

1. Disambiguation of medical abbreviations for knowledge organization;Information Processing & Management;2023-09

2. How to generate data for acronym detection and expansion;Advances in Computational Intelligence;2022-04

3. Mining the web to discover acronym‐definitions based on sequence labeling and iterative query expansion model;Concurrency and Computation: Practice and Experience;2021-03-31