Semantic Positioning Model Incorporating BERT/RoBERTa and Fuzzy Theory Achieves More Nuanced Japanese Adverb Clustering

Author:

Odle Eric1ORCID,Hsueh Yun-Ju2,Lin Pei-Chun3ORCID

Affiliation:

1. Department of Natural History Sciences, Graduate School of Science, Hokkaido University, Sapporo 060-0810, Japan

2. Department of Foreign Languages and Applied Linguistics, Yuan Ze University, Taoyuan City 320315, Taiwan

3. Department of Information Engineering and Computer Science, Feng Chia University, Taichung City 407102, Taiwan

Abstract

Japanese adverbs are difficult to classify, with little progress made since the 1930s. Now in the age of large language models, linguists need a framework for lexical grouping that incorporates quantitative, evidence-based relationships rather than purely theoretical categorization. We herein address this need for the case of Japanese adverbs by developing a semantic positioning approach that incorporates large language model embeddings with fuzzy set theory to achieve empirical Japanese adverb groupings. To perform semantic positioning, we (i) obtained multi-dimensional embeddings for a list of Japanese adverbs using a BERT or RoBERTa model pre-trained on Japanese text, (ii) reduced the dimensionality of each embedding by principle component analysis (PCA), (iii) mapped the relative position of each adverb in a 3D plot using K-means clustering with an initial cluster count of n=3, (iv) performed silhouette analysis to determine the optimal cluster count, (v) performed PCA and K-means clustering on the adverb embeddings again to generate 2D semantic position plots, then finally (vi) generated a centroid distance matrix. Fuzzy set theory informs our workflow at the embedding step, where the meanings of words are treated as quantifiable vague data. Our results suggest that Japanese adverbs optimally cluster into n=4 rather than n=3 groups following silhouette analysis. We also observe a lack of consistency between adverb semantic positions and conventional classification. Ultimately, 3D/2D semantic position plots and centroid distance matrices were simple to generate and did not require special hardware. Our novel approach offers advantages over conventional adverb classification, including an intuitive visualization of semantic relationships in the form of semantic position plots, as well as a quantitative clustering “fingerprint” for Japanese adverbs that express vague language data as a centroid distance matrix.

Funder

Ministry of Education, R.O.C.

MOE Teaching Practice Research Program

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering,Computer Networks and Communications,Hardware and Architecture,Signal Processing,Control and Systems Engineering

Reference59 articles.

1. Advances in natural language processing;Hirschberg;Science,2015

2. Robust natural language processing: Recent advances, challenges, and future directions;Omar;IEEE Access,2022

3. Machine learning: Trends, perspectives, and prospects;Jordan;Science,2015

4. Machine learning algorithms—A review;Mahesh;Int. J. Sci. Res. IJSR,2020

5. Explaining deep neural networks and beyond: A review of methods and applications;Samek;Proc. IEEE,2021

Cited by 1 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

1. Clinical Text Analysis with Natural Language Processing: A BERT-based Approach;2024 International Conference on Communication, Computer Sciences and Engineering (IC3SE);2024-05-09

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3