Shapley Idioms: Analysing BERT Sentence Embeddings for General Idiom Token Identification

Authors:

Nedumpozhimana Vasudevan, Klubička Filip, Kelleher John D.

Abstract

This article examines the basis of Natural Language Understanding in transformer-based language models such as BERT, through a case study on idiom token classification. We use idiom token identification as the basis for our analysis because of the variety of information types previously explored in the literature for this task, including topic, lexical, and syntactic features. This variety of relevant information types means that idiom token identification lets us explore the forms of linguistic information that a BERT language model captures and encodes in its representations. The core of the article presents three experiments. The first analyzes the effectiveness of BERT sentence embeddings for building a general idiom token identification model; the results indicate that BERT sentence embeddings outperform Skip-Thought embeddings. In the second and third experiments, we use the game-theoretic concept of Shapley values to rank the usefulness of individual idiomatic expressions for model training, and use this ranking to analyze the type of information the model finds useful. We find that a combination of idiom-intrinsic and topic-based properties contributes to an expression's usefulness for idiom token identification. Overall, our results indicate that BERT efficiently encodes a variety of information types, ranging from topical through lexical and syntactic information. Based on these results, we argue that, notwithstanding recent criticisms of language-model-based semantics, BERT's ability to efficiently encode a variety of linguistic information types represents a significant step forward in natural language understanding.
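As background for the second and third experiments, the Shapley value of a player i under a characteristic function v is phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr). The sketch below is not the authors' implementation; it only illustrates the general idea under stated assumptions. Each idiomatic expression is treated as a player, a coalition's payoff is the held-out accuracy of a classifier trained on that coalition's sentence vectors, and Shapley values are estimated by Monte-Carlo sampling of player orderings. The expression names, data shapes, random stand-in vectors, and logistic-regression classifier are illustrative assumptions; real use would replace the random vectors with BERT sentence embeddings of labelled sentences.

```python
# Minimal sketch (assumptions noted above), not the authors' implementation:
# Monte-Carlo Shapley estimates where each "player" is an idiomatic expression
# and a coalition's payoff is the held-out accuracy of a classifier trained on
# that coalition's sentence vectors. Random vectors stand in for BERT embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
DIM = 768  # dimensionality of a BERT sentence embedding

# Hypothetical training data: per expression, (sentence vectors, idiomatic=1 / literal=0).
expressions = ["kick the bucket", "spill the beans", "pull strings", "hit the road"]
train = {e: (rng.normal(size=(40, DIM)), rng.integers(0, 2, size=40)) for e in expressions}
X_test, y_test = rng.normal(size=(60, DIM)), rng.integers(0, 2, size=60)  # held-out set

def payoff(coalition):
    """Accuracy of a model trained only on the coalition's expressions."""
    if not coalition:
        return 0.5  # chance level when no training data is available
    X = np.vstack([train[e][0] for e in coalition])
    y = np.concatenate([train[e][1] for e in coalition])
    if len(set(y)) < 2:
        return 0.5  # the classifier needs both classes to train
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return accuracy_score(y_test, model.predict(X_test))

def shapley_estimates(n_permutations=50):
    """Average marginal gain of adding each expression over random player orderings."""
    values = {e: 0.0 for e in expressions}
    for _ in range(n_permutations):
        coalition, prev = [], payoff([])
        for e in rng.permutation(expressions):
            coalition.append(str(e))
            current = payoff(coalition)
            values[str(e)] += current - prev
            prev = current
    return {e: v / n_permutations for e, v in values.items()}

# Rank expressions by their estimated usefulness for idiom token identification.
print(sorted(shapley_estimates().items(), key=lambda kv: -kv[1]))
```

With permutation sampling, each estimate converges to the exact Shapley value as the number of sampled orderings grows; a few hundred permutations typically give a stable ranking without enumerating all 2^|N| coalitions.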

Funder

Dublin Institute of Technology

Publisher

Frontiers Media SA

Subject

General Medicine


Cited by 3 articles.

1. Semantics of Multiword Expressions in Transformer-Based Models: A Survey. Transactions of the Association for Computational Linguistics, 2024.

2. Local or Global: The Variation in the Encoding of Style Across Sentiment and Formality. Artificial Neural Networks and Machine Learning – ICANN 2023, 2023.

3. Getting BART to Ride the Idiomatic Train: Learning to Represent Idiomatic Expressions. Transactions of the Association for Computational Linguistics, 2022.
