Affiliation:
1. Department of Linguistics, The Ohio State University, USA. oh.531@osu.edu
2. Department of Linguistics, The Ohio State University, USA. schuler.77@osu.edu
Abstract
This work presents a linguistic analysis of why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pre-trained language models to study human language processing.
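The analysis described above rests on two standard ingredients: per-word surprisal, defined as the negative log probability a language model assigns to a word given its context, and a regression of human reading times on that surprisal. The sketch below (not the authors' code; probabilities and reading times are made up for illustration) shows the basic computation with an ordinary-least-squares fit in place of the paper's mixed-effects regression.

```python
import math

# Surprisal of word w_t under a language model:
#   surprisal(w_t) = -log2 P(w_t | w_1 .. w_{t-1})
# The conditional probabilities here are invented for illustration,
# not produced by any actual model.
probs = [0.25, 0.05, 0.5, 0.01, 0.1]
surprisal = [-math.log2(p) for p in probs]

# Fit to reading times is assessed by regressing reading times on surprisal.
# The paper uses linear mixed-effects models (lme4); a plain OLS slope on
# synthetic per-word reading times (ms) illustrates the idea.
rt = [210.0, 260.0, 195.0, 330.0, 240.0]
n = len(rt)
mean_s = sum(surprisal) / n
mean_rt = sum(rt) / n
slope = (
    sum((s - mean_s) * (y - mean_rt) for s, y in zip(surprisal, rt))
    / sum((s - mean_s) ** 2 for s in surprisal)
)
intercept = mean_rt - slope * mean_s
# A positive slope reflects the standard finding: higher surprisal,
# longer reading time.
```

In the paper's setting, the quantity compared across model variants is not the slope itself but the improvement in regression log-likelihood that surprisal contributes over a baseline, evaluated per model.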
Subject
Artificial Intelligence, Computer Science Applications, Linguistics and Language, Human-Computer Interaction, Communication
References (52 articles)
1. Arehalli (2022). Syntactic surprisal from neural models predicts, but underestimates, human processing difficulty from syntactic ambiguities.
2. Aurnhammer (2019). Comparing gated and simple recurrent neural network architectures as models of human sentence processing.
3. Bates (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software.
4. Black (2022). GPT-NeoX-20B: An open-source autoregressive language model.
5. Black (2021). GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow. Zenodo.
Cited by: 10 articles.