Exploration–Exploitation Mechanisms in Recurrent Neural Networks and Human Learners in Restless Bandit Problems-Reference-Cited by-同舟云学术

Exploration–Exploitation Mechanisms in Recurrent Neural Networks and Human Learners in Restless Bandit Problems

Published:2024-05-24 Issue:3 Volume:7 Page:314-356
ISSN:2522-0861
Container-title:Computational Brain & Behavior
language:en
Short-container-title:Comput Brain Behav

Author:

Tuzsus D.,Brands A.,Pappas I.,Peters J.

Abstract

AbstractA key feature of animal and human decision-making is to balance the exploration of unknown options for information gain (directed exploration) versus selecting known options for immediate reward (exploitation), which is often examined using restless bandit tasks. Recurrent neural network models (RNNs) have recently gained traction in both human and systems neuroscience work on reinforcement learning, due to their ability to show meta-learning of task domains. Here we comprehensively compared the performance of a range of RNN architectures as well as human learners on restless four-armed bandit problems. The best-performing architecture (LSTM network with computation noise) exhibited human-level performance. Computational modeling of behavior first revealed that both human and RNN behavioral data contain signatures of higher-order perseveration, i.e., perseveration beyond the last trial, but this effect was more pronounced in RNNs. In contrast, human learners, but not RNNs, exhibited a positive effect of uncertainty on choice probability (directed exploration). RNN hidden unit dynamics revealed that exploratory choices were associated with a disruption of choice predictive signals during states of low state value, resembling a win-stay-loose-shift strategy, and resonating with previous single unit recording findings in monkey prefrontal cortex. Our results highlight both similarities and differences between exploration behavior as it emerges in meta-learning RNNs, and computational mechanisms identified in cognitive and systems neuroscience work.

Funder

Deutsche Forschungsgemeinschaft

Universität zu Köln

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s42113-024-00202-y.pdf

Reference130 articles.

1. Agrawal, S., & Goyal, N. (2012). Analysis of Thompson Sampling for the multi-armed bandit problem (arXiv:1111.1797). arXiv. https://doi.org/10.48550/arXiv.1111.1797

2. An, G. (1996). The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3), 643–674. https://doi.org/10.1162/neco.1996.8.3.643

3. Apergis-Schoute, A., & Ip, H. Y. S. (2020). Reversal Learning in Obsessive Compulsive Disorder: Uncertainty, Punishment. Serotonin and Perseveration. Biological Psychiatry, 87(9), S125–S126. https://doi.org/10.1016/j.biopsych.2020.02.339

4. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2), 235–256. https://doi.org/10.1023/A:1013689704352

5. Badre, D., Doll, B. B., Long, N. M., & Frank, M. J. (2012). Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron, 73(3), 595–607. https://doi.org/10.1016/j.neuron.2011.12.025