Author:
Ariel Goldstein, Haocheng Wang, Leonard Niekerken, Zaid Zada, Bobbi Aubrey, Tom Sheffer, Samuel A. Nastase, Harshvardhan Gazula, Mariano Schain, Aditi Singh, Aditi Rao, Gina Choe, Catherine Kim, Werner Doyle, Daniel Friedman, Sasha Devore, Patricia Dugan, Avinatan Hassidim, Michael Brenner, Yossi Matias, Orrin Devinsky, Adeen Flinker, Uri Hasson
Abstract
Humans effortlessly use the continuous acoustics of speech to communicate rich linguistic meaning during everyday conversations. In this study, we leverage 100 hours (half a million words) of spontaneous open-ended conversations and concurrent high-quality neural activity recorded using electrocorticography (ECoG) to decipher the neural basis of real-world speech production and comprehension. Employing a deep multimodal speech-to-text model named Whisper, we develop encoding models capable of accurately predicting neural responses to both acoustic and semantic aspects of speech. Our encoding models achieved high accuracy in predicting neural responses for hundreds of thousands of words across many hours of left-out recordings. We uncover a distributed cortical hierarchy for speech and language processing, with sensory and motor regions encoding acoustic features of speech and higher-level language areas encoding syntactic and semantic information. Many electrodes, including those in both perceptual and motor areas, display mixed selectivity for both speech and linguistic features. Notably, our encoding model reveals a temporal progression from language-to-speech encoding before word onset during speech production and from speech-to-language encoding following word articulation during speech comprehension. This study offers a comprehensive account of the unfolding neural responses during fully natural, unbounded daily conversations. By leveraging a multimodal deep speech recognition model, we highlight the power of deep learning for unraveling the neural mechanisms of language processing in real-world contexts.
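The abstract describes a two-stage encoding approach: extract embeddings from Whisper for segments of the conversation, then fit a model that maps those embeddings to each electrode's neural activity and evaluate it on left-out recordings. The sketch below illustrates that general idea only; the audio segments, the per-electrode response values, the mean-pooled Whisper-tiny encoder embedding, and the choice of a cross-validated ridge regression are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
import torch
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from transformers import WhisperProcessor, WhisperModel

# Hypothetical inputs: one short audio snippet per word (16 kHz mono) and the
# corresponding neural response (e.g., mean high-gamma power) for one electrode.
audio_segments = [np.random.randn(16000 * 2).astype(np.float32) for _ in range(200)]
electrode_response = np.random.randn(200)

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
whisper = WhisperModel.from_pretrained("openai/whisper-tiny").eval()

def acoustic_embedding(waveform: np.ndarray) -> np.ndarray:
    """Mean-pooled Whisper encoder states as a fixed-length acoustic embedding."""
    features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features
    with torch.no_grad():
        hidden = whisper.encoder(features).last_hidden_state  # (1, frames, d_model)
    return hidden.mean(dim=1).squeeze(0).numpy()

X = np.stack([acoustic_embedding(w) for w in audio_segments])
y = electrode_response

# Linear encoding model: ridge regression from embeddings to neural activity,
# scored by the correlation between predicted and observed responses on
# held-out data (an "encoding accuracy" computed per electrode).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 6, 9)).fit(X_train, y_train)
r = np.corrcoef(model.predict(X_test), y_test)[0, 1]
print(f"held-out encoding correlation: {r:.3f}")
```

The same recipe extends to the linguistic side by swapping the encoder embedding for a Whisper decoder (language) embedding and to temporal analyses by shifting the neural window relative to word onset; both are extensions beyond this minimal sketch.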
Publisher
Cold Spring Harbor Laboratory