1. Attention bottlenecks for multimodal fusion;Nagrani,2021
2. What makes training multi-modal classification networks hard?;Wang,2020
3. MM-ViT: Multi-modal video transformer for compressed video action recognition;Chen,2021
4. Gated multimodal units for information fusion;Arevalo,2019
5. Scaling egocentric vision: The EPIC-KITCHENS dataset;Damen,2018