Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection
Author:
Suzuki Tomoyuki, Aoki Yoshimitsu
Abstract
Recently, Transformer-based video recognition models have achieved state-of-the-art results on major video recognition benchmarks. However, their high inference cost significantly limits research speed and practical use. In video compression, methods such as MPEG-4 reduce the redundancy of videos by treating small motions and residuals as less informative and assigning them short code lengths. Inspired by this idea, we propose Informative Patch Selection (IPS), which reduces the inference cost by excluding redundant patches from the input of a Transformer-based video model. The redundancy of each patch is calculated from the motions and residuals obtained while decoding a compressed video. The proposed method is simple and effective: it dynamically reduces the inference cost depending on the input, without any policy model or additional loss term. Extensive experiments on action recognition show that our method significantly improves the trade-off between the accuracy and the inference cost of the Transformer-based video model. Despite requiring neither a policy model nor an additional loss term, its performance approaches that of existing methods that do require them.
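The patch-selection idea can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (mean motion magnitude plus a weighted mean absolute residual), the function names `patch_scores` and `select_patches`, and the parameters `patch`, `alpha`, and `threshold` are all assumptions made for illustration. The abstract states only that redundancy is computed from motions and residuals obtained during decoding.

```python
import numpy as np

def patch_scores(motion, residual, patch=16, alpha=1.0):
    """Hypothetical per-patch informativeness score (an assumption, not the paper's formula).

    motion:   (H, W, 2) array of per-pixel motion vectors from the decoder.
    residual: (H, W, C) array of per-pixel residuals from the decoder.
    Returns a (H // patch, W // patch) grid of scores.
    """
    h, w = motion.shape[:2]
    gh, gw = h // patch, w // patch
    # Per-pixel magnitudes, cropped to a whole number of patches.
    mot = np.linalg.norm(motion.astype(np.float32), axis=-1)[:gh * patch, :gw * patch]
    res = np.abs(residual.astype(np.float32)).mean(axis=-1)[:gh * patch, :gw * patch]
    per_pixel = mot + alpha * res
    # Average within each patch to get one score per patch.
    return per_pixel.reshape(gh, patch, gw, patch).mean(axis=(1, 3))

def select_patches(motion, residual, patch=16, alpha=1.0, threshold=0.5):
    """Return flat (row-major) indices of patches whose score exceeds the threshold.

    The number of kept patches varies with the input, so static scenes
    pass fewer tokens to the Transformer than scenes with large motion.
    """
    scores = patch_scores(motion, residual, patch=patch, alpha=alpha)
    return np.flatnonzero(scores.ravel() > threshold)
```

Because selection uses a fixed threshold rather than a fixed count, the number of tokens kept depends on the input, which matches the input-dependent cost reduction described above. Only the kept indices would then be used to gather patch embeddings before the Transformer layers.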
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry