1. J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization, 2016. arXiv preprint arXiv:1607.06450.
2. Caba Heilbron, ActivityNet: a large-scale video benchmark for human activity understanding, 2015.
3. Chen, Temporal deformable convolutional encoder-decoder networks for video captioning, 2019.
4. He, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
5. Kim, Bilinear attention networks, 2018.