1. Aditya, S., Yang, Y., & Baral, C. (2018). Explicit reasoning over end-to-end neural architectures for visual question answering. In Thirty-second AAAI conference on artificial intelligence.
2. Aliakbarian, M. S., Saleh, F. S., Salzmann, M., Fernando, B., Petersson, L., & Andersson, L. (2017). Encouraging LSTMs to anticipate actions very early. In Proceedings of the IEEE international conference on computer vision.
3. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition.
4. Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Eighteenth ACM-SIAM symposium on discrete algorithms.
5. Bhoi, A. (2019). Spatio-temporal action recognition: A survey. arXiv preprint arXiv:1901.09403.