1. Wang (2021). Deep learning model for house price prediction using heterogeneous data analysis along with joint self-attention mechanism. IEEE Access.
2. Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. (2019, October 27–November 2). VideoBERT: A joint model for video and language representation learning. Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea.
3. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., and Chang, K.W. (2019). VisualBERT: A simple and performant baseline for vision and language. arXiv.
4. Ma, P., Mira, R., Petridis, S., Schuller, B.W., and Pantic, M. (2021). LiRA: Learning visual speech representations from audio through self-supervision. arXiv.
5. Shi, B., Hsu, W.N., Lakhotia, K., and Mohamed, A. (2022). Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv.