1. End-to-End Object Detection with Transformers
2. Real-time referring expression comprehension by single-stage grounding network;Chen Xinpeng;arXiv preprint arXiv:1812.03426,2018
3. Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. 2018. Using syntax to ground referring expressions in natural images. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
4. Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. 2021. TransVG: End-to-end visual grounding with transformers. In Proceedings of the IEEE International Conference on Computer Vision. 1769–1779.
5. BERT: Pre-training of deep bidirectional transformers for language understanding;Devlin Jacob;arXiv:1810.04805,2018