Abstract
This paper presents a machine learning-based classifier for detecting points of interest through the combined use of images and text from social networks. The model exploits the transfer learning capabilities of the CLIP (Contrastive Language-Image Pre-training) neural network architecture in multimodal environments using image and text. Different methodologies based on multimodal information are explored for the geolocation of the places detected. To this end, pre-trained neural network models are used for the classification of images and their associated texts. The result is a system that creates new synergies between images and texts in order to detect and geolocate trending places that have not been previously tagged by any other means, providing potentially relevant information for tasks such as cataloging specific types of places in a city for the tourism industry. The experiments carried out reveal that, in general, textual information is more accurate and relevant than visual cues in this multimodal setting.
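The CLIP-based approach the abstract describes scores an image against a set of candidate text prompts by cosine similarity of their embeddings. A minimal sketch of that scoring step is shown below; the `classify_poi` function, the place labels, and the toy vectors are illustrative assumptions standing in for real CLIP encoder outputs, which are omitted here.

```python
import numpy as np

def classify_poi(image_emb, text_embs, labels, temperature=100.0):
    """CLIP-style zero-shot scoring: compare one image embedding against
    several text-prompt embeddings via cosine similarity, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)      # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax over candidate labels
    return labels[int(np.argmax(probs))], probs

# Toy embeddings standing in for CLIP encoder outputs (assumed, not real weights).
rng = np.random.default_rng(0)
labels = ["restaurant", "museum", "beach"]
text_embs = rng.normal(size=(3, 8))
# Simulate an image whose embedding lies close to the "museum" prompt.
image_emb = text_embs[1] + 0.05 * rng.normal(size=8)

best, probs = classify_poi(image_emb, text_embs, labels)
```

In a real pipeline the embeddings would come from CLIP's image and text encoders; the scoring and softmax step is the same.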
Funder
Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital, Generalitat Valenciana
European Regional Development Fund
Universidad de Alicante
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications, Hardware and Architecture, Media Technology, Software
Cited by
5 articles.