Abstract
The complementary nature of visual and textual information in conveying meaning is widely recognized, for example in entertainment, news, advertising, science, and education. While the complex interplay of image and text in forming semantic meaning has been studied in linguistics and communication sciences for several decades, computer vision and multimedia research have largely remained on the surface of the problem. An exception is previous work that introduced the two metrics Cross-Modal Mutual Information and Semantic Correlation to model complex image-text relations. In this paper, we motivate the necessity of an additional metric, called Status, to cover complex image-text relations more completely. This set of metrics enables us to derive a novel categorization of eight semantic image-text classes along three dimensions. In addition, we demonstrate how to automatically gather and augment a dataset for these classes from the Web. Furthermore, we present a deep learning system to automatically predict each of the three metrics, as well as a system to directly predict the eight image-text classes. Experimental results show the feasibility of the approach, with the direct prediction of all eight classes outperforming the cascaded approach built from the individual metric classifiers.
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences, Media Technology, Information Systems
Cited by
17 articles.