Affiliation:
1. University of Reading, UK
Abstract
Digitalised multimedia information today is typically represented in multiple modalities and distributed through various channels. Making effective use of such a huge volume of data depends heavily on efficient cross-modal labelling, indexing and retrieval of multimodal information. In this chapter, we focus on combining the primary and collateral modalities of an information resource in an intelligent and effective way, in order to provide better multimodal information understanding, classification, labelling and retrieval. Image and text are the two modalities considered here. A novel framework for semantic-based, collaterally cued image labelling is proposed and implemented, which automatically assigns linguistic keywords to regions of interest in an image. A visual vocabulary is constructed from manually labelled image segments, and Euclidean distance and Gaussian distribution are used to map low-level region-based image features to the high-level visual concepts defined in the visual vocabulary. Both collateral content and collateral context knowledge are extracted from the textual modality to bias the mapping process. A semantic-based high-level image feature vector model is then built from the labelling results; image retrieval using this model outperforms both content-based and text-based approaches in its ability to combine the perceptual and conceptual similarity of image content.
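The core mapping step described above — scoring a region's low-level features against Gaussian models of the visual concepts, with the collateral text biasing the choice — can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the vocabulary entries, feature values and text-derived prior weights are all hypothetical.

```python
import math

# Hypothetical visual vocabulary: each concept is modelled by the mean and
# per-dimension standard deviation of the features of its manually labelled
# example segments (concept names and numbers are illustrative only).
vocabulary = {
    "sky":   {"mean": [0.2, 0.8], "std": [0.1, 0.1]},
    "grass": {"mean": [0.6, 0.3], "std": [0.1, 0.1]},
}

def gaussian_score(region, concept):
    """Unnormalised Gaussian likelihood of a region's feature vector
    under a concept's per-dimension Gaussian model."""
    score = 1.0
    for x, mu, sigma in zip(region, concept["mean"], concept["std"]):
        score *= math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return score

def label_region(region, vocabulary, text_prior):
    """Assign the concept that maximises the Gaussian likelihood weighted
    by a prior derived from the collateral text (e.g. keyword frequencies);
    unseen concepts get a small default weight."""
    best_name, best_score = None, -1.0
    for name, concept in vocabulary.items():
        s = gaussian_score(region, concept) * text_prior.get(name, 0.1)
        if s > best_score:
            best_name, best_score = name, s
    return best_name

# A region whose features sit near the "sky" model, with the collateral
# text mentioning "sky" more often than "grass":
print(label_region([0.25, 0.75], vocabulary, {"sky": 0.6, "grass": 0.4}))
# → sky
```

The text-derived prior is what makes the labelling "collaterally cued": a visually ambiguous region is nudged towards the concept that the accompanying text supports.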