Authors:
Javed Hira, Akhtar Nadeem, Beg M. M. Sufyan
Abstract
With the increase in multimedia content, the domain of multimodal processing is experiencing constant growth, raising the question of whether combining these modalities is beneficial. In this work, we investigate this question by processing multimodal content to obtain quality summaries. We conduct several experiments on extractive summarization employing asynchronous text, audio, image, and video. Information present in the multimedia content is leveraged to bridge the semantic gaps between the different modalities. Vision Transformers and BERT are used for the image-matching and similarity-checking tasks, and audio transcriptions are used to incorporate the audio information into the summaries. The resulting news summaries are evaluated with the ROUGE score, and a comparative analysis is presented.