Caption Generation Based on Emotions Using CSPDenseNet and BiLSTM with Self-Attention

Published:2022-09-17 Issue: Volume:2022 Page:1-13
ISSN:1687-9732
Container-title:Applied Computational Intelligence and Soft Computing
language:en
Short-container-title:Applied Computational Intelligence and Soft Computing

Author:

S Kavi Priya¹, K Pon Karthika¹, Kaliappan Jayakumar², Selvaraj Senthil Kumaran³, R Nagalakshmi⁴, Molla Baye⁵

Affiliation:

1. Department of Computer Science and Engineering, Mepco Schlenk Engineering College (Autonomous), Sivakasi 626005, Tamil Nadu, India

2. Department of Analytics, School of Computer Science and Engineering, Vellore Institute of Technology (VIT), Vellore 632014, Tamil Nadu, India

3. Department of Manufacturing Engineering, School of Mechanical Engineering (SMEC), Vellore Institute of Technology (VIT), Vellore 632014, Tamil Nadu, India

4. Department of Computer Science and Engineering, Faculty of Engineering and Technology, Kalinga University, Raipur, Chhattisgarh, India

5. School of Mechanical Engineering, Engineering and Technology College, Dilla University, P.O.Box. 419, Dilla, Ethiopia

Abstract

Automatic image caption generation is the intricate task of describing an image in natural language by drawing on the information present in the image. Incorporating facial expressions into a conventional image captioning system opens new prospects for generating pertinent descriptions that reveal the emotional aspects of the image. The proposed work incorporates facial emotional features to produce more expressive captions, similar to human-annotated ones, with the help of a Cross Stage Partial Dense Network (CSPDenseNet) and a self-attentive Bidirectional Long Short-Term Memory (BiLSTM) network. The encoding unit captures the facial expressions and dense image features using a Facial Expression Recognition (FER) model and a CSPDense neural network, respectively. Further, the word embedding vectors of the ground truth image captions are created and learned using the Word2Vec embedding technique. Then, the extracted image feature vectors and word vectors are fused to form an encoding vector representing the rich image content. The decoding unit employs a self-attention mechanism combined with a BiLSTM to create more descriptive and relevant captions in natural language. The Flickr11k dataset, a subset of the Flickr30k dataset, is used to train, test, and evaluate the present model on five benchmark image captioning metrics: BiLingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Consensus-based Image Description Evaluation (CIDEr), and Semantic Propositional Image Caption Evaluation (SPICE). The experimental analysis indicates that the proposed model, using added emotional characteristics and behavioral components of the objects present in the image, enhances the quality of captions, achieving scores of 0.6012 (BLEU-1), 0.3992 (BLEU-2), 0.2703 (BLEU-3), 0.1921 (BLEU-4), 0.1932 (METEOR), 0.2617 (CIDEr), 0.4793 (ROUGE-L), and 0.1260 (SPICE).
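To make the BLEU-1 through BLEU-4 figures in the abstract easier to interpret, the following is a minimal pure-Python sketch of sentence-level BLEU (clipped n-gram precision, geometric mean, brevity penalty). It is not the authors' evaluation code: the function names are illustrative, no smoothing is applied, and reported scores of this kind are usually computed at corpus level with standard tooling, which is assumed rather than confirmed here. The example captions are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the single reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter reference.
    r = min((len(ref) for ref in references), key=lambda n: (abs(n - c), n))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Hypothetical captions, for illustration only.
candidate = "a happy girl is smiling at the camera".split()
references = ["a happy girl smiles at the camera".split()]
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(candidate, references, n):.4f}")
```

Because each higher-order score multiplies in a stricter n-gram precision, BLEU-n generally falls as n grows, which matches the descending pattern of the BLEU-1 to BLEU-4 values reported above.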

Publisher

Hindawi Limited

Subject

Artificial Intelligence, Computer Networks and Communications, Computer Science Applications, Civil and Structural Engineering, Computational Mechanics

Link

http://downloads.hindawi.com/journals/acisc/2022/2756396.pdf

References: 31 articles.

1. IoT based automation of real time in-pipe contamination detection system in drinking water;S. K. Priya

2. Heuristic routing with bandwidth and energy constraints in sensor networks;S. K. Priya;Applied Soft Computing,2015

3. A CNN-LSTM network with multi-level feature extraction-based approach for automated detection of coronavirus from CT scan and X-ray images

4. Multiple features based approach for automatic fake news detection on social networks using deep learning

5. Sarcasm detection in mash-up language using soft-attention based bi-directional LSTM and feature-rich CNN

Cited by 3 articles.

1. AMPS: Predicting popularity of short-form videos using multi-modal attention mechanisms in social media marketing environments;Journal of Retailing and Consumer Services;2024-05

2. Enhanced Image Captioning Using Bahdanau Attention Mechanism and Heuristic Beam Search Algorithm;IEEE Access;2024

3. Cross-modal representation learning and generation;Journal of Image and Graphics;2023