Affiliation:
1. Department of Computer Engineering, Başkent University, Ankara, Turkey
Abstract
Generating audio captions is an emerging research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder–decoder-based models without considering semantic information. To fill this gap, we present a novel encoder–decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embeddings by obtaining subjects and verbs from the audio clip captions and combine these embeddings with audio embeddings to feed the BiGRU-based encoder–decoder model. To obtain semantic embeddings for test audio clips, we introduce a Multilayer Perceptron classifier that predicts them. We also present exhaustive experiments to show the efficiency of different features and datasets for our proposed model on the audio captioning task. To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings. Extensive experiments on two audio captioning datasets, Clotho and AudioCaps, show that our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics and that using semantic information improves captioning performance.
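As a rough illustration of the architecture the abstract describes, the following PyTorch sketch shows a BiGRU encoder over concatenated audio and semantic embeddings feeding a GRU decoder, plus an MLP that estimates a test clip's semantic embedding from pooled audio features. This is a minimal sketch, not the authors' implementation: the class names, the dimensions (AUDIO_DIM, SEM_DIM, HIDDEN, VOCAB), and the regression-style MLP output are all assumptions made for illustration.

    # Hypothetical sketch of the described model; dimensions and names are assumptions.
    import torch
    import torch.nn as nn

    AUDIO_DIM = 128   # e.g. VGGish embedding size (assumption)
    SEM_DIM = 300     # subject/verb semantic embedding size (assumption)
    HIDDEN = 256
    VOCAB = 5000

    class BiGRUCaptioner(nn.Module):
        def __init__(self):
            super().__init__()
            # BiGRU encoder over concatenated audio + semantic features
            self.encoder = nn.GRU(AUDIO_DIM + SEM_DIM, HIDDEN,
                                  batch_first=True, bidirectional=True)
            self.embed = nn.Embedding(VOCAB, HIDDEN)
            # Unidirectional GRU decoder initialized from the encoder state
            self.decoder = nn.GRU(HIDDEN, 2 * HIDDEN, batch_first=True)
            self.out = nn.Linear(2 * HIDDEN, VOCAB)

        def forward(self, audio_feats, sem_emb, captions):
            # audio_feats: (B, T, AUDIO_DIM); sem_emb: (B, SEM_DIM); captions: (B, L)
            sem = sem_emb.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
            x = torch.cat([audio_feats, sem], dim=-1)
            _, h = self.encoder(x)                              # h: (2, B, HIDDEN)
            h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)   # merge directions
            y, _ = self.decoder(self.embed(captions), h0)
            return self.out(y)                                  # (B, L, VOCAB) logits

    class SemanticPredictor(nn.Module):
        # MLP that predicts a test clip's semantic embedding from its
        # time-pooled audio features (the paper calls this a classifier;
        # a regression head is used here purely as a placeholder).
        def __init__(self):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(AUDIO_DIM, HIDDEN), nn.ReLU(),
                nn.Linear(HIDDEN, SEM_DIM))

        def forward(self, audio_feats):
            return self.mlp(audio_feats.mean(dim=1))  # mean-pool over time

At test time, where ground-truth subjects and verbs are unavailable, SemanticPredictor would supply the sem_emb input that BiGRUCaptioner expects, mirroring the two-stage pipeline the abstract outlines.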
Publisher
World Scientific Pub Co Pte Lt
Subject
Artificial Intelligence, Computer Networks and Communications, Computer Science Applications, Linguistics and Language, Information Systems, Software
Cited by
1 article.