MHAiR: A Dataset of Audio-Image Representations for Multimodal Human Actions

Published:2024-01-25 Issue:2 Volume:9 Page:21
ISSN:2306-5729
Container-title:Data
language:en
Short-container-title:Data

Author:

Shaikh Muhammad Bilal¹, Chai Douglas¹, Islam Syed Mohammed Shamsul², Akhtar Naveed³

Affiliation:

1. School of Engineering, Edith Cowan University, 270 Joondalup Drive, Joondalup, Perth, WA 6027, Australia

2. School of Science, Edith Cowan University, 270 Joondalup Drive, Joondalup, Perth, WA 6027, Australia

3. School of Computing and Information Systems, The University of Melbourne, Melbourne Connect, 700 Swanston Street, Carlton, VIC 3053, Australia

Abstract

The audio-image representations for multimodal human actions (MHAiR) dataset contains six different image representations of audio signals that capture the temporal dynamics of actions in a compact and informative way. The dataset was extracted from the audio recordings of an existing video dataset, UCF101. Each data sample is approximately 10 s long, and the overall dataset is split into 4893 training samples and 1944 testing samples. Feature sequences extracted from the audio were converted into images, which can serve as a benchmark for evaluating machine learning models for human action recognition and related tasks. These audio-image representations could be suitable for a wide range of applications, such as surveillance, healthcare monitoring, and robotics. The dataset can also be used for transfer learning, where pre-trained models are fine-tuned on a specific task using specific audio images. Thus, this dataset can facilitate the development of new techniques and approaches for improving the accuracy of human action-related tasks.

Funder

Edith Cowan University (ECU), Australia and Higher Education Commission (HEC), Pakistan

Publisher

MDPI AG

Link

https://www.mdpi.com/2306-5729/9/2/21/pdf

References: 43 articles.

1. Shaikh, M.B., and Chai, D. (2021). RGB-D data-based action recognition: A review. Sensors, 21.

2. Shaikh, M.B., Chai, D., Islam, S.M.S., and Akhtar, N. (2022, December 13–16). MAiVAR: Multimodal Audio-Image and Video Action Recognizer. Proceedings of the International Conference on Visual Communications and Image Processing (VCIP), Suzhou, China.

3. Sudhakaran, S., Escalera, S., and Lanz, O. (2020, June 13–19). Gate-shift networks for video action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.

4. Yang, G., Yang, Y., Lu, Z., Yang, J., Liu, D., Zhou, C., and Fan, Z. (2022). STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video. PLoS ONE, 17.

5. Zhang, K., Li, D., Huang, J., and Chen, Y. (2020). Automated video behavior recognition of pigs using two-stream convolutional networks. Sensors, 20.