Sound Can Help Us See More Clearly

Published:2022-01-13 Issue:2 Volume:22 Page:599
ISSN:1424-8220
Container-title:Sensors
language:en
Short-container-title:Sensors

Author:

Li Yongsheng, Tu Tengfei, Zhang Hua, Li Jishuai, Jin Zhengping, Wen Qiaoyan

Abstract

In video action classification, existing network frameworks often use only video frames as input. When the object involved in an action does not appear prominently in the frame, such a network cannot classify the action accurately. We introduce a new neural network structure that uses sound to assist with these cases. The raw sound wave is converted into a sound texture, which serves as the input to the network. Furthermore, to exploit the rich multimodal information (images and sound) in video, we design and use a two-stream framework. In this work, we hypothesize that sound data can be used to solve action recognition tasks. To demonstrate this, we design a neural network based on sound texture to perform video action classification. We then fuse this network with a deep neural network that operates on consecutive video frames to construct a two-stream network, which we call A-IN. Finally, we compare the proposed A-IN with an image-only network on the Kinetics dataset. The experimental results show that the recognition accuracy of the two-stream model that uses sound features is 7.6% higher than that of the network using only video frames. This shows that making rational use of the rich information in video can improve classification performance.
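
The two-stream idea in the abstract can be illustrated with a minimal late-fusion sketch. The abstract does not state how A-IN combines its two streams, so the weighted averaging of class probabilities shown here is an assumption, not the authors' method. The function names, the `audio_weight` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(frame_logits, audio_logits, audio_weight=0.5):
    """Late fusion (an assumed scheme): weighted average of the class
    probabilities from a frame-based stream and a sound-based stream."""
    if len(frame_logits) != len(audio_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    p_frame = softmax(frame_logits)
    p_audio = softmax(audio_logits)
    return [
        (1.0 - audio_weight) * pf + audio_weight * pa
        for pf, pa in zip(p_frame, p_audio)
    ]


def predict(probs):
    """Return the index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for three action classes. The frame stream is nearly
# undecided between classes 0 and 1; the sound stream clearly favors class 0.
frame_logits = [1.0, 1.1, 0.2]
audio_logits = [3.0, 0.5, 0.1]

frame_only = predict(softmax(frame_logits))
fused = predict(fuse_two_streams(frame_logits, audio_logits))
print("frame-only prediction:", frame_only)
print("fused prediction:", fused)
```

In this toy case the frame stream alone narrowly picks class 1, while the fused scores pick class 0, which mirrors the situation the abstract describes: visual evidence is weak, and sound supplies the missing cue.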

Funder

National Natural Science Foundation of China

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/22/2/599/pdf

Cited by 1 article.

1. Transformer for Skeleton-based action recognition: A review of recent advances;Neurocomputing;2023-06