Development of the Multimodal Handling Interface Based on Google API

Published: 2024, Volume: 6, Issue: 1, Pages: 216-223
ISSN: 2707-6784
Container-title: Computer Design Systems. Theory and Practice
Short-container-title: CDS

Authors: Basystiuk Oleh, Melnykova Nataliya

Abstract

Artificial intelligence has become part of daily life. One of its most popular and rapidly advancing technologies is speech recognition, which forms an integral part of the broader concept of multimodal data handling. Multimodal data encompasses voice, audio, and text data, and handling it requires a multifaceted approach to understanding and processing information. This paper presents the development of a multimodal handling interface built on Google API technologies. The interface aims to integrate and manage diverse data modalities, including text, audio, and video, within a unified platform. By using Google API functionality such as natural language processing, speech recognition, and video analysis, the interface offers enhanced capabilities for processing, analysing, and interpreting multimodal data. The paper discusses the design and implementation of the interface and highlights its features. It also explores potential applications and future directions for the interface in domains including healthcare, education, and multimedia content creation. Overall, the interface represents a step towards advancing multimodal data processing and improving the user experience of interacting with diverse data sources.

Publisher

Lviv Polytechnic National University

References (20 articles)

[1] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3128-3137. https://doi.org/10.1109/CVPR.2015.7298932

[2] Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, and Tan Lee, "Editspeech: A text based speech editing system using partial inference and bidirectional fusion," arXiv preprint arXiv:2107.01554, 2021. https://doi.org/10.1109/ASRU51503.2021.9688051

[3] M. Oncescu, A. S. Koepke, J. F. Henriques, Z. Akata, and S. Albanie, "Audio Retrieval with Natural Language Queries," in Proceedings of Conference of the International Speech Communication Association, 2021, pp. 2411-2415. https://doi.org/10.21437/Interspeech.2021-2227

[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, vol. 1, MIT Press, Cambridge, 2016.

[5] Ivan Izonin et al., "The Combined Use of the Wiener Polynomial and SVM for Material Classification Task in Medical Implants Production," International Journal of Intelligent Systems and Applications (IJISA), vol. 10, no. 9, pp. 40-47, 2018. https://doi.org/10.5815/ijisa.2018.09.05