Analysis of the Microcontroller Resources Using Specifics for Speech Recognition

Author:

Ryzhova Anna Romanivna1ORCID,Onykiienko Yurii Oleksiiovych1ORCID

Affiliation:

1. National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», Ukraine

Abstract

The use of neural networks for information recognition, in particular, voice, expands the functional capabilities of embedded systems on microcontrollers. But it is necessary to take into account the limitations of the microcontroller resources. The purpose of the work is to analyze the impact of voice processing parameters and neural network architecture on the degree of microcontroller resources usage. To do this, a database of samples of the keyword, samples of other words and voices, and samples of noise are created, the probability of recognizing the keyword among other words and noises is evaluated, the dependence of the amount of memory used on the microcontroller and the decision-making time on the number MFC coefficients is established, the dependence of the amount of used memory of the microcontroller and the decision-making time on the type of convolutional neural network is established also. During the experiment, the Arduino Nano 33 BLE Sense development board was used. The neural network model was built and trained on the Edge Impulse software platform. To conduct the experiment, three groups of data with the names "hello", "unknown", "noise" were created. The group "hello" contains 94 examples of the word "hello" in English, spoken by a female voice. The "unknown" group contains 167 examples of other words pronounced by both female and male voices. The "noise" group contains 166 samples of noise and random sounds. According to Edge Impulse's recommendation, 80% of the samples from each of the data groups were used to train the neural network model, and 20% of the samples were used for testing. Analysis of the results shows that with an increase in the number of MFC coefficients and, accordingly, the accuracy of keyword recognition, the amount of program memory occupied by the code increases by 480 bytes (less than 1%). For the nRF52840 microcontroller, this is not a significant increase. The amount of RAM used during the experiment did not change. Although the calculation time of the accuracy of the code word definition increased by only 14 ms (less than 5%) with the increase in the number of MFC coefficients, the calculation procedure is quite long (approximately 0.3 s) compared to the sound sample length of 1 s. This can be a certain limitation when processing a sound signal with 32-bit microcontrollers. To analyze phrases or sentences, it is necessary to use more powerful microcontrollers or microprocessors. Based on the results of experimental research, it can be stated that the computing resources of 32-bit microcontrollers are quite sufficient for recognizing voice commands with the possibility of pre-digital processing of the sound signal, in particular, the use of low-frequency cepstral coefficients. The selection of the number of coefficients does not significantly affect the amount of used FLASH and RAM memory of the nRF52840 microcontroller. The comparison results show the superiority of the 2D network in the accuracy of the keyword definition for both 12 and 13 MFC coefficients. The use of a one-dimensional convolutional neural network for voice sample recognition in the conducted experiment provides memory savings of approximately 5%. The quality of keyword recognition with the number of MFC coefficients of 12 is approximately 0.7. For 17 MFC coefficients, the recognition quality is already 0.97. The amount of RAM used in the case of the 2D network has decreased slightly. Voice sample processing time for both types of networks is practically the same. Thus, 1D convolutional neural networks have certain advantages in microcontroller applications for voice processing and recognition. The limitation of voice recognition on the microcontroller is the sufficiently long processing time of the sound sample (approximately 0.3 s) with the duration of the sample itself being 1 s, this can be explained by a sufficiently low clock frequency of 64 MHz. Increasing the clock frequency will reduce the calculation time.

Publisher

Kyiv Politechnic Institute

Subject

General Medicine

Reference24 articles.

1. Comparison of MFCC and LPCC for a fixed phrase speaker verification system, time complexity and failure analysis;Misra;2015 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2015],2015-03-01

2. Comparison of different implementations of MFCC;Zheng;Journal of Computer Science and Technology,2001

3. Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition;Sahidullah;Speech Communication,2012

4. Hardware–Software Codesign of Automatic Speech Recognition System for Embedded Real-Time Applications;Cheng;IEEE Transactions on Industrial Electronics,2011

5. Amazigh Speech Recognition Embedded System;Barkani;2020 1st International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET),2020-04-01

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3