1. Introduction

An AI-based BMS demands powerful processors and high memory bandwidth; without them, the expected performance levels cannot be achieved. Efficient and powerful hardware is essential for both training and inference. However, the complex nature of the computations often hampers software implementations, resulting in long processing times. To expedite computation, CPUs or GPUs can be considered, but these platforms are costly, power-intensive, and unsuitable for embedded systems. A potential solution lies in dedicated hardware accelerators designed explicitly for the algorithm. Implementing specialized accelerators on embedded platforms such as FPGAs minimizes power consumption while remaining cost-effective. This approach allows the system to learn from experience and improve over time, and it can accommodate the growing size of LSTM network models, which, like other deep learning networks, gain accuracy as they grow. Larger models, however, require more computation than the available CPU can provide. Researchers have therefore explored devices that excel at parallel computation, such as GPUs, to improve LSTM performance; yet the recurrent nature of LSTM limits its parallelization on GPUs, and their high power consumption remains a significant challenge [1].
LSTM, the most widely employed and representative recurrent neural network (RNN) architecture, plays a critical role in applications such as language modelling, machine translation, image captioning, and speech processing. Achieving high parallelism on general-purpose processors such as CPUs and GPUs is difficult because of the recurrent structure of LSTMs, and their power consumption is considerable. FPGA accelerators, in contrast, can outperform CPUs and GPUs thanks to their flexibility, energy efficiency, and ability to optimize each phase of the algorithm more finely [2]. After an LSTM model has been trained on a dataset, its value lies in its capacity to extract meaningful insights from completely new data. Because of the inherent complexity of LSTM models, however, the inference stage often demands substantial computational power and memory to handle real-time workloads. To address these problems, FPGAs are emerging as an ideal alternative: their low power consumption and low latency are advantageous for accelerating and optimizing LSTM algorithms. This paper proposes an FPGA-implemented AI-based accelerator for the electric vehicle battery management system. FPGA-based accelerators have attracted significant interest among researchers owing to their impressive performance, high energy efficiency, and exceptional flexibility. This paper summarizes the LiB SOC prediction results obtained through the PyCharm and Vitis HLS implementations; the FPGA implementation presents a potential chip design for predicting SOC.
2. Related Work & Literature Review In their study, He, D., [3] implemented LSTM on FPGA and demonstrated performance improvements of 8.8 and 2.2 times and energy efficiency improvements of 16.9 and 9.6 times compared to CPU and GPU, respectively., Furthermore, et al.: integrating the LSTM acceleration engine with TensorFlow is proposed as a promising direction for future endeavours. Jia, Y. [4] present the LSTM network to address the challenges associated with computing and energy efficiency. Chang Gao [5] introduced a structured pruning method called CBTD achieved up to 96% weight sparsity while maintaining accuracy in an LSTM network. The accompanying accelerator, Spartus, achieved 1us inference latency and surpassed previous benchmarks with 4X higher throughput and 7X better energy efficiency. Yijin Guan [6] presented an FPGA-based accelerator designed explicitly for LSTM-RNNs to optimize computation performance and communication requirements. The proposed architecture achieves a peak performance of 7.26 GFLOP/S, outperforming previous approaches. Additional research opportunities involve exploring the potential of storing parameters in quantized fixed-point data formats to minimize resource utilization and extending the acceleration framework to other LSTM-RNN variants. The proposed architecture significantly improves frequency, power efficiency, and GOP/s compared to recent works. Finally, the proposed architecture operates at 17.64 GOP/s, 2.31× faster than the best previously reported results. Andre Xian [7] introduce three hardware accelerators for RNN on Xilinx's Zynq SoC FPGA are introduced in this study. These accelerators are designed to enhance the performance of RNN computations on the FPGA platform. The latest design, DeepRnn, achieves significant performance gains and power efficiency compared to existing platforms. This work has the potential to be further developed for future mobile devices. R. Tavcar [8] aim is to find the most effective approach for implementing the LSTM forward pass and on-chip learning in hardware. The ultimate objective is to create a co-processor specifically designed for RNNs that can be seamlessly incorporated into upcoming devices. However, additional research is needed to achieve this goal. Xiang Li [9] proposed a novel algorithm called LightRNN for Natural Language Processing (NLP) tasks. LightRNN adopts a 2-Component shared embedding approach for word representations, enhancing the model size and running time efficiency. This approach proves particularly advantageous when dealing with text corpora containing large vocabularies. C. Li paper [10] introduced a SOC prediction approach for Lithium-ion batteries, aiming to accurately determine the remaining charge level based on various parameters and characteristics, utilizing GRU-RNN and advanced deep learning techniques. The method establishes a correlation between measurable factors, such as voltage, current, and temperature, to accurately estimate the SOC of Lithium-ion batteries. To assess the effectiveness of the proposed approach, two publicly available datasets of vehicle drive cycles and a dataset representing high-rate pulse discharge conditions were utilized. The method yielded MAEs of 0.86%, 1.75%, and 1.05%, respectively. Future efforts refine the proposed method to enable its implementation on prototype hardware for the BMS. 
The ongoing research in this field [11] is centred around altering the LSTM architecture to allow the computations of different stages within the network to overlap. By using FPGAs, users can trade resource utilization against operating speed by adjusting the level of parallelization of the various operations, such as vector multiplication and addition; this flexibility enables the computational workload to be tailored to different FPGA boards, as illustrated by the sketch below. D. J. Pagliari [12] implemented a hardware accelerator for LSTM-RNNs using microarchitectures that exploit parallelism, presenting three different approaches for implementing RNNs on FPGAs, with a focus on the LSTM architecture.
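As a minimal, illustrative sketch of this resource/speed trade-off, the Vitis-HLS-style C++ kernel below exposes a single tunable unroll factor that controls how many multiply-accumulate units are instantiated for the vector operations inside an LSTM gate; the kernel name, vector length, and unroll factor are assumptions for illustration and are not taken from [11] or [12].

```cpp
// Illustrative HLS sketch: dot product of a weight row with the input vector,
// as used inside an LSTM gate computation.
// (Names, sizes, and the unroll factor are assumed for illustration.)
#define VEC_LEN    64   // assumed vector length
#define PAR_FACTOR 8    // parallelization level: higher -> more DSP/LUT usage, fewer cycles

typedef float data_t;

data_t dot_product(const data_t w[VEC_LEN], const data_t x[VEC_LEN]) {
    data_t acc = 0;
dot_loop:
    for (int i = 0; i < VEC_LEN; i++) {
#pragma HLS UNROLL factor=PAR_FACTOR
        acc += w[i] * x[i];  // PAR_FACTOR multiply-accumulates performed in parallel
    }
    return acc;
}
```

Roughly speaking, doubling PAR_FACTOR doubles the multiplier (DSP) usage while halving the loop latency, which is how the same design can be retuned for FPGA boards with different resource budgets.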
3. Motivation

Based on the above literature review, data-driven approaches are accurate and robust and do not rely on any model or prior knowledge of the battery's internal parameters, but they require a large amount of data. Researchers suggest that deep learning algorithms such as RNN, LSTM, and BiLSTM offer more accuracy and advantages than ML model-based algorithms. Deep learning algorithms are a subset of machine learning consisting of three or more network layers. Every layer of a deep learning network performs complex mathematical operations, chiefly multiplications and accumulations, that combine the input data with parameters such as weights, biases, and activation functions. Storing this data and processing it in real time therefore requires on-chip memory and a capable processor. Currently, CPUs and GPUs are used to process the data, but CPUs are slower than GPUs, so high-speed processors such as GPUs are often employed for deep learning. GPUs, however, can be both power-hungry and expensive, making them difficult to employ in real-time applications. FPGAs, on the other hand, are highly scalable and configurable and consume less power; compared with both GPUs and CPUs, they can be the preferable option, providing low-cost hardware and improved throughput. Based on these findings and current research developments, we propose AI-based, data-driven SOC and SOH prediction algorithms on low-cost FPGA hardware to improve performance. Given the computational complexity associated with deep learning, this paper proposes an FPGA-implemented AI-based accelerator for the electric vehicle BMS and develops a hardware prototype. Researchers have shown significant interest in FPGA-based accelerators because of their remarkable performance, high energy efficiency, and exceptional flexibility. The FPGA is an ideal hardware platform for embedded devices, offering high performance and low energy consumption. Additionally, HLS has considerably reduced the design period.
4. LSTM

LSTM is an abbreviation for Long Short-Term Memory. It is a recurrent neural network (RNN) specifically designed to tackle the vanishing-gradient problem, which can occur in standard RNNs when information is propagated over long sequences. The LSTM architecture includes a set of memory cells and gating mechanisms that enable the network to selectively remember or forget information over time, based on its relevance to the current task. Because of their ability to handle long-term dependencies, LSTMs are particularly effective for tasks involving natural language processing (NLP), speech recognition, image processing, and time-series prediction. Figure 1 illustrates the architecture of the LSTM cell. At each time step t, the LSTM cell takes the input xt together with the hidden state ht-1 from the previous time step t-1 and generates an output ht, which becomes part of the input for the subsequent time step. This feedback mechanism continues throughout the network until the final time step is reached. The LSTM cell consists of multiple components, namely an input gate, a cell gate, an output gate, and a forget gate, which enable the cell to store and discard features as needed. Moreover, each LSTM cell possesses a dedicated memory, denoted ct, that maintains the current state of the network.
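For reference, the gate behaviour described above corresponds to the standard LSTM cell update equations (a conventional formulation; the exact weight naming used in the implementation may differ):

\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) &&\text{(forget gate)}\\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) &&\text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) &&\text{(cell/candidate gate)}\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) &&\text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(cell state)}\\
h_t &= o_t \odot \tanh\left(c_t\right) &&\text{(hidden state / output)}
\end{aligned}

where \sigma is the logistic sigmoid, \odot denotes element-wise multiplication, and the W, U, and b terms are the trainable weights and biases of each gate.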