CFNTT: Scalable Radix-2/4 NTT Multiplication Architecture with an Efficient Conflict-free Memory Mapping Scheme-Reference-Cited by-同舟云学术

CFNTT: Scalable Radix-2/4 NTT Multiplication Architecture with an Efficient Conflict-free Memory Mapping Scheme

Published:2021-11-19 Issue: Volume: Page:94-126
ISSN:2569-2925
Container-title:IACR Transactions on Cryptographic Hardware and Embedded Systems
language:
Short-container-title:TCHES

Author:

Chen Xiangren,Yang Bohan,Yin Shouyi,Wei Shaojun,Liu Leibo

Abstract

Number theoretic transform (NTT) is widely utilized to speed up polynomial multiplication, which is the critical computation bottleneck in a lot of cryptographic algorithms like lattice-based post-quantum cryptography (PQC) and homomorphic encryption (HE). One of the tendency for NTT hardware architecture is to support diverse security parameters and meet resource constraints on different computing platforms. Thus flexibility and Area-Time Product (ATP) become two crucial metrics in NTT hardware design. The flexibility of NTT in terms of different vector sizes and moduli can be obtained directly. Whereas the varying strides in memory access of in-place NTT render the design for different radix and number of parallel butterfly units a tough problem. This paper proposes an efficient conflict-free memory mapping scheme that supports the configuration for both multiple parallel butterfly units and arbitrary radix of NTT. Compared to other approaches, this scheme owns broader applicability and facilitates the parallelization of non-radix-2 NTT hardware design. Based on this scheme, we propose a scalable radix-2 and radix-4 NTT multiplication architecture by algorithm-hardware co-design. A dedicated schedule method is leveraged to reduce the number of modular additions/subtractions and modular multiplications in radix-4 butterfly unit by 20% and 33%, respectively. To avoid the bit-reversed cost and save memory footprint in arbitrary radix NTT/INTT, we put forward a general method by rearranging the loop structure and reusing the twiddle factors. The hardware-level optimization is achieved by excavating the symmetric operators in radix-4 butterfly unit, which saves almost 50% hardware resources compared to a straightforward implementation. Through experimental results and theoretical analysis, we point out that the radix-4 NTT with the same number of parallel butterfly units outperforms the radix-2 NTT in terms of area-time performance in the interleaved memory system. This advantage is enlarged when increasing the number of parallel butterfly units. For example, when processing 1024 14-bit points NTT with 8 parallel butterfly units, the ATP of LUT/FF/DSP/BRAM n radix-4 NTT core is approximately 2.2 × /1.2 × /1.1 × /1.9 × less than that of the radix-2 NTT core on a similar FPGA platform.

Publisher

Universitatsbibliothek der Ruhr-Universitat Bochum

Subject

General Earth and Planetary Sciences,General Environmental Science

Cited by 17 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Hybrid Radix-16 booth encoding and rounding-based approximate Karatsuba multiplier for fast Fourier transform computation in biomedical signal processing application;Integration;2024-09

2. Lightweight Extension of RISC-V Core for NTT-like Algorithms: (PhD Forum Paper);2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP);2024-07-24

3. RISC-V SoC with NTT-Blackbox for CRYSTALS-Kyber Post-Quantum Cryptography;2024 9th International Conference on Integrated Circuits, Design, and Verification (ICDV);2024-06-06

4. A High-Performance, Conflict-Free Memory-Access Architecture for Modular Polynomial Multiplication;IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems;2024-02

5. Scalable and Parallel Optimization of the Number Theoretic Transform Based on FPGA;IEEE Transactions on Very Large Scale Integration (VLSI) Systems;2024-02