Mobile QR Code QR CODE

  1. (Department of Electrical Engineering, POSTECH, Pohang, Gyeongbuk, Korea 37673)
  2. (Department of Communication Engineering, Handong Global University, Pohang, Gyeongbuk, Korea 37554)



Post-quantum cryptography, lattice-based cryptography, security, accelerator, learning-with errors

I. INTRODUCTION

As large-scale quantum computer development progresses [1], threats have been raised about the vulnerability of traditional public-key based cryptography. The National Institute of Standards and Technology (NIST) has been standardizing on post-quantum cryptography (PQC) (Fig. 1). After several rounds, in August 2023, NIST published the initial drafts (FIPS-203, FIPS-204, and FIPS-205) for PQC [2]. Two of them are ML-KEM and ML-DSA algorithms based on module learning with errors (MLWE), which are based on lattice-based cryptography that is robust against quantum attacks.

Because ML-KEM and ML-DSA require complex computations, their software implementation introduces a lot of latency. In particular, sampling based on cryptographically-secure pseudo random number generator (CS-PRNG) and number-theoretic transforms (NTT) based on modular arithmetic are very complex for microcontroller units (MCUs) to handle. To solve these problems, MLWE accelerator implementations on ASICs or FPGAs have been reported [3-6]. Especially considering that IoT devices are more vulnerable to security attacks, a compactly designed crypto processor is needed in the IoT environment.

The followings are the contributions of this work. 1) It reduces lots of latency of sampling and NTT operations by using parallel hardware structure. 2) A modular multiplier design optimized for the specific prime numbers of ML-KEM and ML-DSA. 3) This work supports both of number systems for ML-KEM and ML-DSA efficiently by using clock-gating method.

This work presents the fastest PQC crypto-processor that processes MLWE-based PQC algorithms ML-KEM and ML-DSA. It is implemented and estimated under 28nm CMOS technology @0.9V supply voltage. The estimation is done by using Synopsys Design Compiler syn_R-2020.09, IC Compiler 2020.09-SP1, Primetime prime-R-2020.09-SP1.

Fig. 1. A summary of PQC standardization
../../Resources/ieie/JSTS.2024.24.1.55/fig1.png

II. ARCHITECTURE

Fig. 2 shows the overall architecture of the proposed PQC processor. It consists of memory and operator modules for sampling and arithmetic.

The memory module has an instruction cache (2 KB), banks for polynomials (6 KB), and banks for twiddle factors (1.5 KB). 32-bit instructions can be stored in the instruction cache. The on-chip memory for 8 polynomials is required to process the PQC algorithms on chip. For each polynomial, a bank consists of 16 polynomial coefficient memories, enabling polynomial operations to be accelerated 16x faster. The twiddle factors stored in the twiddle banks are used in the NTT computation\textcolor{color-1}{.} We could reduce the size of ${\omega}$-bank by merging the NTT and multiplication by powers of ${\psi}$ (${\omega}$ = ${\psi}$$^{2}$).

The Arithmetic module performs the polynomial operations required by PQC. It handles modular arithmetic of polynomials, NTT operations, etc. Arithmetic Logic Elements (ALEs) array efficiently handle the operations, and each ALE contains necessary arithmetic units.

The Keccak-based CS-PRNG generates a random bitstream for sampling. The bitstream is passed through specific sampling algorithm filters to generate polynomial coefficients with a uniform, Gaussian distribution. With parallel structure of memory and sampling filters, the latency for sampling is reduced.

Fig. 2. Overall architecture of the proposed crypto-processor.
../../Resources/ieie/JSTS.2024.24.1.55/fig2.png

III. SAMPLER

In software implementation, sampling operations account for 70% of the computation latency [4]. Fig. 3 shows the architecture of the sampler module. CS-PRNG generates a random bitstream, and 5 different sampling filters produce polynomial coefficients with uniform or gaussian distribution.

Fig. 3. Block diagram of sampler module.
../../Resources/ieie/JSTS.2024.24.1.55/fig3.png

1. Real-time Psuedo Random Number Generator

The proposed CS-PRNG uses the 1600b Keccak-f algorithm to compute SHA-3 functions. It performs four hash functions, called SHA3-224, SHA3-256, SHA3-384, and SHA3-512, and two extendable-output functions (XOFs), called SHAKE128 and SHAKE256. The random bitstream used for sampling are generated by SHAKE128 and SHAKE256. Since these functions are based on very complex operations, hardware implementation of SHA-3 significantly reduces latency and energy consumption, compared to software implementation.

2. Sampling Filters

ML-KEM and ML-DSA use uniform rejection (UR), centered binomial distribution (CBD), uniform eta (UE), uniform gamma (UG), and sample in ball (SIB) algorithms for sampling operations. The UR filter samples a random number with equal probability within [0, q-1]. It discards the number that doesn’t satisfy the condition ({\textless} q) of the rejection bound in order to make data with uniform probability from the incoming bitstream. CBD filter generates polynomial coefficients with Gaussian distribution. It samples the result of subtraction after obtaining two Hamming Weights (HW) from random bitstream. UE and UG filters sample the number within the range [-${\eta}$, ${\eta}$] and [-${\gamma}$+1, ${\gamma}$] respectively. The filter is designed through a comparator with ${\eta}$ and ${\gamma}$ constants set according to the PQC protocol. The SIB filter is implemented using a comparator and a mux. Since it needs to be operated by checking the polynomial coefficient one by one, it is implemented without parallelism.

3. Evaluation

Fig. 4 and Table 1 show a comparison of the latency generated in sampling by the software and hardware implementations. When performing UR sampling, which consumes the most clock cycles in software implementation, the proposed work reduces the latency by 99% compared to software. For CBD and UE sampling, the latency reduction is 92% and 87% respectively. This improvement is achieved by implementing the efficient CS-PRNG hardware design and parallelization of the polynomial banks. The software benchmark is based on an Intel i7-7700 environment.

Fig. 4. Latency of sampler for ML-KEM-512 and ML-DSA-44.
../../Resources/ieie/JSTS.2024.24.1.55/fig4.png
Table 1. Clock cycles of sampling operations

Latency of Sampler

ML-KEM-512

Intel i7-7700

Intel i7-7700

(w/AVX2)

This work

UR (x4)

72,256

15,588

808

CBD (x1)

5,251

6,876

99

ML-DSA-44

Intel i7-7700

Intel i7-7700

(w/AVX2)

This work

UR (x16)

241,688

87,480

2,752

UE (x1)

4,850

2,934

155

IV. ARITHMETIC LOGIC ELEMENT ARRAY

Fig. 5 shows the structure of an ALE array. The ALE array consists of 32 ALEs. Each ALE contains an arithmetic, modular arithmetic, and logical arithmetic operator. All of 32 ALEs operates when the crypto-core processes NTT or INTT instructions. For modular multiplier, we implemented Barrett reduction as hardware.

Fig. 5. Block diagram of ALE array module.
../../Resources/ieie/JSTS.2024.24.1.55/fig5.png

1. Modular Multiplier

The modular multiplication is the heaviest part in the ALE. It involves arithmetic multiplication and calculating the remainder on the q values. To reduce the complexity of the operation, Montgomery, Barrett reduction, etc. are used. However, the reduction operation also requires high power consumption and latency.

On the other hand, modulo q of the PQC protocols is a fixed prime number. Under these conditions, Barrett reduction can be implemented through a simple algorithm (Fig. 6). We implemented the reduction using only a few adders, subtractors, and shifters.

Fig. 6. Block diagram of modular multiplier.
../../Resources/ieie/JSTS.2024.24.1.55/fig6.png

2. NTT/INTT

Fig. 5 shows the data flow of NTT/INTT. The operation is processed through Cooley-Tukey (CT) and Gentleman-Sande (GS) butterfly configurations, respectively. Each ALE uses modular operators to implement the appropriate butterfly unit according to the operation. We reduced memory accesses by using an 8x4 ALE array. Since the dimension of the polynomial in both ML-KEM and ML-DSA is 256, only 16x2 16-to-16 bijection operations are required to complete the NTT/INTT operation. The implementation is realized by inserting a reordering MUX stage. The reordering block configures any bijection according to the operation type and NTT/INTT stage.

3. Polynomial Arithmetic

The key, message, and ciphertext of the MLWE algorithm are all coefficients of a polynomial. Except for sampling, pointwise arithmetic between polynomials accounts for most of the computation in both ML-KEM and ML-DSA. We use ALE arrays to parallelize the polynomial arithmetic. Since we have 16-parallel memory of coefficient, 16 ALEs out of 32 are used for polynomial arithmetic, which are shown as gray box in Fig. 5.

4. Evaluation

Table 2 shows a comparison of the latency of NTT/INTT in the software and hardware implementations. Compared to the Intel i7-7700 environment, the proposed hardware reduces the running time of NTT/INTT by 99%.

Table 2. Clock cycles of NTT & INTT operations

Latency of NTT & INTT of a polynomial

ML-KEM

Intel i7-7700

Intel i7-7700

(w/AVX2)

This work

NTT

9,138

365

61

INTT

12,301

382

61

ML-DSA

Intel i7-7700

Intel i7-7700

(w/AVX2)

This work

NTT

6,750

2,270

61

INTT

9,535

2,115

61

V. PARALLEL OPERATION AND CLOCK GATING

To reduce the execution time of the protocol, we designed the instructions that make the operation of the sampler and ALE array to be performed in parallel. We also reduced the power consumption of the ML-KEM by clock gating.

1. Parallel Operation

Depending on the order of the instructions, the sampler and ALE array can operate independently. If there is a situation where both sampling and NTT operations are required for independent polynomials, the latency can be reduced by performing the operations in parallel (Fig. 7).

Fig. 7. Effect of instruction for parallel operations.
../../Resources/ieie/JSTS.2024.24.1.55/fig7.png

2. Clock Gating

In the proposed implementation, both ML-KEM and ML-DSA are performed on the same hardware. In both of them, all operations are performed on the quotient ring, but modulo q values are different (3329 for ML-KEM and 8380417 for ML-DSA). Operating ML-KEM, about half of registers consume power just by clock network. We applied clock gating for removing the unnecessary power consumption. Fig. 8 shows a comparison of the power consumption of the ALE array when performing NTT operations on ML-KEM. The power consumption of ALE array is reduced by 16% by clock gating.

Fig. 8. Effect of clock gating.
../../Resources/ieie/JSTS.2024.24.1.55/fig8.png

VI. IMPLEMENTATION

The PQC processor is implemented on a 28 nm CMOS technology process using 9.5 KB of on-chip memory and 511.8k NAND gate equivalents. Fig. 9 shows the layout of the proposed design. Table 3 compares the performances with previous works. This implementation consumes 16~24 mW from a supply voltage of 0.9 V while processing ML-KEM-512 and ML-DSA-44. Compared to state-of-the-art [6], the implementation shows 2.91x and 2.62x higher power efficiency for ML-KEM-512 and ML-DSA-44, respectively.

Fig. 9. Layout of PQC processor.
../../Resources/ieie/JSTS.2024.24.1.55/fig9.png
Table 3. Performance Comparison

[4]

[5]

[6]

This Work

Process

40 nm

28 nm

28 nm

28 nm

Frequency [MHz]

12~72

300

500

83*

Voltage [V]

0.68~1.1

0.9

0.9

0.9

Area [mm2]

0.28

-

3.6

0.83

Logic Gates

106k

942k

-

560k

SRAM [KB]

40.25

12

448

7.49

Power [mW]

5~7.5

-

163~304

16~24*

ML-KEM

Support

Support

Support

Support

ML-DSA

Support

-

Support

Support

ML-KEM-512 (Keygen + Encaps. + Decaps.)

Clock Cycles

348,526

144,431

10,433

5,794

Power. Eff. [OPS/W]

37,617

78,125

294,018

856,768*

Area. Eff. [GOPS/Wm2]

134

-

82

1,032

ML-DSA-44 (Keygen + Sign + Verify)

Clock Cycles

829,201

-

43,376

27,426

Power. Eff. [OPS/W]

11,472

-

48,637

127,477*

Area. Eff. [GOPS/Wm2]

41

-

14

153

* Measured by Primetime prime-R-2020.09-SP1.

V. CONCLUSIONS

This work proposes a hardware architecture for a PQC processor that supports ML-KEM and ML-DSA. The proposed design enables efficient modular multiplication using Barrett reduction, which is configurable specifically for the PQC algorithm. It also utilizes a parallel memory structure and a real-time sampler module. The architecture is implemented in 28nm CMOS technology. The proposed design uses the smallest on-chip memory and achieves 2.91x higher power efficiency compared to the previous state-of-the-art.

References

1 
J. Park, et al., “A Fully Integrated Cryo-CMOS SoC for Qubit Control in Quantum Computers Capable of State Manipulation, Readout and High-Speed Gate Pulsing of Spin Qubits in Intel 22nm FFL FinFET Technology,” ISSCC, pp. 208-209, Feb. 2021.URL
2 
National Institute of Standards and Technology (NIST), “Module-Lattice-Based Key-Encapsulation Mechanism Standard,” Federal Information Processing Standards Publications (FIPS PUBS) 203, Aug. 2023.URL
3 
L. Ma, X. Wu and G. Bai, "Parallel polynomial multiplication optimized scheme for CRYSTALS-KYBER Post-Quantum Cryptosystem based on FPGA," 2021 International Conference on Communications, Information System and Computer Engineering (CISCE), Beijing, China, 2021, pp. 361-365.DOI
4 
U. Banerjee, A. Pathak, and A. P. Chandrakasan, “An Energy-Efficient Configurable Lattice Cryptography Processor for the Quantum-Secure Internet of Things,” ISSCC, pp. 46-47, Feb. 2019.URL
5 
G. Xin, et al., “VPQC: A Domain-Specific Vector Processor for Post-Quantum Cryptography Based on RISC-V Architecture,” TCAS-I, vol. 67, no. 8, pp. 2672-2684, Aug. 2020.DOI
6 
Y. Zhu, et al., “A 28nm 48KOPS 3.4µJ/Op Agile Crypto-Processor for Post-Quantum Cryptography on Multi-Mathematical Problems,” ISSCC, pp. 514-516, 2022.DOI
ByungJun Kim
../../Resources/ieie/JSTS.2024.24.1.55/au1.png

ByungJun Kim received the B.S. and M.S. degrees in electronic and electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2016 and 2018, respectively, where he is currently pursuing the Ph.D. degree. His research interests include post-quantum cryptography processor and homomorphic encryption accelerator.

Han-Gyeol Mul
../../Resources/ieie/JSTS.2024.24.1.55/au2.png

Han-Gyeol Mul received the B.S. degree in electrical engineering and the Ph.D. degree in convergence IT engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2015 and 2023, respectively. He is currently a Post-Doctoral Researcher with the Department of Electronic and Electrical Engineering, POSTECH. His research primarily focuses on developing low-power and energy-efficient deeplearning hardware solutions specifically tailored for edge IoT devices.

Shinwoong Kim
../../Resources/ieie/JSTS.2024.24.1.55/au3.png

Shinwoong Kim received the B.S. and M.S. degrees in electrical engineering and information and communication engineering from Handong Global University, Pohang, South Korea, in 2009 and 2011, respectively, and the Ph.D. degree in electronic and electrical engineering from the Pohang University of Science and Technology, Pohang, South Korea, in 2016. From 2016 to 2022, he was a Senior Engineer at Samsung Electronics, Hwasung, South Korea, where he involved in the design of Local Oscillator (LO) domain including all-digital phase-locked loop for RF communication systems. In 2022, he joined Handong Global University, Pohang, South Korea, where he is currently a Assistnat Professor. His current research interests include analog/digital frequency synthesizer and low-power clock generation.

JongMin Lee
../../Resources/ieie/JSTS.2024.24.1.55/au4.png

JongMin Lee received the B.S. degrees in electronic and electrical engineering from Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2023, where he is currently pursuing the M.S. degree. His research interests include Low DropOut regulator and hardware accelerator for machine learning.

Jae-Yoon Sim
../../Resources/ieie/JSTS.2024.24.1.55/au5.png

Jae-Yoon Sim (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic and electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 1993, 1995, and 1999, respectively. From 1999 to 2005, he was a Senior Engineer with Samsung Electronics, Hwasung, South Korea. From 2003 to 2005, he was a Post-Doctoral Researcher at the University of Southern California, Los Angeles, CA, USA. From 2011 to 2012, he was a Visiting Scholar with the University of Michigan, Ann Arbor, MI, USA. In 2005, he joined POSTECH, where he is currently a Professor. Since 2019, he has been the Director of the Scalable Quantum Computer Technology Platform Center, Pohang, South Korea, nominated by the Korean Ministry of Science and Information and Communication Technology (ICT). His research interests include sensor interface circuits, high-speed serial/parallel links, phase-locked loops, data converters, braininspired computing, and quantum computing. Dr. Sim was a co-recipient of the Takuo Sugano Award from the IEEE International Solid-State Circuits Conference (ISSCC) in 2001 and a recipient of the Special Author Recognition Award from ISSCC 2013. In 2020, he received the Scientist of the Month Award sponsored by the Ministry of Science and ICT of Korea. He has been an IEEE Distinguished Lecturer from 2018 to 2019. He has served on the Technical Program Committees for the ISSCC, the IEEE Symposium on VLSI Circuits, and the IEEE Asian Solid-State Circuits Conference.