Theoretical limit of ChipWhisperer for SCA

NewDwarf · January 4, 2024, 7:14pm

Hi!
According to the specs, the CW Lite has a 10-bit, 105 MS/s ADC, while the CW Husky has a 12-bit, 200 MS/s ADC.

Do I understand correctly that the sampling rate depends on many factors, such as:

Instructions being executed by the CPU (each instruction has its own effect, depending on the resources it uses, such as registers and memory)
Order of instructions (instructions in the pipeline can interfere with each other, changing the shape of the power trace)
Number of stages in the pipeline (more instructions in the pipeline cause more interference in the power trace)
Taking the above into account, what is the general rule for setting the sampling rate relative to the clock that drives the target CPU? What is the highest target CPU clock at which the CW Lite can still gather power traces good enough to run a CPA attack?

…I have been working with the CW Lite, CW308 UFO and CW308T (STM32F415RGT6 with HW AES) to recover the 128-bit AES key via SCA, and I have some additional questions:

How many clock cycles does the HW crypto engine on the STM32F415RGT6 take to execute one AES round?
If the STM32F415RGT6 or any other MCU takes 1 clock cycle to execute an AES round, is it possible to build a Hamming distance leakage model? How many ADC samples does CPA require in this case?
Some CPUs have a DMA crypto engine that runs 3 AES rounds per clock cycle. Is it possible to run CPA against such an implementation?

jpthibault · January 4, 2024, 7:53pm

CW-lite is meant to sample the target synchronously at either 1 or 4 samples per target clock cycle. So, the target clock frequency doesn’t matter as long as it results in a sampling rate that’s supported by CW-lite. (I wonder if you are confusing “sampling rate” with “number of power traces”?)
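The rule above can be sketched as a small check. This is only an illustration: the function names are hypothetical, the 105 MS/s ceiling is the CW Lite figure quoted earlier in the thread, and "target clock" here means the clock that the ADC sample clock is derived from.

```python
# Minimal sketch: does a synchronous capture fit within the CW Lite ADC limit?
# Assumption: 105 MS/s is the ceiling, as quoted from the specs in this thread.
CW_LITE_MAX_ADC_HZ = 105_000_000


def synchronous_adc_rate(target_clock_hz, samples_per_clock):
    """ADC rate when sampling 1 or 4 times per target clock cycle."""
    if samples_per_clock not in (1, 4):
        raise ValueError("this sketch only models x1 and x4 sampling")
    return target_clock_hz * samples_per_clock


def fits_cw_lite(target_clock_hz, samples_per_clock):
    """True if the resulting ADC rate does not exceed the assumed limit."""
    rate = synchronous_adc_rate(target_clock_hz, samples_per_clock)
    return rate <= CW_LITE_MAX_ADC_HZ


for clock_hz in (7_370_000, 25_000_000, 100_000_000):
    for multiplier in (1, 4):
        rate = synchronous_adc_rate(clock_hz, multiplier)
        status = "ok" if fits_cw_lite(clock_hz, multiplier) else "too fast"
        print(f"{clock_hz} Hz x{multiplier} -> {rate} S/s ({status})")
```

Under these assumptions, a 100 MHz clock would only fit with one sample per cycle, while slower clocks leave room for four.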

I don’t have an STM32F4 handy, but if you run a capture of it using its HW engine, then scope.adc.trig_count will tell you.
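As a rough sketch of how that reading could be turned into a cycle count: the helper below is hypothetical, and it assumes that scope.adc.trig_count is reported in ADC samples and that the ADC runs at 1 or 4 samples per clock. The example value is made up, not a measurement.

```python
def trigger_length_in_clocks(trig_count, samples_per_clock):
    """Convert a trigger-high duration from ADC samples to clock cycles.

    Assumption: trig_count is counted in ADC samples, and the ADC is
    clocked at samples_per_clock times the reference clock.
    """
    if samples_per_clock not in (1, 4):
        raise ValueError("this sketch only models x1 and x4 sampling")
    return trig_count / samples_per_clock


# On real hardware the count would be read from scope.adc.trig_count
# after a capture; here a hypothetical value stands in for it.
example_trig_count = 400
print(trigger_length_in_clocks(example_trig_count, 4))
```

Note that this gives cycles of the clock the ADC is locked to; if the target multiplies that clock internally, the core cycle count differs by the same factor.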

There is no general answer: every target / implementation can be different with regards to both leakage model and number of samples and number of traces.

Again there is no general answer to such questions; for example if the implementation includes strong countermeasures, a CPA attack may not be possible! However, just having more than one round per clock cycle is not sufficient to disable CPA attacks; we demonstrate this (on an FPGA target) in this notebook.

NewDwarf · January 5, 2024, 2:30pm

The CW Lite has an analog-to-digital converter, and digital signal processing is limited by the Nyquist–Shannon sampling theorem. This theorem puts a physical limit on what the CW Lite can do. If the core of the STM32F415RGT6 is driven at, say, 200 MHz, we simply lose critical information from the sampled power trace and will not be able to run the SCA attack.
The CW Lite has a much lower sampling rate (105 MS/s) than the frequency (200 MHz) in the example above.
According to the theorem, the maximum frequency at which the STM32F415RGT6 can be driven is ~50 MHz.
But this is theory. What about your experience?
What is the real frequency of the CPU core of the STM32F415RGT6?

jpthibault · January 5, 2024, 3:15pm

But the sampling theorem doesn’t really apply here (or rather, it’s not useful here), because the goal is not perfect reconstruction of the power trace in the digital domain. For even very low target clock frequencies, the bandwidth of the power signal will be way beyond what CW-lite can capture. Yet, side-channel attacks can still work, because side-channel attacks do not require a perfect digital capture of the power trace; they just need some small “leakage” that is tied to the secret you want to recover.

What CW shows is that synchronous sampling at 1 (or 4) samples per clock performs very well for side-channel attacks. Obviously, sampling the power once per cycle does not fully capture the power trace – but that’s ok. This paper explores this in more detail.

You can also think of it this way: you could even further undersample (e.g. sample every 10 cycles) and still succeed in a side-channel attack, if the cycle on which you are sampling is carrying some side-channel power leakage. How this applies to any given target will depend on a huge number of variables, from its high-level architecture all the way down to the transistor-level physical implementation.
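A toy simulation can illustrate the undersampling point. Everything in it is an assumption made for illustration: the key byte, the single leaking cycle, the Hamming-weight-of-XOR leakage and the noise level are all invented, and no real target is being modelled.

```python
import math
import random

SECRET_KEY = 0x2B   # hypothetical key byte
LEAK_CYCLE = 10     # hypothetical cycle that carries the leakage
CYCLES = 20         # samples per trace at one sample per cycle
N_TRACES = 500


def hamming_weight(value):
    return bin(value).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(rng):
    """Noise-only traces, except for one cycle that leaks HW(p ^ key)."""
    plaintexts, traces = [], []
    for _ in range(N_TRACES):
        p = rng.randrange(256)
        trace = [rng.gauss(0.0, 1.0) for _ in range(CYCLES)]
        trace[LEAK_CYCLE] += hamming_weight(p ^ SECRET_KEY)
        plaintexts.append(p)
        traces.append(trace)
    return plaintexts, traces


def decimate(traces, step, offset):
    """Keep every step-th sample, starting at offset."""
    return [trace[offset::step] for trace in traces]


def cpa(plaintexts, traces):
    """Return (best key guess, its highest signed correlation)."""
    columns = list(zip(*traces))
    best_guess, best_corr = None, -1.0
    for guess in range(256):
        model = [hamming_weight(p ^ guess) for p in plaintexts]
        for column in columns:
            r = pearson(model, column)
            if r > best_corr:
                best_guess, best_corr = guess, r
    return best_guess, best_corr


plaintexts, traces = simulate(random.Random(1))
# Sampling every 10th cycle, once hitting the leaking cycle and once missing it.
hit_guess, hit_corr = cpa(plaintexts, decimate(traces, 10, 0))
miss_guess, miss_corr = cpa(plaintexts, decimate(traces, 10, 5))
print(f"hit:  guess={hit_guess:#04x} corr={hit_corr:.2f}")
print(f"miss: guess={miss_guess:#04x} corr={miss_corr:.2f}")
```

In this sketch the attack only depends on whether the kept samples include the leaking cycle, not on how completely the trace was captured. The signed correlation is used because an XOR-only model would otherwise be ambiguous between a key and its bitwise complement.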

When a chip manufacturer quotes a maximum frequency, it means that they guarantee it will function correctly at that frequency, even at the extremes of allowed temperature and voltage. You can overclock past that, and it may work (especially if temperature is low, and voltage is higher than nominal), or it may not, but there are no guarantees.

NewDwarf · January 5, 2024, 6:50pm

Hmm… I thought this theorem applies to SCA as well.
Suppose we have the following samples of a power trace, which are valid for SCA:
[634, 84, 384, 2, 95, 711, 9]
If we double the CPU core clock and keep the same sampling rate, we miss a lot of information, such as the sharp spikes that reflect key usage, and get the following data for the same power trace:
[612, 137, 412, 33, 74, 681, 1]
This array is close to the first one but not exactly the same.
I don’t think the second array will be valid for the SCA attack.

Let me ask again: what is the real frequency of the CPU core of the STM32F415RGT6?
I have some ideas for checking my assumptions, to confirm or disprove them.
…this information is fundamental for developing one’s own attacks.

NewDwarf · January 6, 2024, 11:34am

I did some work to confirm that the Shannon theorem applies very well to SCA attacks.
…regarding the question about the frequency of the CPU core of the STM32F415RGT6 used by the ‘simpleserial’ application:
The system core clock is calculated as SystemCoreClock = ((INPUT_CLOCK (HSE_OR_HSI_IN_HZ) / PLL_M) * PLL_N) / PLL_P.
From hal/stm32f4/stm32f4_hal.c, we have

    RCC_OscInitStruct.PLL.PLLM       = 12;  // Internal clock is 16MHz
    RCC_OscInitStruct.PLL.PLLN       = 196;
    RCC_OscInitStruct.PLL.PLLP       = RCC_PLLP_DIV4;

SystemCoreClock = ((7370000 / 12) * 196) / 4 = 30 094 166 Hz

The lab uses scope.clock.adc_src = "clkgen_x4" as the clock source for the ADC module,
which gives 7370000 * 4 = 29 480 000 samples per second.
We can see that the sampling rate of the CW Lite is almost equal to the system core clock of the STM32F415RGT6.
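The arithmetic above can be reproduced with a short script. The function names are hypothetical; the 7.37 MHz input clock and the PLL values are the ones quoted in this post.

```python
INPUT_CLOCK_HZ = 7_370_000  # clock fed to the target, as used in the post


def pll_core_clock_hz(input_hz, pll_m, pll_n, pll_p):
    """SystemCoreClock = ((input / PLL_M) * PLL_N) / PLL_P."""
    return input_hz / pll_m * pll_n / pll_p


def adc_rate_hz(input_hz, multiplier):
    """ADC sample rate when the ADC clock is a multiple of the input clock."""
    return input_hz * multiplier


core_hz = pll_core_clock_hz(INPUT_CLOCK_HZ, pll_m=12, pll_n=196, pll_p=4)
adc_hz = adc_rate_hz(INPUT_CLOCK_HZ, 4)  # "clkgen_x4"

print(f"core clock: {core_hz:,.0f} Hz")
print(f"ADC rate:   {adc_hz:,} S/s")
print(f"samples per core cycle: {adc_hz / core_hz:.3f}")
```

With these values the ADC takes slightly fewer than one sample per core clock cycle.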

To test the Shannon theorem for SCA, we can simply change scope.clock.adc_src = "clkgen_x4" to scope.clock.adc_src = "clkgen_x1".
In this case the sampling rate is about 4 times lower than the target CPU core frequency (7 370 000 samples per second versus 30 094 166 Hz for the CPU core).
The pictures below show how the reliability of recovering the last round key changed.

This proves that the limit of CW devices for running an SCA attack is close to their sampling rate limit.
So the CW Lite works reliably when the target device is clocked at up to 100 MHz,
and the CW Husky at up to 200 MHz.

But I believe it depends heavily on the HW crypto IP implementation (how many rounds are executed per clock cycle).

dongoctuan · January 9, 2024, 4:21pm

Have you tried the attacks with a larger number of power traces? I think it is possible to attack undersampled SCA data, but more power traces are needed.

NewDwarf · January 9, 2024, 5:56pm

I am describing a slightly different problem here…
The original discussion was about the dependency between the clock frequency that drives the CPU core and the sampling rate of the CW.
Of course, if we increase the number of traces collected, we may get a better picture, but there is no guarantee of that.