Speeding up trace acquisition

I’m doing some fixed-vs-random TVLA with the CW-lite and CW305. Currently, the loop looks something like:

  • Prep input, write to registers on target,
  • Trigger crypto and capture,
  • Read back trace,

and I periodically check the output and write some traces to file. However, this is pretty slow: gathering a couple million traces takes a day or two. I assume communication with the PC is the bottleneck. This is OK for smaller units (e.g. by putting a bunch of parallel copies of the UUT on the target), but not for testing larger higher-order masked implementations.
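For a rough sense of scale, here is a back-of-envelope calculation. It assumes "a couple million traces in a day or two" means 2 million traces in 1.5 days; both figures are illustrative stand-ins, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def traces_per_second(n_traces: int, days: float) -> float:
    """Average acquisition rate for n_traces captured over the given number of days."""
    return n_traces / (days * SECONDS_PER_DAY)


def days_needed(n_traces: int, rate: float) -> float:
    """Days required to capture n_traces at a constant rate (traces per second)."""
    return n_traces / rate / SECONDS_PER_DAY


# Assumed reading of "a couple million traces in a day or two".
rate = traces_per_second(2_000_000, 1.5)
print(f"{rate:.1f} traces/s")                                   # about 15.4 traces/s
print(f"{days_needed(100_000_000, rate):.0f} days for 100M")    # 75 days
```

At that rate a 100M-trace campaign on a single setup would take about 75 days, which is why the per-trace overhead matters so much.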

The “best” solution is probably to shift input generation to some other hardware that communicates faster with the target (and whose power supply is sufficiently separated from the target’s). E.g. write the fixed input once and let this additional hardware generate randomness, split the input, etc. There was at least one thread discussing something like this: CW305 Improved acquisition rate.

A few questions:

  • What can I do short of the above to increase the rate of acquisition? Any tips or tricks?
  • What’s the simplest way to implement something like the above?
  • How do people get ~100M traces in a reasonable amount of time? Do they have parallel setups running for a week or two?

Yeah, there are a couple of bottlenecks. The first is the communication between the host PC and the target. The way our examples are built, the host needs to send the job setup information to the target before every single capture, and that is a major factor limiting capture speed. There are different ways around that: the post you linked involved modifying the CW305 SAM3U firmware, but it’s also possible to do this in the target FPGA. The objective is to have a single host command that kicks off multiple target jobs.
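One way to picture the "single host command, multiple jobs" idea is a seed-based scheme: the host sends only a seed and a job count, the target expands them into inputs, and the host replays the same generator afterwards to label each trace as fixed or random. The sketch below is a host-side simulation only. The xorshift32 generator, the `generate_jobs` name, and the 16-byte input size are assumptions made for illustration, not part of any ChipWhisperer example, and a real TVLA setup would likely want a stronger generator than xorshift32.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state: int) -> int:
    """One step of Marsaglia's 32-bit xorshift (cheap to implement in FPGA logic)."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state & MASK32


def generate_jobs(seed: int, n_jobs: int, fixed_input: bytes, words_per_input: int = 4):
    """Expand (seed, n_jobs) into a list of (is_random, input_bytes) jobs.

    The target would run the equivalent logic in hardware; the host runs this
    function with the same seed to recover which input each trace belongs to.
    """
    if seed == 0:
        raise ValueError("xorshift32 needs a non-zero seed")
    state = seed
    jobs = []
    for _ in range(n_jobs):
        state = xorshift32(state)
        is_random = bool(state & 1)  # low bit picks the fixed or random group
        if is_random:
            words = []
            for _ in range(words_per_input):
                state = xorshift32(state)
                words.append(state)
            data = b"".join(w.to_bytes(4, "big") for w in words)
        else:
            data = fixed_input
        jobs.append((is_random, data))
    return jobs


# Host and (simulated) target derive identical job lists from the same seed.
fixed = bytes(16)
target_side = generate_jobs(0x12345678, 1000, fixed)
host_side = generate_jobs(0x12345678, 1000, fixed)
assert target_side == host_side
```

With something like this in place, the per-capture traffic from the host shrinks to one command per batch instead of one register write sequence per trace.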

Once you have this in place, the second change needed is around the trace capture. Again, the objective is to run a single capture command from the host side that captures the power traces for multiple jobs in a single go. With CW-lite you’re quite limited, since it has limited sample storage and no streaming. With Husky it’s much easier: with streaming and the segmented capture mode (see section 4 here), it’s very easy to grab multiple power traces at once.
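On the host side, a multi-job capture then has to be cut back into individual traces. The sketch below assumes the segments arrive back-to-back in one flat buffer with a fixed number of samples per segment; that layout, the helper names, and the example buffer capacity are assumptions for illustration, so check the scope documentation for the real format and limits.

```python
def max_segments(buffer_capacity: int, samples_per_segment: int) -> int:
    """How many fixed-length segments fit into a sample buffer of the given size."""
    return buffer_capacity // samples_per_segment


def split_segments(buffer, n_segments: int, samples_per_segment: int):
    """Cut a flat capture buffer into n_segments traces of equal length."""
    expected = n_segments * samples_per_segment
    if len(buffer) < expected:
        raise ValueError(f"buffer holds {len(buffer)} samples, need {expected}")
    return [
        buffer[i * samples_per_segment:(i + 1) * samples_per_segment]
        for i in range(n_segments)
    ]


# Toy example: 12 samples captured as 3 jobs of 4 samples each.
flat = list(range(12))
traces = split_segments(flat, n_segments=3, samples_per_segment=4)
print(traces)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Hypothetical capacity: a 24000-sample buffer holds 48 segments of 500 samples.
print(max_segments(24000, 500))  # 48
```

The same arithmetic shows why limited sample storage hurts: the number of jobs per host command is capped by buffer size divided by trace length, unless streaming lifts that cap.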

This blog post from Raelize is a nice write-up on doing exactly this. It should be possible to reach about the same capture rate (billions of traces per day!) with Husky.