Hacker News
by AlyssaRowan 4298 days ago
It's becoming gradually more tempting to write a crypto library in assembly language, because at least then, it says exactly what it's doing.

Alas, microcode, and unreadability, and the difficulty of going from a provably correct kind of implementation all the way down to bare metal by hand.

The proposed compiler extension, however, makes sense to me. Let's get it added to LLVM & GCC?

4 comments

That works for well-defined ISAs (like ARM), but not for those with undocumented pipelines, or instructions defined by practice (like x86 and amd64).

In other words, if you write a crypto library in x86 assembler, Intel don't guarantee that they won't introduce a side channel in their next chip model or stepping.

Sadly, I know that only too well: hence my "alas, microcode" comment! A prefix or mode or something which allows code to mark that it is handling secure data, so that it gets constant-time multipliers, for example, or true µop-level register zeroisation, would be handy, but also close to unverifiable - we just have to sort of trust it, which sucks.

Until then, we do the best we can with turtles all the way down. Software running under that same undocumented pipeline is going to find it very hard to access or leak (accidentally or otherwise) internal registers, at least.

For the other avenue of attack (cold-boot attacks), it's also notable that registers, at least, have extremely fast remanence compared to cache, or DRAM - bit-fade is a very complex process, but broadly speaking, faster memory usually fades faster.

Digression along that vein: I basically pulled off a cold-boot attack on my Atari 520 STe in the early 1990s (due to my wanting to, ahem, 'debug' a pesky piece of software that played shenanigans with the reset vector and debug interrupts), with Servisol freezer spray pointed directly at the SIMMs in my Xtra-RAM Deluxe RAM expansion (and no, cold-boot attacks are not new, GCHQ's known about them for at least 3 decades and change under the peculiarly-descriptive cover name NONSTOP, I believe?). It just seemed sensible to me: cold things move slower, and they had a particularly long (and very pretty) remanence - I was able to get plenty of data intact, including finding where I needed to jump to avoid the offending routine and continue my analysis with a saner technique (i.e. one that didn't make me worry about blowing up the power supply or electrocuting myself)! It's harder these days - faster memory - but the technique incredibly still works and was independently rediscovered as such more recently: very much a "wait, this still works on modern RAM?" moment for me. (By the way, when I accidentally pulled out the SIMMs with the internal RAM disabled - whoops - and rebooted the Atari on my first try, it actually powered up with an effect that I can only describe as "pretty rainbow noise with double-height scrolling bombs" that would not have looked out of place in a demoscreen! I don't know if that was just mine, but... the ROM probably never expected to find RAM not working, and I guess the error-plotting routine had a very pretty and unusual error in that event?)

I've never seen or heard of anyone pulling off a NONSTOP on a register in a CPU, or actually even on an L1, L2 or L3 cache (maybe an L2 or L3 might be possible, depending on design?). They're fast - ns->µs remanence? - and cooling doesn't help much. I don't know if it's possible at all, but I'd tentatively suggest that it might be beyond practical attack - unless the attacker has decapped the processor and it's already in their lab (in which case you're fucked, no matter what!). That's what suggests that approaches like TRESOR (abuse spare CPU debug registers to store encryption key; use that key to encrypt keys in RAM), despite being diabolical hacks, actually work.

If you fancy giving it a try in the wild by the way, I think a Raspberry Pi might be a good modern test subject - the RAM's exposed on top of the SoC, there are no access problems, and it's cheap so if it dies for science, it's not such a problem. (Of course, you'd probably want to change bootcode.bin so that it dumps the RAM after it enables it but before it clears it.) The VideoCore IV is kind of a beast - and is frustratingly close to being able to do ChaCha20 extraordinarily efficiently, if I can just figure out how to access the diagonal vectors... or if I can, or if I can fake it.
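For anyone wondering what the "diagonal vectors" refer to: ChaCha20 alternates a column round with a diagonal round over its 4x4 state of 32-bit words. Below is a rough sketch in plain C, following the quarter-round definition in RFC 7539. The function names are made up for illustration, and this says nothing about how the VideoCore IV vector unit would actually map it.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t v, int n)
{
    return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter round on four words of the 16-word state.
 * chacha_quarter_round is an illustrative name, not a library API. */
void chacha_quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] ^= x[a]; x[d] = rotl32(x[d], 16);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = rotl32(x[b], 12);
    x[a] += x[b]; x[d] ^= x[a]; x[d] = rotl32(x[d], 8);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = rotl32(x[b], 7);
}

/* One double round: four column quarter rounds, then four diagonal ones.
 * ChaCha20 applies ten of these double rounds to the state. */
void chacha_double_round(uint32_t x[16])
{
    /* Column round: each quarter round works down one column. */
    chacha_quarter_round(x, 0, 4,  8, 12);
    chacha_quarter_round(x, 1, 5,  9, 13);
    chacha_quarter_round(x, 2, 6, 10, 14);
    chacha_quarter_round(x, 3, 7, 11, 15);

    /* Diagonal round: same operation, but the indices walk the diagonals.
     * On a SIMD unit this is usually done by rotating rows 1..3 by
     * 1, 2 and 3 lanes so the diagonals line up as columns. */
    chacha_quarter_round(x, 0, 5, 10, 15);
    chacha_quarter_round(x, 1, 6, 11, 12);
    chacha_quarter_round(x, 2, 7,  8, 13);
    chacha_quarter_round(x, 3, 4,  9, 14);
}
```

The column half vectorises trivially when each row sits in one vector register; the diagonal half is the part that needs a lane rotation (or some way to fake one), which is presumably the sticking point mentioned above.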

If I wanted to read CPU registers from the outside, there's an easy way: JTAG. You should be able to halt the CPU, read (and modify!) the registers, and resume the CPU.

That should be possible even on x86, though on x86 the relevant documentation is probably hard to find. For some ARM processors, it should be as easy as installing openocd.

Of course, JTAG requires physical access to plug the debugging cable, which puts it in a different category of attack.

I can't believe I'd forgotten about JTAG! Yes, that's definitely more viable than decapping! <g> Same completely-doomed threat model though ("attacker has physical access, can do anything they want and take as long as they need").

Sorry, I've been dealing with a few things more recently which, uh, haven't been quite so accommodating to analysis.

There's a fundamental difference between your main memory and your L3 in that the former is DRAM and the latter is SRAM. In DRAM you have a charge hidden in a well behind a single transistor and it's designed to be stable for a while (the refresh interval) without anybody doing anything to it. SRAM doesn't have that static component at all, it's a set of 6 or 8 transistors which have two stable configurations when powered and which lose their state as fast as all the other logic in your chip as soon as the power is cut.

You can play with the temperature if you want, but the mechanisms that prevent unauthorized access in normal conditions will have their lifetimes extended or decreased as much as you change the lifetime of the data you're trying to access. And liquid nitrogen temperatures at least tend to make everything happen faster in CMOS circuitry. That's governed by a complex interaction between the effect of temperature on carrier density and carrier mobility, so I'm not sure that you couldn't slow things down with, say, liquid helium, but even then I'm not sure you're buying anything.

Yes, SRAM has close-to-zero remanence. But if I were running, say, a Haswell, my L3 might indeed be DRAM (depending on the model).

Saves die space, even with the controller overhead. They love it on SoCs in general, particularly where anything with embedded graphics is concerned.

I'm curious what mechanisms you're talking about. Did you mean physical access? General-purpose processors don't have those (why would they?). Specialised cryptographic SoCs which try to prevent physical attack… well, let's just say for now that results may vary and that if a determined adversary has unlimited physical access to a device, you cannot prevent its compromise.

Another good reason to write crypto in assembly is to ensure that the implementation is not susceptible to timing attacks. If your code has different code paths that take different amounts of clock time, attackers can use that. This can be difficult to achieve in a high-level language.
Using assembly won't preclude vulnerability to timing attacks, esp. on x64. Nowadays, beating even the C compiler performance-wise with hand-written assembly is exceedingly difficult.
The point isn't to be faster, it is to be consistent.
That's what I mean: actually getting it consistent is hard, as the performance is really hard to predict and may change even with CPU stepping. Even then, it requires very solid planning as well.
I think you misunderstand timing sidechannels. The idea is that (for example) if you compare two strings with length 15 you compare all 15 chars, regardless of whether you find that the 3rd char is already different.

You only need to be consistent with yourself. Stepping is completely irrelevant here.
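To make the comparison example above concrete, here is a minimal sketch in C. The name ct_equal is hypothetical, and the sketch assumes the compiler does not turn the loop back into an early-exit comparison and that XOR and OR take the same time for all inputs on the target CPU, neither of which the C standard promises.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the two len-byte buffers are equal,
 * 0 otherwise. Every byte is examined no matter where the first
 * mismatch occurs, so the loop count depends only on len. */
int ct_equal(const void *a, const void *b, size_t len)
{
    /* volatile is a common (not guaranteed) way to discourage the
     * compiler from short-circuiting the loop. */
    const volatile unsigned char *pa = a;
    const volatile unsigned char *pb = b;
    unsigned char diff = 0;

    for (size_t i = 0; i < len; i++) {
        /* Accumulate any differing bits instead of branching on them. */
        diff |= (unsigned char)(pa[i] ^ pb[i]);
    }

    /* Map diff == 0 to 1 and any nonzero diff to 0 without a branch:
     * (0 - 1) wraps to all ones, while (1..255) - 1 stays below 256. */
    return (int)(1u & (((unsigned int)diff - 1u) >> 8));
}
```

With the 15-character case, ct_equal(x, y, 15) reads all 15 bytes whether the strings differ at the 3rd character or not at all. Whether the instructions underneath are themselves constant-time is a separate question that source code alone cannot settle.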

An instruction that is constant time in one CPU may vary its time based on input in the next version of the CPU. That could still provide a timing channel in your example of a comparison if the comparison instruction finishes faster if, say, the leftmost bit is a mismatch.
>I think you misunderstand timing sidechannels.

I don't :) Basically you want all the code branches to result in similar (same) timings. Based on the CPU and the data inputs, those timings would vary, hence assembly alone won't do.

David Beazley, after analyzing 1.5 Tbytes of C++ code, shows in "Some Lessons Learned": C++ -- SUCKS, Assembly code -- ROCKS http://www.youtube.com/watch?v=RZ4Sn-Y7AP8#t=2049
This is what djb is doing using his "qhasm" assembly-like language. He seems to be doing it for performance though, not to work around overly aggressive compilers.

As an alternative, maybe write crypto algorithms in LLVM IR?

Adding an annotation for qhasm where stack variables/registers would be zeroed at the end of the function if they still contained sensitive data would be great.
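For comparison, this is roughly what people do by hand in C today when no such annotation exists. It is only a sketch: secure_zero is a hypothetical name, and the volatile trick is a widely used workaround rather than a guarantee. It only wipes the buffer you point it at, not copies the compiler may have left in registers or spill slots, which is exactly the gap an annotation would close.

```c
#include <stddef.h>

/* Hypothetical helper: overwrite len bytes at buf with zeros.
 * Writing through a volatile pointer is meant to stop the compiler
 * from removing the stores as dead code when buf is about to go
 * out of scope, which a plain memset() call is allowed to suffer. */
void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    while (len--) {
        *p++ = 0;
    }
}
```

Typical use is to call secure_zero(key, sizeof key) as the last thing before returning from a function that held key material on the stack. Where available, platform routines such as explicit_bzero (OpenBSD, FreeBSD, glibc) or the optional C11 Annex K memset_s serve the same purpose.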

What I'd really like to see is qhasm put on GitHub along with the syntax files he or others create. q files aren't really useful without the syntax files they were made for, and without a central repo, custom-made syntaxes will be a mishmash of random decisions and instructions.