by anon4242 2402 days ago
> Nearly every SoC you can buy today has hardware accelerators in it

True, but few are full-featured HW acceleration SoCs. Most support a few operations, for instance AES-ECB and maybe AES-CBC, but if you want AES-CCM or AES-GCM you still need to implement parts of it in software. The HW may be super fast at ECB:ing many blocks of memory, but the setup cost is steep, so when you need to ECB just a single block (for your counter in CCM) it buys you very little over just doing ECB in SW. (Of course, what you do then is set up several counters back to back in a larger block of memory, which is OK because the counters are just increments, and ECB a bunch of blocks at once. Next you need to solve how to do the same to get CBC-MAC with just CBC HW...)
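The last step mentioned above, getting CBC-MAC out of a part that only offers bulk CBC, can be sketched roughly like this. It is a minimal sketch under loud assumptions: `block_encrypt` and `cbc_encrypt` are hypothetical stand-ins, not any vendor's API, and the block function is faked with truncated SHA-256 (NOT AES) so the example runs with only the standard library. The point is only the data flow: a CBC job with an all-zero IV yields the tag as its last ciphertext block, and everything before it is thrown away.

```python
import hashlib

BLOCK = 16  # AES block size in bytes


def block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for one AES block encryption. NOT AES."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Stand-in for one bulk CBC accelerator job; returns the whole ciphertext."""
    assert data and len(data) % BLOCK == 0
    prev, out = iv, []
    for i in range(0, len(data), BLOCK):
        prev = block_encrypt(key, xor_bytes(data[i:i + BLOCK], prev))
        out.append(prev)
    return b"".join(out)


def cbc_mac_via_cbc(key: bytes, data: bytes) -> bytes:
    """CBC-MAC using only the bulk CBC job: keep the last block, discard the rest.
    Costs an output buffer as large as the message."""
    return cbc_encrypt(key, bytes(BLOCK), data)[-BLOCK:]


def cbc_mac_chained(key: bytes, data: bytes) -> bytes:
    """Same tag computed one block at a time with only 16 bytes of state."""
    assert data and len(data) % BLOCK == 0
    state = bytes(BLOCK)
    for i in range(0, len(data), BLOCK):
        state = block_encrypt(key, xor_bytes(state, data[i:i + BLOCK]))
    return state
```

The two functions return the same tag; the difference is the scratch memory for ciphertext you never use versus one accelerator setup per block. Padding and the length-encoding first block that CCM puts in front of the message are left out here.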

This is just moving the goalposts. First it was "crypto accelerators are rare because ITAR", now it's "crypto accelerators are rare because they don't buy you much". Neither is true.

Crypto accelerators are extremely common, including those that implement full cryptosystems or even complete protocols. Nearly every wireless part will have them (especially for CCMP), as well as basically every modern+common consumer device SoC (eg, all Qualcomm, Samsung, Apple, AMD, and Intel parts). Several of these actually have overlapping accelerators for eg memory encryption or wireless (full protocol) and acceleration instructions like those for ARMv8. And they are there because they work.

Setup cost is a thing, but A) is largely paid when you rekey and therefore rarely for most protocols, B) is acceptable in many protocols because you can interleave other operations to prevent port contention without sacrificing throughput, and C) is often buried by the cost of a very small number of blocks, or even just one.

He didn't move the goalposts and usefully expanded on my point. Those devices you're talking about notably adhere to other external standards and are not typically user reprogrammable (where user is the integrator). Also important is that I would not consider them secure in general due to the standards they implement. You also certainly realize that their power consumption, when present, massively dwarfs that of the type of processor we were first discussing?

By the time you get to the ARMv8 accelerators, yes, you're going to exactly the same place I was arguing we should go with my original comment. There's actually a number of primitives that could be reused for various systems.

The original claim was that these parts were rare because of ITAR. They aren't rare, and ITAR doesn't have much to do with where they're present or absent. Shifting the argument to a different point about a specific accelerator or specific class of parts is exactly as I said: moving the goalposts.

The question of whether they're user programmable or not is nearer to the mark because EAR cares about it, but it still doesn't present a formidable barrier-- at least, I've been shipping parts with crypto accelerators at various levels of user configurability for a long time, and so has everybody else.

> Setup cost is a thing, but A) is largely paid when you rekey

Well, it depends on the crypto HW. Some HW is designed for "throughput", which is completely useless for ECB but looks good on specs ("Our HW AES 10MB/s!"). So you set it up with src, dest and key pretty much as you set up your typical DMA transfer, only you almost never want to encrypt more than 16 bytes at a time with ECB, so it's mostly wasted.

> consumer device SoC (eg, all Qualcomm, Samsung, Apple, AMD, and Intel parts)

We are not all so fortunate that we get to work with such powerful SoCs. In my job it's mostly small embedded MPUs.

> B) is acceptable in many protocols

I think we are talking past each other here. I haven't even gotten to the protocol part yet. In order to support a wireless and/or network protocol you will need better building blocks than AES-ECB. You need AES-GCM (or at least AES-CCM). Not to mention ECDSA or RSA(>=3072)...

I'm super confused. Let's back up a step.

Most accelerators come in one of a few flavors:

1/ They implement the expensive parts of a primitive for you and let you chain them together. This is how AES-NI and the ARMv8 crypto extensions work. Performance for these is generally measured in terms of cycle latency, or with a reference piece of software in cycles per byte. Common values for cycles per byte are anywhere from about 0.2 to 30. Much higher than that and people will start to go look at software as an option. You tend to see these on beefy systems with out-of-order cores.

2/ They implement a primitive for you, eg AES-ECB or SHA256, or more rarely AES-GCM and similar. These can then be chained together as with the above to build even higher level primitives like AES-CTR or AES-CCM, or they can be used as-is. These are usually found on micros as additional selling points, and therefore show up just above the bottom of most manufacturers' product lines as an upsell. These are typically measured in something like MB/s throughput, and I assume they're what you're focused on.

3/ They implement a full protocol, like TLS, CCMP, or secure boot. These show up on things that might more properly deserve the term SoC rather than microcontroller, largely because they tend to be attached to high-speed I/O. They generally aren't measured for cryptographic performance but rather for the performance of the implemented protocol.

In my mind, all three of these are using crypto accelerators. Taken together it is extremely common that a part will have one or more of these, and I'm not sure if we're still disagreeing on one or both of those points.

Regarding ECB, I don't know what you mean. Almost nobody uses ECB alone (thank goodness). Even if they have an accelerator for it, it's usually used to implement something like CTR with some software to glue it together (maybe with yet more glue on top to do GCM). In that way, those accelerators act like a just-barely-higher-level version of the first type, and if what you have is the first type, of course you'll do that no matter what. This is still an accelerated implementation, it's just not 100% done in the accelerator. Of course, if you're doing that you're very often encrypting more than a block at a time. And because it's quite rare that you will be performance bottlenecked on a small infrequent operation in any context, you generally only do the work to turn on the accelerators when you care about that.
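The CTR glue described above can be sketched as follows. This is a minimal sketch, not a real driver: `ecb_encrypt_blocks` is a hypothetical stand-in for a single bulk ECB accelerator job, faked here with truncated SHA-256 (NOT AES) so it runs with only the standard library, and the 12-byte nonce plus 4-byte big-endian counter layout is an assumption as well.

```python
import hashlib

BLOCK = 16  # AES block size in bytes


def ecb_encrypt_blocks(key: bytes, data: bytes) -> bytes:
    """Hypothetical stand-in for ONE bulk ECB accelerator job. NOT AES."""
    assert len(data) % BLOCK == 0
    return b"".join(
        hashlib.sha256(key + data[i:i + BLOCK]).digest()[:BLOCK]
        for i in range(0, len(data), BLOCK)
    )


def ctr_keystream(key: bytes, nonce: bytes, first_counter: int, n_blocks: int) -> bytes:
    """Lay out consecutive counter blocks in one buffer, then encrypt it in one job."""
    assert len(nonce) == 12
    counters = b"".join(
        nonce + (first_counter + i).to_bytes(4, "big") for i in range(n_blocks)
    )
    return ecb_encrypt_blocks(key, counters)


def ctr_xor(key: bytes, nonce: bytes, data: bytes, first_counter: int = 0) -> bytes:
    """Encrypt or decrypt: the XOR with the keystream is the software glue."""
    n_blocks = (len(data) + BLOCK - 1) // BLOCK
    stream = ctr_keystream(key, nonce, first_counter, n_blocks)
    return bytes(d ^ s for d, s in zip(data, stream))
```

Only the forward block operation is needed, so the accelerator sees one job per batch of counters; counter handling and the XOR stay in software.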

Regarding working on MCUs, I agree there's a minimum size past which you don't get crypto primitives, but overall I don't think characterizing those parts as modern SoCs is terribly accurate (which was my claim).

Regarding needing better building blocks than ECB for a protocol... well, no, not necessarily. AES-NI doesn't even give you a full AES primitive, and yet it's extremely widely used.

Yes, my main experience is with 2), and these are pretty "modern" parts (as in recently released MCUs) that support AES-ECB (and maybe a few more modes) in HW. These are not ARMv8 but Cortex-M level MCUs.

The problem and the point I'm trying to make is that a few platforms implement their ECB support in such a way as to make it almost useless as a building block. They do not do it as processor instructions the way it's done in x86 (the right way IMHO); instead it's implemented in a separate co-processor that you program in a similar manner to how you set up a typical DMA transfer. If you aim to encrypt 1KB or more, the setup cost for this is negligible and you can get a comparatively good speed. However, as we both agree, there are very few cases (if any) where you _actually_ want to run ECB over 1KB at a time. When you want to build something like CTR (or CBC), what you need is a fast way to ECB _a single_ AES block (i.e. 16 bytes). With this kind of solution, setting up the co-processor eats up almost all the gains won by doing ECB in HW compared to doing ECB in SW, because the cost of the setup (it's I/O after all) comes close to the cost of a SW-only ECB of 16 bytes.
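To put rough numbers on that, here is a toy cost model. Every cycle count in it is a made-up placeholder chosen only so that the setup cost is close to one software block, as described above; none of it is measured on a real part, and the function names are purely illustrative.

```python
# Toy cost model for offloading AES-ECB to a DMA-style co-processor.
# All cycle counts are illustrative assumptions, not measurements.

def hw_cycles(n_blocks: int, setup_cycles: int, hw_cycles_per_block: int) -> int:
    """Total cycles for one co-processor job: fixed setup plus per-block work."""
    return setup_cycles + n_blocks * hw_cycles_per_block


def sw_cycles(n_blocks: int, sw_cycles_per_block: int) -> int:
    """Total cycles for a software-only ECB over n_blocks blocks."""
    return n_blocks * sw_cycles_per_block


def break_even_blocks(setup_cycles: int, hw_cycles_per_block: int,
                      sw_cycles_per_block: int):
    """Smallest block count at which the co-processor is strictly faster, or None."""
    gain = sw_cycles_per_block - hw_cycles_per_block
    if gain <= 0:
        return None  # the hardware never wins per block
    return setup_cycles // gain + 1


# Hypothetical part: setup costs almost as much as one software block.
SETUP, HW_PER_BLOCK, SW_PER_BLOCK = 900, 20, 1000

one_block_speedup = sw_cycles(1, SW_PER_BLOCK) / hw_cycles(1, SETUP, HW_PER_BLOCK)
one_kb_speedup = sw_cycles(64, SW_PER_BLOCK) / hw_cycles(64, SETUP, HW_PER_BLOCK)
```

With these placeholder numbers a single block costs 920 cycles in HW against 1000 in SW, while a 64-block (1KB) job costs 2180 against 64000. That gap between the single-block case and the bulk case is the whole argument.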

Hmm? With CTR you usually just want to fill a long buffer with the appropriate counters and then shove it all through the accelerator. The resulting stream can then be used until exhausted by whatever higher level primitive you're working with. Obviously there's a trade-off in sizing the buffer correctly, but dozens of blocks would be more typical than one.

Yes, and that is what I said in my very first post in this thread. Still, you need to handle the counters in SW, do the XORs in SW (unless you have some HW that does that for you as well), and then if you want CCM you need to solve CBC-MAC (maybe you have CBC in HW, but then there's the memory trade-off again). If you want GCM you need 128-bit multiplications in GF(2^128) for GHASH (Cortex-M MCUs have no instruction for those). So either way you end up doing pretty substantial parts of it in SW, which limits the usability.
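For a sense of what the GCM part means in software: the multiplication GHASH needs is a carry-less multiply in GF(2^128), not an ordinary integer multiply. A minimal bit-by-bit sketch following the shift-and-reduce algorithm in NIST SP 800-38D is below. It is written for clarity, it is not constant-time, and it is nowhere near how you would write it for speed on an MCU; the names `gf128_mul` and `ghash` are just illustrative.

```python
# Bitwise GF(2^128) multiplication as used by GHASH (NIST SP 800-38D ordering:
# the leftmost bit of a block is the coefficient of x^0).
# Clarity-first sketch: not constant-time, not optimized.

R = 0xE1 << 120  # reduction constant 11100001 || 0^120


def gf128_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks (given as ints) in GF(2^128)."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:  # walk x from its leftmost bit
            z ^= v
        # multiply v by the field element "x": shift right, reduce on carry-out
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash(h: int, blocks) -> int:
    """GHASH core loop: Y_i = (Y_{i-1} XOR X_i) * H over 16-byte blocks."""
    y = 0
    for block in blocks:
        y = gf128_mul(y ^ int.from_bytes(block, "big"), h)
    return y
```

That is 128 conditional XOR-and-shift steps on a 128-bit value per 16-byte block, which on a 32-bit core without a carry-less multiply instruction is real work unless you trade memory for precomputed tables.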