Hacker News new | ask | show | jobs
by midtake 814 days ago
> The threat resides in the chips’ data memory-dependent prefetcher, a hardware optimization that predicts the memory addresses of data that running code is likely to access in the near future.

Are we nearing any sort of consensus that any form of speculation is bad? Is there a fundamentally secure way to do it?

9 comments

My personal opinion is that we should solve it the opposite way - don't run untrusted code in the first place (with rare exceptions like dedicating an entire cpu core and a region of memory to a virtual machine, etc). Speculation is one of many side channel attacks, who knows what kind of crazy RF-based exploits are out there. AFAIK we still haven't fully solved rowhammer.

I think for "normal" users the main risk is JavaScript, which can (kind of) be mitigated in software without affecting the rest of the system, so no one really cares about these attacks. But the fundamental abstraction leak between physics and programming will always be there.

OpenSSL is "trusted code". The problem isn't that OpenSSL is doing something nefarious, but that the CPU breaks assumptions it makes about how one can write constant-time algorithms.
But the problem is not OpenSSL, it's that malicious code on the system can read the keys OpenSSL is using. If you don't run malicious code on the same system as OpenSSL, this attack goes away - there's no way to run a CPU timing attack from a different network.
"Remote Timing Attacks Are Practical" (2003):

https://crypto.stanford.edu/~dabo/papers/ssl-timing.pdf

That's on the local network. I remember a paper doing this over the internet, but couldn't find it. A similar one over the internet:

https://www.usenix.org/conference/usenixsecurity20/presentat...

But in practice it's going to be malicious JS running in your browser: https://www.schneier.com/blog/archives/2021/03/exploiting-sp...

From what I understand, the problem is that algorithms which should be constant time are actually taking a variable amount of time depending on the data. If I have some server software whose security depends on those constant-time algorithms actually being constant time, why shouldn't this be exploitable over the network?
The mechanism is described in the FAQ here: https://gofetch.fail

> The GoFetch attack is based on a CPU feature called data memory-dependent prefetcher (DMP), which is present in the latest Apple processors. We reverse-engineered DMPs on Apple m-series CPUs and found that the DMP activates (and attempts to dereference) data loaded from memory that "looks like" a pointer. This explicitly violates a requirement of the constant-time programming paradigm, which forbids mixing data and memory access patterns.

> To exploit the DMP, we craft chosen inputs to cryptographic operations, in a way where pointer-like values only appear if we have correctly guessed some bits of the secret key. We verify these guesses by monitoring whether the DMP performs a dereference through cache-timing analysis. Once we make a correct guess, we proceed to guess the next batch of key bits. Using this approach, we show end-to-end key extraction attacks on popular constant-time implementations of classical (OpenSSL Diffie-Hellman Key Exchange, Go RSA decryption) and post-quantum cryptography (CRYSTALS-Kyber and CRYSTALS-Dilithium).

It sounds like the OpenSSL code is still constant time (it doesn't expect pointers in the intermediate values, to OpenSSL it is just data, not something it will ever dereference) but the attacker-controlled "monitoring" code's runtime changes based on whether the DMP ran or not.

If that's right, then the attacker still needs to run their monitoring code on the target, it isn't sufficient to just run OpenSSL etc itself.

Edit: it is more explicit in the paper, they assume that OpenSSL (the victim) is still constant time:

> In this paper we assume a typical microarchitectural attack scenario, where the victim and attacker have two different processes co-located on the same machine.

> For our cryptographic attacks, we assume the attacker runs unprivileged code and is able to interact with the victim via nominal software interfaces, triggering it to perform private key operations. Next, we assume that the victim is constant-time software that does not exhibit any (known) microarchitectural side-channel leakage. Finally, we assume that the attacker and the victim do not share memory, but that the attacker can monitor any microarchitectural side channels available to it, e.g., cache latency.

Thanks for the details, I concluded more or less the same: https://news.ycombinator.com/item?id=39789307
Because the timing difference is extraordinarily subtle, far too small to measure compared to regular network timing noise.
Noise doesn't protect against that, statistics is a thing.

But I think you might overall be right that this requires two colocated processes: the paper talks about how the DMP breaks assumptions made by "the constant-time programming model", and I took this to mean that constant-time algorithms aren't constant-time any more. Reading more closely, I think maybe the issue is that "the constant-time programming model" was also assumed to make secrets safe from cache timing side-channels leaking the secrets to other processes on the same CPU, and this seems like it might be the assumption that's broken by the DMP...

I'll have to read more, I've just skimmed the abstract and introduction so far.

That usually just means that you need to collect more data to filter out the network noise.
Anything you execute on behalf of a user, even if the binaries are trusted, can effectively become untrusted code.
So, CPUs or cores (and maybe RAM) dedicated to run only trusted and only untrusted code?

Examples (I'm running Debian)

The kernel, the X11 server, terminal, ssh, bash, anything coming from the official Debian repos including the password manager: in the trusted environment.

browsers, electron apps, anything installed from unofficial repos or language and package managers (npm, rvm, asdf, etc): in the untrusted environment.

It reminds me of mainframes and their redundant and compartmentalized hardware architecture.

> terminal, ssh, bash

> X11 server

Those can very easily execute untrusted code.

Yes, but it can be countered by pinning "random-script-from-the-internet.sh" to the untrusted environment. The fork/exec inside bash (or whatever bash is using now) should take care or that, or the kernel itself which is probably a better option. bash + ls -> trusted because ls is in some way marked as trusted, bash + random-script -> untrusted, possibly by default.
Well it makes no sense to worry about side channel attacks if you don't have isolation in the first place, so there is an implicit assumption that you have a sandboxing layer like VM/container/browser (or the built in unix user separation) which don't care about terminals or X11 (usually a separate X server is used which is running inside the sandbox context).
the author(s) of the article, completely ignore the cpu itself can be patched. there is microcode underneath the ARM instructions for such scenarios. Look at intel, there has been undocumented microcode for decades I beleive these articles are just hype for street cred.
From what I have read, the microcode on M-series chips is NOT updatable. If this is the case, tsk tsk Apple.
I thought microcode was an artifact of CISC architectures, while the Arm is a RISC architecture.
So you browse the Internet with JavaScript turned off? You're a bigger person than I am ;).

The risk here is that there are more individuals with the skills to take this type of attack and bring it to a browser near you.

One apps data is another apps code.

I use uMatrix and allow first party JS. When some sites break I open the matrix look at what they would like to load and allow one more origin and reload. An example: chatgpt works by allowing JS from the first party domain, *azure.com and oaistatic.com, which looks like something from OpenAI. It would like to load JS from three other domains but it works even if I don't allow them, so there is no need to let that code run.
This is the way.

Unfortunately, I've had no luck getting others to buy into the idea that they should understand this level of detail so they can make these calls. Quite frustrating and depressing, since companies will relentlessly exploit their indifference.

If other people buy into this idea, then every site will begin proxying third-party javascript.

If the only way to get trackers on the average person is to serve it from the same first-party domain, or to bundle it in with the giant 'app.js', you better believe they'll do that.

Right now, the fact that only a small fraction of people run adblockers, and an even smaller fraction block javascript, is what allows it to work.

Not many developers do that. The general population won't even understand what they are looking at. If you are good at teaching you can give them an idea and a few of them maybe will do it, but the time invested in the allow/reload loop is probably too much unless one is really committed.

In my case when every attempt fails I know it could be the side effect of some other privacy add on. If it's a random blog/news, that's the end of it. If I really have to use that site I open Chrome, do what I have to do, close it. Of course given a choice I pick sites that work well with only JS from a few inevitable sources.

Indeed, the scary thing is that there is no theoretical limit to how sophisticated a side channel attack could be. Imagine all the timing data that could in theory be gathered from html layout engines and css, even without javascript, just by resource loading times.

I would like to salute my shitty ISP for keeping me safe from timing attacks using their unreliable network infrastructure.

This attack is now why browsers segment caching into a combination of requesting domain and asset URL, rather than just caching the asset on its own. It slows down for example Google Fonts, but means that a site can’t check to see that you’ve visited a competitor by timing an asset load from their site to see whether it’s in the cache.
The entire point of this chip is to keep secrets hidden from local processes.
For cryptographic applications yes. That is why people have spent significant effort to implement constant time algorithms to replace standard math and bitwise operations.

At the hardware level any optimizations that change performance characteristics locally (how long the crypto operation directly takes) or non locally (in this case the secrets leak via observation of cache timings in the attacker's untrusted code) are unsafe.

Intel DMPs already have a flag to turn off the same behavior that was exploited on the M1/M2. Which may suggest that the risk of this type of optimization was understood previously.

Mixing crypto operations with general purpose computation and memory accesses is a fragile balance. Where possible try utilizing HSMs, yubikeys, secure enclaves - any specialized hardware that has been hardened to protect key material.

> Where possible try utilizing HSMs, yubikeys, secure enclaves - any specialized hardware that has been hardened to protect key material.

Are there any circumstances where this hardware is accessible in the browser? As I understand, it is not generally available (if at all) for any cryptography you might want to do in the browser.

The browser doesn’t have direct access for JavaScript but can use those for supported features. This already happens for FIDO/WebAuth using a hardware root such as a Yubikey or Secure Enclave, and I believe SubtleCrypto uses hardware acceleration in some cases but I don’t remember if it makes it easy to know that.

One thing to remember here, though, is that there isn’t anything special about key material in this attack other than it being a high-value target. If we move all crypto to purpose-made hardware, someone could just start trying to target the messages to/from the crypto system.

> If we move all crypto to purpose-made hardware, someone could just start trying to target the messages to/from the crypto system.

This is one of the technical advantages of a blockchain-based system. As long as the keys are protected and signatures are generated in a secure environment, then the content of the message doesn't need to be secret to be secure.

It's not a solution to situations where privacy is desired, but if the reason for secrecy is simply to ensure that transactions are properly authorized (by avoiding the leakage of passwords and session information) then keeping the signature process secure should be sufficient even where general secrecy cannot be maintained.

In the browser is the primary use case for fido keys.
Is the untrusted code observer able to see cache timing implications of fetches to addresses that the MMU considers off limits for the process? This is what keeps surprising more, it does not align well with what I think I know about processors (not much)
> For cryptographic applications yes.

Why only cryptographic applications? What if I'm writing a very sensitive e-mail, for instance?

For this type of attack to work, the algorithm being run needs to be very well understood, and the runtime of the algorithm needs to depend almost entirely on the secret key.

In contrast, the timing of virtually any email operation is not dependent on the contents of the email, other than the size. That is, whether you wrote "my password is hunter2" or "my password is passwor", the timing of any operation running on this email will be identical.

> In contrast, the timing of virtually any email operation is not dependent on the contents of the email, other than the size.

What about spell checkers etc? Or even just whatever runs to figure out where to break the lines?

Perhaps those could be attacked. It's possible though that it's not feasible, that the possible inputs leading to a certain timing signature are just too many to get any data out of it.

Consider that those programs are not making any effort whatsoever to run in constant time, and yet no one has shown any timing attack against them. OpenSSL has taken great pains to have constant execution time, and yet subtle processor features like this still introduce enough time differences to recover the keys.

> It's possible though that it's not feasible, that the possible inputs leading to a certain timing signature are just too many to get any data out of it.

That's plausible, but a very different argument from the original, that read:

> In contrast, the timing of virtually any email operation is not dependent on the contents of the email, other than the size.

I'm imagining the email program using a formatting and rendering engine, both predictable.
Check out https://www.qubes-os.org/ for an operating system that tries to put as many layers of defense as possible between an end user's applications.
It's infuriating that all modern computers have a secure crypto TPM, but you're explicitly not allowed to use it for your own important keys, it's only for securing things against you like the DRM in certain drivers.
“Only for DRM” isn’t accurate.

I’ve been using the TPM 2.0 chip on my ASUS based Linux box to store various keys. Tooling for this on the Linux side has improved significantly [0] and it’s been supported since kernel 3.20 (2015) [1].

How effective this is at improving one’s security posture is another question and it’s probably not a huge security upgrade, but it does mitigate some classes of attack.

I’m curious why you’re saying it’s explicitly not allowed? At least for standard TPM 1.2/2.0 chips, that isn’t the case.

- [0] https://wiki.archlinux.org/title/Trusted_Platform_Module

- [1] https://www.phoronix.com/news/Linux-3.20-TPM-2.0-Security

All Apple devices allow you to use it for important keys:

https://developer.apple.com/documentation/security/certifica...

Android has user visible APIs to interact with secure crypto hardware.

https://developer.android.com/privacy-and-security/keystore#...

But I agree in general with your point

Is that not what e.g. this project allows you to do?

https://github.com/tpm2-software/tpm2-pkcs11

Absolutely critical for performance, though. If there's a way out of this it might have to be better virtualization.
It is bad, it is required for performance reasons.

The questions is what could be the solution going forward, which is going to be a huge change anyway. I do not see a way out of this with our current architectures.

Neither block ciphers, nor stream ciphers, nor common public key algorithms (RSA, Ed25519) need or even profit from this. They just need fast access to the register-register math, maybe loop sequentially through all members of a fixed sized array a fixed number of times. The only thing those implementing such algorithms would probably like having is a few kiB of safe to access scratchpad memory for code and data. On entry to the crypto code copy the code and data there, enable a constant time mode for compute instructions and run the algorithm at full speed without worrying.
Should we make a petition for apple to make lockdown-like mode that disables speculative execution? I'll sign up for that.
From the gofetch website, apple has already done this for m3 chips.
Speculation is required to get even close to the single-thread throughput expected of any modern CPU for anything worth running on a general purpose CPU. The problem is that there is no formal specified model to reason over side-channels not even just for timing side-channels. Most ISAs doesn't specify the time it takes execute instructions.

Lets assume I multiply two 64bit numbers. The CPU could just do it the same way every time and the worst-case has 4 cycles latency. It may also track if one of the factors is zero and dynamically replace the multiplication with a zeroing idiom that "executes" in 0 cycles when the scheduler learns that that either input is zero as an extreme example.

Less radical it could track if the upper halves of registers are zero to fast-path smaller multiplications (e.g. 32bit x 32bit -> 64bit) and shave off a cycle. IIRC some PowerPC chips did that, but for the second argument only. The ISA allowed it.

A realistic example are CPUs with data-dependent latency shift/rotate instructions. What do you do if an ISA doesn't specify if shift/rotate is constant time, but every implementation of it so far did it in constant time? Do you slowly emulate it out of paranoia that a future implementation may have variable latency? An other real-world example of this would be FPUs that have higher latency for denormalised numbers its just not relevant to (most) cryptographic algorithms.

How the fuck are you supposed to build anything secure, useful, and fast enough from that?

Isn’t the issue more akin to use after free? If the instructions and memory were wiped on prediction path failure wouldn’t that help?
nope. the side effects (which can be access times in case of non-failure, or voltage changes, or or or...) would still happen
This isn't speculative execution.

EDIT: The downvotes make no sense. What this bug has in common with Spectre is that it has to do with cache timing. But in Spectre, the cache is affected by speculative execution; with "GoFetch", it's the pre-fetcher pre-fetching things which look like memory addresses. Pre-fetching is not speculative execution.

Unless the comment was edited, the person you are responded to did not use the phrase "speculative execution"
You're right, it says speculation. I read it as speculative execution, I have never ever heard the term "speculation" applied to prefetching...

But if they did mean to include pre-fetching in "speculation" then I retract my comment

Perhaps we should do the crypto in constant time, and run all other applications using homomorphic encryption?
FHE is unusably slow even for the simplest operations, and there is no reason to be sure it will ever be fast enough for any normal computing.
Disabling all hardware optimizations becomes an option long before homomorphic encryption becomes an option, performance-wise.