by Xeamek 760 days ago
Why can't we 'simply' get a processor extension to mark data as a pointer, so that the prefetcher would actually know what to fetch?

From my understanding this is what led to the recent 'unpatchable' exploit in Apple's M1, but rather than trying to guess it with some heuristic, why not just give compilers the option to make that optimization?
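For comparison, compilers already expose an explicit software hint. This is not the "mark this field as a pointer" extension asked for above, just the closest thing available today. A minimal sketch, assuming GCC or Clang (which provide `__builtin_prefetch`); the `struct node` layout and the function name `sum_with_prefetch` are made up for illustration:

```c
#include <stddef.h>

/* Hypothetical node layout: payload first, then the link. */
struct node {
    long data;
    struct node *next;
};

/* Sum a NULL-terminated list, hinting the next node to the cache before
   working on the current payload. __builtin_prefetch is a GCC/Clang
   builtin; it is only a hint, and the CPU is free to ignore it. */
long sum_with_prefetch(const struct node *curr)
{
    long total = 0;
    while (curr != NULL) {
        /* Second argument 0 = prefetch for read, third argument 3 = high
           temporal locality. Prefetching a NULL address does not fault. */
        __builtin_prefetch(curr->next, 0, 3);
        total += curr->data;
        curr = curr->next;
    }
    return total;
}
```

Note that the hint can only be issued once `curr->next` has itself been loaded, so it runs just a few instructions ahead of the real load.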

3 comments

Apple Silicon does this. If it prefetches something that looks like a pointer, it will also fetch the pointed-to memory. It's a cool feature, and is especially useful for Apple, since their Objective-C collections only store pointers – but it also can re-open the door for certain timing attacks by violating preconditions of constant-time cryptography algorithms.

https://en.wikipedia.org/wiki/GoFetch
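To make the "preconditions" part concrete: constant-time code is written so that neither branches nor memory addresses depend on a secret, on the assumption that the values themselves do not leak. A rough sketch of the usual branch-free conditional swap (the name `ct_cswap` and its signature are hypothetical, not taken from any particular library):

```c
#include <stddef.h>
#include <stdint.h>

/* Swap a[i] and b[i] for every i when swap == 1, and leave both arrays
   alone when swap == 0. The same instructions run and the same addresses
   are touched either way; only the stored values differ. */
void ct_cswap(uint64_t *a, uint64_t *b, size_t n, uint64_t swap)
{
    uint64_t mask = (uint64_t)0 - swap;   /* 0 or all ones */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}
```

Which values end up in `a` and `b` still depends on the secret bit. As I understand it, a prefetcher that dereferences stored values that look like pointers turns those values into memory accesses, which is exactly what this style of code assumes cannot happen.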

I don't think making the CPU issue (likely bogus) pre-fetches for every field in the cache line that's marked as a pointer is really that good an idea. At best, you save a couple of cycles because the fetches are started one or two instructions before the actual load instruction for "loading the linked pointer" is issued. At worst, you keep thrashing your cache by loading data you're not going to read, delaying the fetch of the data you will read.

For everything? Obviously not.

But given all the crazy optimizations modern compilers do, I don't see how marking pointers for more than a couple of them in a row is that crazy.

Because if you're issuing a bogus pre-fetch, you can't cancel it, can you? So that's 90-something cycles during which the fetch for your actual data is delayed. Pointer chasing already strains memory bandwidth; requesting even more data from memory will only make things worse.

And unrolling the loop for traversing linked lists can be done, if you use a sentinel node instead of nullptr to signal the end:

        beqz    a0, .end
    .loop:
        ld      a1, 0(a0)   ; a1 = curr->data
        ld      a0, 8(a0)   ; curr = curr->next
        ; do something with payload in a1 here
        bnez    a0, .loop
    .end:
becomes

        la      s1, sentinel
        beq     a0, s1, .end
        ld      a1, 0(a0)
        ld      a2, 8(a0)
        ld      a3, 0(a2)
        ld      a4, 8(a2)
        ld      a5, 0(a4)
        ld      a6, 8(a4)
        beq     a6, s1, .trail
    .loop:
        ld      t0, 0(a6)
        ld      t1, 8(a6)
        ld      t2, 0(t1)
        ld      t3, 8(t1)
        ld      t4, 0(t3)
        ld      t5, 8(t3)
        ; do something with three payloads in a1, a3, a5 here
        mv      a1, t0
        mv      a2, t1
        mv      a3, t2
        mv      a4, t3
        mv      a5, t4
        mv      a0, a6      ; a0 = first node whose payload is still unprocessed
        mv      a6, t5
        bne     t5, s1, .loop
     .trail:
        ld      a1, 0(a0)
        ld      a0, 8(a0)
        ; do something with payload in a1 here
        bne     a0, s1, .trail
     .end:
As you can see, "ld t3, 8(t1)" comes almost right after "ld t1, 8(a6)", with only the load from 0(t1) in between, so a prefetch won't noticeably help here, and if the address that ends up in t3 is not in the cache, then "ld t5, 8(t3)" will stall no matter what. And moving the speculative loads up in the loop body, ahead of processing the payloads (which takes even more registers, as you can see), somewhat hurts the latency of processing the first three payloads.
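For anyone who would rather read this in C, here is a rough sketch of the same idea. It is not a line-by-line translation of the assembly above: the names `sum_simple` and `sum_unrolled` are made up, "do something with the payload" is replaced by a sum, and the sentinel is assumed to point back to itself so that walking past the end is harmless.

```c
#include <stddef.h>

struct node {
    long data;
    struct node *next;
};

/* Every list ends at this node instead of NULL. Its next field points
   back to itself, so reading "past the end" just stays on the sentinel. */
static struct node sentinel = { 0, &sentinel };

/* One node per iteration. */
long sum_simple(const struct node *curr)
{
    long total = 0;
    while (curr != &sentinel) {
        total += curr->data;
        curr = curr->next;
    }
    return total;
}

/* Three nodes per iteration, with a one-at-a-time loop for the leftovers. */
long sum_unrolled(const struct node *curr)
{
    long total = 0;
    for (;;) {
        /* Walk two links ahead without checking for the end; this is safe
           only because the sentinel links to itself. */
        const struct node *n1 = curr->next;
        const struct node *n2 = n1->next;
        if (n2 == &sentinel)
            break;              /* fewer than three real nodes remain */
        total += curr->data + n1->data + n2->data;
        curr = n2->next;
    }
    while (curr != &sentinel) { /* zero, one, or two trailing nodes */
        total += curr->data;
        curr = curr->next;
    }
    return total;
}
```

Each `->next` load still depends on the one before it, so the unrolled version removes branches but does not let the memory accesses overlap.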

Oh, and if you want to see something really crazy, look at e.g. splitting the branch instruction into prediction and resolution instructions [0].

[0] https://zilles.cs.illinois.edu/papers/branch_vanguard_isca_2...

This was the idea of Itanium. It failed mostly because of economics.

It turns out programmers, or rather their employers, don't really care about using hardware efficiently. They care about shipping things yesterday, because that's how business deals get closed, and making software efficient is secondary to money changing hands. Performance never really matters to business people after the checks are cashed.

Multicore computers have been ubiquitous for more than a decade, yet the overwhelming majority of software built today is single-threaded microservices that spend most of their time serializing and deserializing message payloads.

This is all really to say that most performance is already being left on the table for the majority of what computers are used for today.

I do want to say that I think the Itanic would have fared way, way better in a post-LLVM world where the importance of smart, optimizing compilers is much more valued and understood and language designers actively work hand-in-hand with compiler devs far more often (with much more significant back-and-forth from hardware manufacturers).

I don't think LLVM is particularly good at optimizing VLIW code.

Very good optimizing compilers existed before LLVM. Intel had one specifically for Itanium. It wasn't enough.

Why would LLVM be particularly good at optimizing VLIW code when there's no demand for it to be? You can't believe everything else would remain the same in the hypothetical I posed.

A) Optimizing for VLIW is hard. B) The null hypothesis would be no change.

Given how much of today's computing needs depend on a database query, this is no surprise. Who cares about the microseconds you gain from added efficiency while there's a 100ms db query return in the path?

Apparently Apple and Intel do, since they introduced those changes into their silicon.

Not every pipeline involves a db query.

Where do you think DBs run?

I mean sure, I don't doubt 99% of end-user programmers wouldn't look twice at something like this, but compiler designers probably would care.

And it's not like the companies aren't trying this idea (again, the M1 exploit). But for whatever reason they want to keep CPUs as black-box, perfect abstract machines, even though we know they aren't.