by llogiq 1086 days ago
I sometimes wonder what CPU designers are thinking when they build such weird instructions. They must have some programs in mind that could run faster with them, or else the additional transistors are just dead weight. But then compilers and language runtimes might or might not use those instructions. Add to that the fact that modern CPUs are basically their own compilers (going from "machine code" to microcode) and you have weirdness on top of more weirdness. But perhaps this is just for business's sake: adding more instruction sets to provide a barrier to the competition, because programs using them run faster but are no longer portable to competitors' CPUs.

I once thought "bit shuffling" was a special-purpose use-case you'd only ever need to deinterleave RGBA channels or something. Later I wanted to implement a small look-up table (surely something more common) and realized that it's just a different name for the same operation. (I think "bit shuffling" instructions have evolved to be generic enough that they're just programmable LUTs now.)
Yeah, those vector permute instructions are super useful for both patterns. There are dedicated instructions for some specific permutations (shifting over by a constant number of bytes, and some interleavings) but you can easily end up needing the general case. And of course parallel LUTs are also very useful. Depending on what you're doing you could easily end up with both in the same algorithm.
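
To make the "same operation, two names" point concrete, here is a minimal scalar sketch of a 16-byte shuffle, modeled on the documented behavior of x86 `PSHUFB`. The function name `byte_shuffle` and the sample data are made up for illustration; real code would use the hardware instruction through intrinsics rather than a Python loop.

```python
def byte_shuffle(table, indices):
    """Scalar model of a 16-byte shuffle in the style of x86 PSHUFB.

    Each output byte is table[index & 0x0F], or 0 when the index byte
    has its high bit set.
    """
    assert len(table) == 16 and len(indices) == 16
    return bytes(0 if i & 0x80 else table[i & 0x0F] for i in indices)


# Use 1: permutation. Deinterleave eight (I, Q) byte pairs so that all
# I bytes come first and all Q bytes follow.
iq = bytes([10, 20, 11, 21, 12, 22, 13, 23, 14, 24, 15, 25, 16, 26, 17, 27])
deinterleave = bytes([0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15])
planar = byte_shuffle(iq, deinterleave)

# Use 2: lookup table. The "data" operand is now a fixed 16-entry table
# (the popcount of every 4-bit value) and the "index" operand is the input.
nibble_popcount = bytes(bin(n).count("1") for n in range(16))
nibbles = bytes([0x0, 0x1, 0x3, 0x7, 0xF, 0x8, 0xA, 0x5] + [0] * 8)
counts = byte_shuffle(nibble_popcount, nibbles)
```

The only difference between the two uses is which operand is treated as constant: a fixed index vector gives a permutation, a fixed table gives sixteen parallel lookups.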
Good for software defined radio too because real and imaginary are interleaved.
> because real and imaginary are interleaved.

And so it is in life

Yeah, I often talk about the "phase" of a schedule being close to 90. Managers are thinking % but I'm thinking degrees.
Took me a while to comprehend that. NOYCE!
I would have never thought to phrase bit-shuffling as deinterleaving, which makes so much more sense to my not-professionally-computer-related eyes and mind.

Realistically though, how likely would a GCC/clang be to emit these instructions when I'm working on some lookup tables, assuming I permit it to use them (e.g. via `-march=native` on a machine that supports the extension)? My gut feeling would be that unless I specifically make sure to structure my code to be as close to the semantics of the instructions as possible, these instructions would never ever be emitted. Or has the world of compiler optimizers advanced enough that rewriting that is commonplace now?

This may surprise you but large CPU buyers go to Intel and ask for ISA extensions, and Intel implements them. This is how BMI came to exist. The fact that you do not see these instructions exploited by your vanilla Debian box says nothing about warehouse-scale datacenter operators.
Your vanilla Debian box uses BMI2 whenever you do a string comparison, unless you are on a decade old CPU [^0].

The "strange" instructions are actually not that niche, it's just that usage tends to be "indirect" and therefore people don't notice.

[^0] E.g. https://xoranth.net/memcmp-avx2
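
As a rough illustration of how bit-manipulation instructions end up inside string routines, the sketch below models one common pattern: compare a chunk of bytes, collapse the result into a bitmask, and use a trailing-zero count to locate the first difference. This is an assumption-laden scalar model, not glibc's code; the function name `first_mismatch` and the 32-byte chunk width are chosen for illustration, and it does not show which specific BMI or BMI2 instructions any particular libc uses.

```python
def first_mismatch(a, b, width=32):
    """Return the index of the first differing byte of a and b, or -1."""
    assert len(a) == len(b)
    for base in range(0, len(a), width):
        chunk_a = a[base:base + width]
        chunk_b = b[base:base + width]
        # Models "vector compare, then move-mask": bit i is set when
        # byte i of the two chunks differs.
        diff_mask = 0
        for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
            if x != y:
                diff_mask |= 1 << i
        if diff_mask:
            # Trailing-zero count: isolate the lowest set bit and take
            # its position (what a TZCNT-style instruction computes).
            tz = (diff_mask & -diff_mask).bit_length() - 1
            return base + tz
    return -1
```

In a vectorized implementation the inner loop is a single compare plus a mask extraction, so the bit-scan step is what turns "some byte differs" into "this byte differs".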

Yes. AVX-512 support is also implemented by simdjson. JSON string parsing, definitely not a thing anybody does. /s

Also super great for emulation, and anyone else who does a lot of bulk bit-twiddling.

The whole discourse has become super weird (up to and including Linus himself ;) because of Intel's 10nm delays. For 5 years the only AVX-512 products were 14nm-based Intel server chips; then it came only to laptops for another couple of years, and then to a single terrible generation of desktop parts that nobody bought. With AMD launching super competitive (usually leading) products in those segments, obviously there wasn't a whole lot of real, consistent adoption in software. And what adoption there was, was complicated by the fact that the largest adopter (the server market) had to drop clocks massively and even pause processing to allow voltage to swing up enough, because these were 14nm products running a feature that was really aimed at 10nm and beyond. And then Intel yanked it out of all the desktop and laptop chips and seems poised to just ignore it for another 5 years.

Everyone just decided that because it wasn't getting adopted that it was inherently useless, up to and including Linus himself. But it wasn't getting adopted because it was a complete mess on the Intel side and AMD didn't even support it, so why bother?

The AVX-512 story is inextricably bound up in the 10nm delays and the organizational problems that have plagued Intel ever since. It's such a great thing that AMD didn't buy into the naysaying.

Famously, popcount was a common request from intelligence agencies, at least as far back as the DEC Alpha.
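
For anyone who hasn't met it, popcount just counts the set bits in a word. A minimal sketch follows, with the Hamming distance between two bit strings as one typical use; the helper names are illustrative, and the choice of Hamming distance as the example is an assumption, not a claim about what any agency used it for.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    # XOR leaves a 1 exactly where the inputs disagree.
    return popcount(a ^ b)
```

With a hardware instruction the count is a single operation per machine word, which is why it matters for code that compares large numbers of bit vectors.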
In Intel's case they were thinking about turning the CPU into a GPU, but failed at that several times; AVX is what is left from Larrabee.
AVX and AVX2 are pretty awful because of lane-crossing limitations, but AVX512 is actually really nice and feels like a real programmer designed it rather than an electrical engineer.
FWIW, Michael Abrash [1] was at Intel when Larrabee (the AVX512 predecessor) was being developed and apparently [2] he contributed to the ISA design.

[1] https://en.wikipedia.org/wiki/Michael_Abrash
[2] https://www.anandtech.com/show/2580/9

Yeah — my favorite instructions he added were `fmad233` and `faddsets`; the former instruction essentially bootstraps the line-equation for the mask-generation for rasterization, and the latter lets you 'step' the intersection. You could plumb the valid mask through and get the logical intersection "for free". This let us compute the covering mask in 9 + N instructions for N+1 4x4 tiles. We optimized tile load-store to work in 16x16 chunks, so valid mask generation came to just 24 cycles. It was my argument that using Boustrophedon order and just blasting the tile (rather than quad-tree descent like he designed) is what convinced him to let me work with RAD & do the non-polygon path for LRB.
This is not just in your head.

Most Intel ISA extensions come from either customers asking for specific instructions, or from Intel engineers (from the hardware side) proposing reasonable extensions to what already exists.

LRBni, which eventually morphed into AVX-512, was developed by a team mostly consisting of programmers without long ties to Intel hw side, as a greenfield project to make an entirely new vector ISA that should be good from the standpoint of a programmer. I strongly feel that they have succeeded, and AVX-512 is transformative when compared to all previous Intel vector extensions.

The downside is that as they had much less input and restraint from the hw side, it's kind of expensive to implement, especially in small cores. Which directly led to its current market position.

See also: https://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4... which describes some of that history.
I think back to the PlayStation 2 and 3 or the N64, which all took developers years to fully utilize the capabilities of the hardware. Yet the hardware engineers must have known how to do it long before the software side totally figured it out. After years of SW development it's still just magic sand to me.
It's also difficult for both SW teams' and HW teams' visions to converge, even under the same company, such that the product can be put to use to maximize performance and programmability WRT another, already-established product.

Different constraints and challenges on both sides of the aisle give rise to compromises that end up lowering performance or ease of use. This is one area where great authority over the entire stack gives you a lot of leeway, e.g. Apple designing the Metal API and the HW for it.

I don't know. If the HW engineers knew something, wouldn't Sony or Nintendo have had them lend a hand on 1st-party titles?
I can tell you that for 3rd party titles back in the PS2/N64 days, the HW engineers handed you a spec manual explaining such useful things as "Bit 7 at address 0x70002048 toggle RFTAG mode on the MDEC". This was great when I went to write a VU emulator because no VU debugger was available. But, not so great when trying to figure out how to use the beast effectively. If you google "ps2 sdk docs" you can still find them after a while. If it's a doc with examples of how to do anything, it's from the software team at Sony Europe.

Sony Japan's documentation for how to use a mouse & keyboard on the PS2 was literally just the URL "https://www.usb.org/document-library/usb-20-specification". Eventually, they provided a binary-only keyboard library that everyone complained was buggy, but actually just had documentation that was so brief it was easily misunderstood. After black-box testing it for an hour it was clear it worked fine, just not how anyone would expect it to.

Many years ago I made a tiny stir online by writing a stream-of-consciousness report of the experience of dealing with stuff like this for a decade. https://venturebeat.com/games/what-is-making-games-like-for-...

Actually, once you have these tools at your disposal and know what they look like, it's hard not to find places to use them. String processing, bitset indices in data structures, etc. all have places where they naturally fit.
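
One example of the "bitset indices in data structures" case is a sparse array that stores only occupied slots and uses a popcount over a bitmap to find each slot's position in the dense storage. The sketch below is a hypothetical illustration (the class name `SparseArray` and its methods are invented here), not code from any particular library.

```python
class SparseArray:
    """Logical slots numbered from 0; only occupied slots use storage."""

    def __init__(self):
        self.bitmap = 0   # bit i is set when slot i is occupied
        self.items = []   # values of occupied slots, in slot order

    def _rank(self, slot):
        # Occupied slots below `slot` = index into the dense list.
        return bin(self.bitmap & ((1 << slot) - 1)).count("1")

    def set(self, slot, value):
        pos = self._rank(slot)
        if (self.bitmap >> slot) & 1:
            self.items[pos] = value          # overwrite existing slot
        else:
            self.items.insert(pos, value)    # new slot keeps order
            self.bitmap |= 1 << slot

    def get(self, slot, default=None):
        if not (self.bitmap >> slot) & 1:
            return default
        return self.items[self._rank(slot)]
```

The `_rank` step is a mask followed by a popcount, which is exactly the kind of operation that becomes cheap once the hardware has an instruction for it.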

I have a data structure library (in Rust) where I would love to have these. The problem is that AVX-512 just isn't common enough to rely on it yet, and I don't even have it on my workstation CPU (Radeon 6850, from just last year).

But in particular whether they had something in mind, I suspect Intel was thinking about video codecs and containers for a lot of these. If you read through the specs for them, you will find all sorts of places which call for things like this.

But yes, whether compiler developers can make good use of these is questionable. They are really for specialized optimization workflows.

I think the "shuffle" instructions are a generalization of "s-box", from various cryptographic algorithms?

> They must have some programs in mind that could be run faster with them

Yeah, all new instructions are built with some workload in mind. The workload may be named in the architecture manual; when it isn't, you have to reverse-engineer it from the press releases.