by mdb31 1513 days ago
Cool performance enhancement, with an accompanying implementation in a real-world library (https://github.com/lemire/despacer).

Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested their AVX transistor budget into simply making existing REPB prefixes a lot faster?


AVX-512 is an elegant, powerful, flexible set of masked vector instructions that is useful for many purposes. For example, low-cost neural net inference (https://NN-512.com). To suggest that Intel and AMD should instead make "existing REPB prefixes a lot faster" is missing the big picture. The masked compression instructions (one of which is used in Lemire's article) are endlessly useful, not just for stripping spaces out of a string!
Many people seem to think AVX-512 is just wider AVX, which is a shame.
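
To make the masked-compress idea concrete, here is a minimal scalar sketch in C of what a byte-wise compress does when it is used to strip spaces. It is only a model of the semantics, not Lemire's code and not the AVX-512 instruction itself. The names `compress_block`, `non_space_mask`, and `despace_model` are made up for illustration, and the 64-byte block size is an assumption chosen to mirror one 512-bit register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked compress over one block of up to 64 bytes:
   bytes whose bit is set in `keep` are packed contiguously into dst.
   Returns the number of bytes written. (Hypothetical helper.) */
static size_t compress_block(uint8_t *dst, const uint8_t *src, size_t n,
                             uint64_t keep)
{
    assert(n <= 64);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if ((keep >> i) & 1u) {
            dst[out++] = src[i];
        }
    }
    return out;
}

/* Build the keep-mask: bit i is set when src[i] is not a space.
   In a vector version this would be a single byte-compare producing a mask. */
static uint64_t non_space_mask(const uint8_t *src, size_t n)
{
    assert(n <= 64);
    uint64_t mask = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] != ' ') {
            mask |= (uint64_t)1 << i;
        }
    }
    return mask;
}

/* Remove ' ' from src[0..len), writing the result to dst.
   dst must have room for len bytes. Returns the new length. */
size_t despace_model(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t out = 0;
    for (size_t pos = 0; pos < len; pos += 64) {
        size_t n = (len - pos < 64) ? (len - pos) : 64;
        uint64_t keep = non_space_mask(src + pos, n);
        out += compress_block(dst + out, src + pos, n, keep);
    }
    return out;
}
```

The same mask-then-compress shape works for any "keep the elements matching a predicate" job (filtering, stream compaction), which is the sense in which the compress instructions are general-purpose rather than a string-only trick.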

NN-512 is cool. I think the Go code is pretty ugly but I like the concept of the compiler a lot.

Why is a large speedup from vectors surprising? Considering that the energy required for scheduling/dispatching an instruction on OoO cores dwarfs that of the actual operation (add/mul etc), amortizing over multiple elements (=SIMD) is an obvious win.
Where do I say that the speedup is surprising?

My question is whether Intel investing in AVX-512 is wise, given that:
- Most existing code is not aware of AVX anyway;
- Developers are especially wary of AVX-512, since they expect it to be discontinued soon.

Consequently, wouldn't Intel be better off by using the silicon dedicated to AVX-512 to speed up instruction patterns that are actually used?

AVX-512 is not going to be discontinued. Intel's reticence/struggling with having it on desktop is irritating but it's here to stay on servers for a long time.

Writing code for a specific SIMD instruction set is non-trivial, but most code will get some benefit by being compiled for the right ISA. You don't get the really fancy instructions because the pattern matching in the compiler isn't very intelligent but quite a lot of stuff is going to benefit by magic.

Even without cutting off people whose CPUs lack some AVX level, you can have a fast/slow path fairly easily.

My point is that vector instructions are fundamentally necessary and thus "what does it signal" evaluates to "nothing surprising".

Sure, REP STOSB/MOVSB make for a very compact memset/memcpy, but their performance varies depending on CPU feature flags, so you're going to want multiple codepaths anyway. And vector instructions are vastly more flexible than just those two.
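
As a rough illustration of the "multiple codepaths" point, the sketch below checks the ERMS (Enhanced REP MOVSB/STOSB) feature bit, CPUID leaf 7, subleaf 0, EBX bit 9, and only uses `rep movsb` when it is set. It assumes GCC or Clang on x86 for `<cpuid.h>` and inline assembly; `copy_bytes`, `cpu_has_erms`, and `X86_GNU` are hypothetical names, and the policy of "ERMS means use `rep movsb`" is a simplification, not a tuning recommendation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define X86_GNU 1
#else
#define X86_GNU 0
#endif

/* Returns 1 if CPUID reports ERMS, 0 otherwise (including non-x86 builds). */
int cpu_has_erms(void)
{
#if X86_GNU
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return (int)((ebx >> 9) & 1u);
    }
#endif
    return 0;
}

/* Copies n bytes from src to dst (regions must not overlap).
   Uses `rep movsb` when ERMS is reported, otherwise the libc memcpy.
   A real implementation would cache the CPUID result instead of
   querying it on every call. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    assert((dst != NULL && src != NULL) || n == 0);
#if X86_GNU
    if (cpu_has_erms()) {
        void *d = dst;
        const void *s = src;
        size_t count = n;
        /* RDI/EDI = destination, RSI/ESI = source, RCX/ECX = byte count. */
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(s), "+c"(count)
                         :
                         : "memory");
        return dst;
    }
#endif
    if (n > 0) {
        memcpy(dst, src, n);
    }
    return dst;
}
```

Production memcpy implementations typically also branch on the copy size and on further flags, which is why a single compact `rep movsb` rarely ends up being the whole story.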

Also, I have not met developers who expect AVX-512 to be discontinued (the regrettable ADL situation notwithstanding; that's not a server CPU). AMD is actually adding AVX-512.

> vector instructions are fundamentally necessary

For which percentage of users?

> AMD is actually adding AVX-512

Which is irrelevant to in-market support for that instruction set.

> For which percentage of users?

Anyone using software that benefits from vector instructions. That includes a variety of compression, search, and image processing algorithms. Your JPEG decompression library might be using SSE2 or Neon. All high-end processors have included some form of vector instruction for like 20+ years now. Even the processor in my old eBook reader has the ARM Neon instructions.

Any user who either wants performance or uses a language that can depend on a fast library.
Why would it be irrelevant? Even the paucity of availability isn't really a problem - the big winners here are server users in data centers, not desktops or laptops. How much string parsing and munging is happening while ingesting big datasets right now? If running a specially optimized function set on part of your fleet reduces utilization, that's a direct cost saving you realize. If AMD is then widening that support base, the economics deeply favor expanding usage while you scale up.
Given that Intel's AVX extensions could cause silent failures on servers (which run very high workloads for prolonged periods compared to end-user computers), I'm not sure it would be a big win for servers either: https://arxiv.org/pdf/2102.11245.pdf.
Is it generally possible to convert rep str sequences to AVX? Could the hardware or compiler already be doing this?

AVX is just the SIMD unit. I would argue the transistors were spent on SIMD, and the hitch is simply the best way to send str commands to the SIMD hardware.

Why? IIRC something like 99% of string operations are on 20 chars or less. If you're hitting bottlenecks then optimize.
If you are arguing most string ops have just a few chars and therefore don’t use vectors… why do we need to spend silicon enhancing rep prefix in the first place?