by aperture147 1054 days ago
As an average developer who works on high-level interfaces, I don't really see the benefit of AVX-512. I've heard that some math software (like BLAS) may gain some benefit from AVX instructions, but I've never used it personally. Can you guys please explain?
2 comments

Anything which works on nontrivial datasets, really. As soon as you need to do roughly the same operation on a bunch of data, you can benefit from AVX. Heck, this could be as simple as determining the length of a string!

The main benefit of AVX-512 is that the CPU gained support for "masking". Traditionally, an instruction had to execute on all data elements of a vector. Something like "round all even values downwards, round all odd values upwards" becomes awkward: if you want to do that inside an AVX pipeline, you either compute both results for every element and blend them, or deconstruct the vector into individual elements, do the operation on each of them individually, and reconstruct the vector. This really sucks. With masking, you get a special mask operand which specifies which elements the operation should apply to. So you could just create a mask of the even values, use that mask for a round-down, invert the mask, and use it for a round-up. That's a significant speedup!
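
A minimal sketch of the mask, round-down, invert, round-up sequence, not a definitive implementation. It assumes "even" and "odd" refer to the truncated integer part of each float (the comment above doesn't say) and that the values fit in a 32-bit int. The name `round_even_down_odd_up` is made up for illustration. The AVX-512 path is only compiled when the compiler defines `__AVX512F__` (e.g. with `-mavx512f`); otherwise the scalar loop does the same thing element by element.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: round down the elements whose integer part is even,
   round up the elements whose integer part is odd. */
void round_even_down_odd_up(float *x, size_t n)
{
    size_t i = 0;
#if defined(__AVX512F__)
    const __m512i one = _mm512_set1_epi32(1);
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_loadu_ps(x + i);
        /* Truncate to int32 so the parity can be tested. */
        __m512i vi = _mm512_cvttps_epi32(v);
        /* Mask bit is set where (vi & 1) == 0, i.e. the integer part is even. */
        __mmask16 even = _mm512_testn_epi32_mask(vi, one);
        /* Inverting the mask selects the odd lanes. */
        __mmask16 odd = _mm512_knot(even);
        /* Masked round-down: lanes outside the mask keep their old value. */
        v = _mm512_mask_roundscale_ps(v, even, v,
                                      _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
        /* Masked round-up on the remaining lanes. */
        v = _mm512_mask_roundscale_ps(v, odd, v,
                                      _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
        _mm512_storeu_ps(x + i, v);
    }
#endif
    /* Scalar path: handles the tail, or everything when AVX-512 is unavailable. */
    for (; i < n; i++) {
        int t = (int)x[i];          /* truncates toward zero */
        float ft = (float)t;
        if (t % 2 == 0) {
            x[i] = (ft > x[i]) ? ft - 1.0f : ft;   /* round down */
        } else {
            x[i] = (ft < x[i]) ? ft + 1.0f : ft;   /* round up */
        }
    }
}
```

Note that the vector loop has no branches: both rounding instructions run on the whole register, and the mask decides which lanes actually get written.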

Adding AVX-512 support means that a lot more applications suddenly become eligible for fairly trivial vectorization, and I personally can't wait for it to become universally available.

Vector instructions speed up things like:

    for (int i = 0; i < N; i++)
    {
        a[i] = b[i]*c[i];
    }
Effectively, they let you do this in chunks of 2, 4, 8 or 16 elements in a single instruction instead of element by element. On some processors the clock speed drops slightly while executing wide vector instructions, but usually not by enough to make the code slower than it would be without them.
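
For illustration, here is roughly what "chunks of 8" looks like when the same loop is written by hand with AVX intrinsics. This is a sketch, assuming `float` arrays and a made-up function name `multiply_arrays`; the vector loop is only compiled when the compiler defines `__AVX__` (e.g. with `-mavx`), and a plain scalar loop picks up whatever is left over.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: a[i] = b[i] * c[i] for n floats. */
void multiply_arrays(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    /* Main loop: 8 floats (256 bits) per multiply instruction. */
    for (; i + 8 <= n; i += 8) {
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(a + i, _mm256_mul_ps(vb, vc));
    }
#endif
    /* Tail loop: leftover elements, or everything when AVX is unavailable. */
    for (; i < n; i++) {
        a[i] = b[i] * c[i];
    }
}
```

In practice you rarely need to write this yourself for a loop this simple, since the compiler will usually generate something similar from the plain version.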

New instruction sets usually do two things: add support for new operations, and widen the registers so that each operation works on more elements at a time.

If you’re writing in a high-level language, you’ll only really see performance improvements if your interpreter or library takes advantage of them. For example, the Python library NumPy makes good use of vectorisation.

In terms of writing low-level code, simple loops are often auto-vectorised by the compiler, so you often see a sort of odd style where a loop doing many things is split up into several simpler loops that the compiler can deal with. You often end up having to run the code through something like VTune to work out whether a particular loop has actually vectorised.
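
A small sketch of that "odd style", with made-up function names and a made-up workload (scaling an array while also counting values into a 256-entry histogram). Whether either version actually vectorises depends on the compiler and flags, so treat it as an illustration of the shape rather than a guaranteed speedup.

```c
#include <stddef.h>

/* One loop doing two things: a simple scale, plus a histogram update whose
   data-dependent store can prevent the whole loop from being vectorised. */
void scale_and_count_fused(float *restrict out, const float *restrict in,
                           const unsigned char *restrict bucket,
                           unsigned *restrict hist, float scale, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * scale;
        hist[bucket[i]]++;      /* hist must have 256 entries */
    }
}

/* Same work split in two: the first loop is a simple candidate for
   auto-vectorisation, the second stays scalar. */
void scale_and_count_split(float *restrict out, const float *restrict in,
                           const unsigned char *restrict bucket,
                           unsigned *restrict hist, float scale, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * scale;
    }
    for (size_t i = 0; i < n; i++) {
        hist[bucket[i]]++;
    }
}
```

Besides a profiler, the compilers themselves can report what happened: GCC has `-fopt-info-vec` and Clang has `-Rpass=loop-vectorize`, both of which print the loops that were vectorised.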