by gumby 746 days ago
> good vector performance is still important (if less important than integer)

This is in part (major part IMHO) because few languages support vector operations as first class operators. We are still trapped in the tyranny that assumes a C abstract machine.

And because so few languages support vectors, the instruction mix doesn’t emphasize them, so there’s less incentive to work on new language paradigms, and we remain trapped in a suboptimal loop.

I’m not claiming there are any villains here; we’re just stuck in a hill-climbing failure.

2 comments

It’s not obvious that that’s what’s happened here. E.g., vector scheduling is separated, but there are more units for actually doing certain vector operations. It may be that lots of vector workloads are limited more by memory bandwidth than by ILP, so adding another port to the scheduler mightn’t add much. Being able to run other parts of the CPU faster when vectorised instructions aren’t being used could be worth a lot.
That matches with recent material I've read on vectorized workloads: memory bandwidth can become the limiting factor.
Always nice to see people rediscovering the roofline model.
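For anyone who hasn't met the roofline model, here is a minimal sketch in Python. The machine numbers (1000 GFLOP/s peak, 50 GB/s memory bandwidth) are made up for illustration, not measurements of any real CPU, and `attainable_gflops` is a hypothetical helper name.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Assumed machine parameters, chosen only to make the arithmetic easy.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 50.0

# c[i] = a[i] + b[i] on float32: 1 flop per element,
# 12 bytes moved per element (two 4-byte loads, one 4-byte store).
vector_add_intensity = 1 / 12

bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, vector_add_intensity)

# Arithmetic intensity (flops per byte) above which compute is the limit.
ridge = PEAK_GFLOPS / BANDWIDTH_GBS
```

Under these assumptions a plain vector add tops out at roughly 4.2 GFLOP/s, far below the compute roof, and a kernel would need about 20 flops per byte before extra vector throughput made any difference.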
But isn’t that why we have things like CUDA? Who exactly is “we” here, people who only have access to CPUs? :)
I’m not saying that you cannot write vector code, but that it’s typically a special case. CUDA APIs and annotations are bolted onto existing languages rather than reflecting languages with vector operations as natural first class operations.

C or Java have no concept of `a + b` being a vector operation the way a language like, say, APL does. You can come closer in C++, but in the end the memory model of C and C++ hobbles you. FORTRAN is better in this regard.
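To illustrate the "bolted on" point, here is a rough sketch of how `a + b` is usually made to look like a vector operation through operator overloading. It is written in Python only to keep it short; `Vec` is a hypothetical class, not part of any real library.

```python
class Vec:
    """Hypothetical elementwise vector type, for illustration only."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        if not isinstance(other, Vec):
            return NotImplemented
        if len(self.values) != len(other.values):
            raise ValueError("length mismatch")
        # The operator is just sugar over a scalar loop; the language
        # itself still has no notion of a vector operation.
        return Vec(x + y for x, y in zip(self.values, other.values))

    def __eq__(self, other):
        return isinstance(other, Vec) and self.values == other.values

    def __repr__(self):
        return f"Vec({self.values})"


a = Vec([1, 2, 3])
b = Vec([10, 20, 30])
c = a + b  # reads like a vector add, executes as a per-element loop
```

The syntax reads like APL, but the semantics live in a library: nothing in the language tells a compiler that this is one vector operation rather than three scalar additions.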

I see two options from this perspective.

It is always possible to inline assembler in C, and present vector operators as functions in a library.

Otherwise, R does perceive vectors, so another language that performs well might be a better choice. Julia comes to mind, but I have little familiarity with it.

With Java, calling into native vector code via JNI would be an (ugly) option.

Makes sense. I guess that’s why some Python libs use it under the hood.
What about Rust?
When the data is generated on the CPU, shoveling it to the GPU for a single vector operation or just a few, and then shoveling it back to the CPU to continue, will most likely cost more than the time saved.

And CUDA is Nvidia specific.
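The transfer-cost argument can be put as a back-of-the-envelope model. This is a sketch under stated assumptions: the data crosses the bus once in each direction, the 16 GB/s bus figure and all the timings are invented for illustration, and `offload_time` and `worth_offloading` are hypothetical helper names.

```python
def offload_time(n_bytes, bus_gbs, gpu_compute_s, launch_overhead_s=0.0):
    """Seconds to copy n_bytes to the GPU, compute, and copy n_bytes back."""
    transfer_s = 2 * n_bytes / (bus_gbs * 1e9)
    return transfer_s + gpu_compute_s + launch_overhead_s


def worth_offloading(cpu_compute_s, n_bytes, bus_gbs, gpu_compute_s,
                     launch_overhead_s=0.0):
    """True only if the full GPU round trip beats staying on the CPU."""
    gpu_total_s = offload_time(n_bytes, bus_gbs, gpu_compute_s,
                               launch_overhead_s)
    return gpu_total_s < cpu_compute_s


# Assumed numbers: 400 MB of data and a 16 GB/s bus.
N_BYTES = 400e6
BUS_GBS = 16.0

# One cheap vector operation: 20 ms on the CPU versus 1 ms on the GPU.
single_op = worth_offloading(0.02, N_BYTES, BUS_GBS, 0.001)

# A long chain of operations on the same data: 500 ms versus 25 ms.
many_ops = worth_offloading(0.5, N_BYTES, BUS_GBS, 0.025)
```

With these numbers the round trip alone costs 50 ms, so the single operation loses even though the GPU computes it 20 times faster; offloading only pays once enough work is done per byte transferred.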

Doesn’t CUDA also let you execute on the CPU? I wonder how efficiently.
No - a CUDA program consists of parts that run on the CPU as well as on the GPU, but the CPU (aka host) code is just orchestrating the process - allocating memory, copying data to/from the GPU, and queuing CUDA kernels to run on the GPU. All the work (i.e. running kernels) is done on the GPU.

There are other libraries (e.g. OpenMP, Intel's oneAPI) and languages (e.g. SYCL) that do let the same code be run on either CPU or GPU.

When you use a GPU, you are using a different processor with a different ISA, running its own barebones OS, with which you communicate mostly by pushing large blocks of memory over the PCIe bus. It’s a very different feel from, say, adding AVX-512 instructions to your program flow.