It's slower but maybe the target audience is different? Armadillo prioritizes MATLAB-like syntax. I use Armadillo as a stepping stone between MATLAB prototypes and a hand-rolled C++ solution, and in many scenarios it can get you a long way down the road.
On this exact sequence, is there an LLM of choice that is really performant at this translation task? To Armadillo, Eigen, Blaze or even NumPy?
I have had very little success with most of the open self-hosted ones, even with my 4xA40 setup, as they either don't know the C++ libraries or generate very good-looking NumPy code that is full of horrors: simple bugs and very, very subtle ones...
Looking for the same thing from any linear algebra library or language to CUDA BTW (yes, calls to cuBLAS/cuSOLVER/cuSPARSE/CUTLASS/cuDNN are OK). I haven't found one model able to write CUDA code properly - not even the kernels themselves, but at least chaining library calls.
Linear algebra routines seem like one of the worst possible use cases for current LLMs.
Large amounts of repetitive yet meaningfully detailed code. Algorithms that can be (and often are) implemented using different conventions or orders of operations. Edge cases out the wazoo.
A solid start seems like it would be using LLMs to write extensive test suites which you can use to verify these new implementations.
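To make the test-suite idea concrete, here is a minimal sketch in plain Python of a randomized check against a trusted oracle. Everything in it is hypothetical: `matmul_candidate` stands in for whatever translated routine you want to verify, and the tolerance and matrix sizes are arbitrary choices, not recommendations.

```python
import random

def matmul_reference(a, b):
    """Naive triple-loop product, used only as the trusted oracle."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_candidate(a, b):
    """Hypothetical stand-in for the translated routine under test."""
    bt = list(zip(*b))  # transpose b so each row of a meets a column of b
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def max_abs_diff(x, y):
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))

def check_matmul(candidate, trials=50, seed=0, tol=1e-12):
    """Compare candidate against the oracle on random shapes, including 1x1."""
    rng = random.Random(seed)
    for _ in range(trials):
        n, k, m = (rng.randint(1, 6) for _ in range(3))
        a, b = random_matrix(rng, n, k), random_matrix(rng, k, m)
        if max_abs_diff(candidate(a, b), matmul_reference(a, b)) > tol:
            return False
    return True
```

A harness like this only catches what the random shapes happen to exercise; the subtle convention bugs mentioned above (row vs. column major, transposed arguments) usually need hand-picked non-square and degenerate cases on top.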
Yet for me all this C++/CUDA code is a lot of boilerplate to express dense and supposedly very tired concepts. I thought LLMs were supposed to help with the boilerplate. But yeah I guess it won't work.
And yes, it's nice to build unit test and benchmark harnesses. But those were never really such time-wasters for me.
Tough to say something as blanket as "it's slower"... there are lots of operations in any linear algebra library. It's not a direct comparison with other C++ linear algebra libraries, but hard to say Armadillo is slow based on benchmarks like this:
Beating MKL for <100x100 is pretty doable. The BLAS framework has a decent amount of inherent overhead, so just exposing a better API (e.g. one that specifies the array types and sizes well) makes it pretty easy to improve things. For big sizes though, MKL is incredibly good.
If you are talking about non-small matrix multiplication in MKL, it is now open source as part of oneDNN. It has literally the same code as MKL (you can see this by inspecting constants or doing high-precision benchmarks).
For small matmul there is libxsmm. It may take tremendous effort to make something faster than oneDNN and libxsmm, as the JIT-based approach of https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/jit/g... is so flexible: if someone finds a better sequence, oneDNN can reuse it without a major change of design.
But MKL is not limited to matmul, as I understand it...
Aaaand debug times. And profiling. I'd forgotten the joys of debugging/tracing heavily templated code before I jumped back into Eigen. Not that MKL was easier to debug, but nowadays most of oneAPI is open source, at least the parts I use?
> almost all workloads aren't anywhere near saturating the AVX instruction max bandwidth on a CPU since Haswell
That’s true, but GPUs aren’t only good at FLOPs; their memory bandwidth is also an order of magnitude higher than that of system memory.
In my previous computer, the numbers were 484 GB/second for the 1080 Ti, and 50 GB/second for DDR4 system memory. In my current one, they are 672 GB/second for the 4070 Ti Super, and 74 GB/second for DDR5 system memory.
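For what it's worth, the "order of magnitude" can be checked with a few lines of arithmetic. This sketch uses only the four figures quoted above; the helper names are made up for illustration, and the streaming time is an idealized peak-bandwidth number, not a measurement.

```python
# Figures quoted in the comment above, in GB/s: (GPU memory, system memory).
systems = {
    "1080 Ti vs DDR4": (484, 50),
    "4070 Ti Super vs DDR5": (672, 74),
}

def bandwidth_ratio(gpu_gbps, cpu_gbps):
    """How many times more bandwidth the GPU memory has."""
    return gpu_gbps / cpu_gbps

def seconds_to_stream(gigabytes, gbps):
    """Ideal time to read `gigabytes` once at peak bandwidth."""
    return gigabytes / gbps

for name, (gpu, cpu) in systems.items():
    print(f"{name}: {bandwidth_ratio(gpu, cpu):.1f}x")
```

Both pairs come out a little over 9x, so the gap has stayed roughly constant across the two machines.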
I'm by no means an expert in the topic, but to share my take anyway: It seems to me like there are just diminishing returns in SIMD approaches. If you're going to organize your data well for SIMD use then it's not a far reach to make it work well on a GPU, which will keep getting more cores.
I imagine we'll get to a point where CPUs are actually just pretty dumb drivers for issuing GPU commands.
I don't think that there's a "win" here. It's just sort of which way you tilt your head, how much space do you have to cram a ton of cores connected to a really wide memory bus and how close can you get the storage while keeping everything from catching on fire, no? ("just sort of" is going to have to skip leg day because of the herculean lift it just did)
It's a fairly fractal pattern in distributed computing. Move the high-throughput heavy computation bits away from the low-latency responsive bits ("low latency" here is relative to the total computation). Use an event loop for the reactive bits. Eventually someone will invert the event loop to use coroutines so everything looks synchronous (Go, anyone? Python's gevent?).
After that, it seems to me that the only real question is whether it takes too long or costs too much to move the data to the storage location the heavy computation hardware uses. There's really not much of a conceptual difference between Airflow driving Snowflake and C++ running on a CPU driving CUDA kernels. It takes a certain scale to make going from an OLTP database to an OLAP database worth it, just like it takes a certain scale to make a GPU worth it over SIMD instructions on the local processor.
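A minimal sketch of that pattern using Python's standard `asyncio` (not gevent): the coroutine reads like a blocking call while the event loop stays free. `heavy_compute` is a hypothetical stand-in; in practice it would be something that releases the GIL, such as a native library call or a GPU launch, otherwise a thread pool buys you nothing for CPU-bound Python.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def heavy_compute(n):
    """Stand-in for the throughput-bound work (a kernel launch, a big query)."""
    return sum(i * i for i in range(n))

async def handle_request(pool, n):
    loop = asyncio.get_running_loop()
    # Looks synchronous, but the event loop keeps serving other tasks
    # while the worker thread does the heavy part.
    return await loop.run_in_executor(pool, heavy_compute, n)

async def main():
    with ThreadPoolExecutor(max_workers=2) as pool:
        jobs = (handle_request(pool, n) for n in (10, 100, 1000))
        return await asyncio.gather(*jobs)

if __name__ == "__main__":
    print(asyncio.run(main()))
```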
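That break-even question can be written down as a one-line model. This is a rough sketch, assuming a fixed link bandwidth and ignoring latency, overlap of transfer with compute, and any data that already lives on the far side; the function and parameter names are hypothetical.

```python
def offload_pays_off(bytes_moved, link_gbps, local_seconds, remote_seconds):
    """True if moving the data and computing remotely beats computing locally.

    bytes_moved    -- data that has to cross the link (both directions)
    link_gbps      -- assumed sustained link bandwidth in GB/s
    local_seconds  -- time for the job where the data already is
    remote_seconds -- time for the job on the heavy-compute hardware
    """
    transfer_seconds = bytes_moved / (link_gbps * 1e9)
    return transfer_seconds + remote_seconds < local_seconds
```

With made-up numbers: moving 1 GB over a 25 GB/s link costs 0.04 s, which is worth paying to turn a 0.5 s job into a 0.05 s one, but not to turn a 0.05 s job into a 0.02 s one.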
Yes and no. The compute density and memory bandwidth are unmatched. But the programming model is markedly worse, even for something like CUDA: you inherently have to think about parallelism, how to organize data, write your kernels in a special language, deal with wacky toolchains, and still get to deal with the CPU and operating system.
There is great power in the convenience of "with open('foo') as f:". Most workloads are still stitching together I/O bound APIs, not doing memory-bound or CPU-bound compute.
CUDA was always harder to program, even if you could get better perf.
It took a long time to find something that really took advantage of it, but we did eventually. CUDA enabled deep learning, which enabled LLMs. That's history.
What surprised me about the statement was that it implied that the model of Python driving optimized GPU kernels was broader than deep learning.
That was the original vision of CUDA: most of the computational work being done by massively parallel cores.
GPUs are still very limited, even compared to the SIMD instruction set. You couldn't make a CUDAjson the same way the simdjson library is built, for example, because a GPU doesn't handle SIMD branching in a way that accommodates it.
Second, again, the latency issue. GPUs are only good if you have a pipeline of data to constantly feed them, so that the PCIe transfer latency issue is minimal.
With PCIe 4 and 5 the latency issues are not as much of a problem as they were, what with latency masking, GPUDirect/storage-direct, busy-loop kernels (and hopefully soon scheduling libraries to make them easier to use) :-) And if you're really into real-time, computing time on NVIDIA GPUs has excellent jitter/stability, and they are used in the very tight control loop of adaptive optics (a 1 ms loop with mechanical actuators to drive).
The penalty for branching has come down in recent years, but yeah, it's still heavy. If you're OK with a bit of wasted compute, you can do some 'speculative' execution: run both branches in different warps and use only one result...
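As a CPU-side analogy only (this is plain Python, not CUDA, and it selects per element rather than per warp), the "compute both sides, keep one" idea looks like this. The two branch bodies are arbitrary placeholders.

```python
def branchy(xs):
    """Reference version with a data-dependent branch per element."""
    return [x * 2.0 if x > 0 else 1.0 - x for x in xs]

def both_sides_then_select(xs):
    """Evaluate both branches for every element, then pick with a mask."""
    taken = [x * 2.0 for x in xs]       # result if the condition holds
    not_taken = [1.0 - x for x in xs]   # result if it does not
    mask = [x > 0 for x in xs]
    return [t if m else f for t, f, m in zip(taken, not_taken, mask)]
```

Half of the arithmetic is thrown away, which is the "bit of wasted compute" in question; the trade only makes sense when both sides are cheap and free of side effects.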
Depends on whether you measure workloads as "jobs" or "flops". If "flops", I would hazard that the bulk of computing on the planet right now is happening on GPUs.
The wave of frontend developers that rose over the last 5 years learned that everything must be new.
That a math library of all things could be complete is several orders of thinking beyond their ability. I'm sure the gut reaction is to downvote this for the embarrassing criticism, but in all seriousness, this is the right answer.
Sure, code can be “feature complete”, but the reality is that the rest of the world changes, so there will be more and more friction for your users over time. For example, someone in the issue mentions they need to use mainline to use Eigen with CUDA now.
Mathematics is a priori. It's beyond the world changing. You might be surprised to learn we still use Euclid's geometry despite it being thousands of years old.
What you're actually saying is you expect open source maintainers to add arbitrary functionality for free.
Randomized linear algebra and under-solving (mixed precision or fp32 instead of fp64) seem to be taking off more than in the past, mostly on GPU though (use of tensor cores, expensive fp64, memory bandwidth limits).
And I wish Eigen had a larger spectrum of 'solvers' you can choose from, depending on what you want. But in general I agree with you, except there's always a cycle to eke out somewhere, right?
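One classic form of mixed precision is iterative refinement: solve cheaply in fp32, compute the residual in fp64, and correct. That is my choice of illustration, not something spelled out above. The sketch below is plain Python with fp32 simulated by rounding through `struct`; the helper names are hypothetical, and it assumes a small, well-conditioned, non-singular system.

```python
import struct

def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve_fp32(a, b):
    """Gaussian elimination with partial pivoting, every result rounded to fp32.
    No singularity check: assumes a non-zero pivot exists in each column."""
    n = len(b)
    m = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = f32(m[r][col] / m[col][col])
            for c in range(col, n + 1):
                m[r][c] = f32(m[r][c] - f32(factor * m[col][c]))
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = m[r][n]
        for c in range(r + 1, n):
            s = f32(s - f32(m[r][c] * x[c]))
        x[r] = f32(s / m[r][r])
    return x

def residual_fp64(a, x, b):
    """b - A x, computed in full double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]

def refine(a, b, steps=5):
    """fp32 solve followed by fp64-residual correction steps."""
    x = solve_fp32(a, b)
    for _ in range(steps):
        d = solve_fp32(a, residual_fp64(a, x, b))
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

The expensive part (the factorization, redone here on every step for brevity) stays in low precision; only the residual needs fp64, which is why this maps well onto hardware where fp64 is scarce.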
https://arma.sourceforge.net/