by dannyz 797 days ago
It seems like every large project these days has coalesced around Eigen. What are some of the advantages that Blaze has over Eigen?
5 comments

I'm surprised people think this; there is also the widely used Armadillo linear algebra library. In my opinion it has a much nicer syntax.

https://arma.sourceforge.net/

How's the performance?

EDIT: also, being on SourceForge is kind of a hindrance to discovery these days. I wonder why they chose to be on there instead of GitHub?

It's slower, but maybe the target audience is different? Armadillo prioritizes MATLAB-like syntax. I use Armadillo as a stepping stone between MATLAB prototypes and a hand-rolled C++ solution, and in many scenarios it can get you a long way down the road.
On this exact sequence, is there an LLM of choice that is really performant in this translation task? To Armadillo, Eigen, Blaze or even NumPy?

I have had very little success with most of the open self-hosted ones, even with my 4xA40 setup: they either don't know the C++ libraries or generate very good-looking NumPy code that is full of horrors, from simple mistakes to very, very subtle bugs...

I'm looking for the same thing from any linear algebra library or language to CUDA, BTW (yes, calls to cuBLAS/cuSOLVER/cuSPARSE/CUTLASS/cuDNN are OK). I haven't found one model able to write CUDA code properly. I'm not even asking for the kernels themselves, just correct chaining of library calls.

Probably doesn't exist (invoking Cunningham's Law).

Linear algebra routines seem like one of the worst possible use cases for current LLMs.

Large amounts of repetitive yet meaningfully detailed code. Algorithms that can be (and often are) implemented using different conventions or orders of operations. Edge cases out the wazoo.

A solid start seems like it would be using LLMs to write extensive test suites which you can use to verify these new implementations.
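
To make the test-suite idea concrete, here is a minimal sketch in plain Python (no NumPy, so it runs anywhere). It checks a candidate matmul against a slow but trustworthy reference on random shapes. All names here (matmul_reference, check_matmul, buggy_matmul) are made up for illustration, and buggy_matmul is a hypothetical example of the "good-looking but subtly wrong" translation described above, not output from any real model.

```python
import random

def matmul_reference(a, b):
    """Naive triple-loop matrix product: slow, but easy to trust."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def check_matmul(candidate, trials=200, max_dim=6, tol=1e-9, seed=0):
    """Return the (m, k, n) shapes on which candidate disagrees with the reference."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        m, k, n = (rng.randint(1, max_dim) for _ in range(3))
        a, b = random_matrix(rng, m, k), random_matrix(rng, k, n)
        expected = matmul_reference(a, b)
        try:
            got = candidate(a, b)
            ok = (len(got) == m and all(len(row) == n for row in got) and
                  all(abs(got[i][j] - expected[i][j]) <= tol
                      for i in range(m) for j in range(n)))
        except Exception:
            ok = False  # crashing on a valid shape counts as a failure
        if not ok:
            failures.append((m, k, n))
    return failures

def buggy_matmul(a, b):
    """Hypothetical subtle bug: indexes b as if it were already transposed."""
    return [[sum(a[i][k] * b[j][k] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

print("reference failures:", len(check_matmul(matmul_reference)))
print("buggy failures:", len(check_matmul(buggy_matmul)))
```

Random non-square shapes are the point: the transposed-index bug is invisible on symmetric inputs and only shows up once the test sweeps over shapes the author didn't think about.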

Yet for me all this C++/CUDA code is a lot of boilerplate to express dense and supposedly very tired concepts. I thought LLMs were supposed to help with the boilerplate. But yeah I guess it won't work.

And yes, it's nice to build unit test and benchmark harnesses. But those were never really such time-wasters for me.

Tough to say something as blanket as "it's slower"... there are lots of operations in any linear algebra library. It's not a direct comparison with other C++ linear algebra libraries, but hard to say Armadillo is slow based on benchmarks like this:

https://conradsanderson.id.au/pdfs/sanderson_curtin_armadill...

According to the provided benchmarks [1], it seems to be quite a bit faster.

[1] https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks

These benchmarks look to be ~8 years old, and don't really agree with benchmarks done by other sources (https://romanpoya.medium.com/a-look-at-the-performance-of-ex..., https://eigen.tuxfamily.org/index.php?title=Benchmark)

In general I would be skeptical about any benchmark that claims to beat MKL significantly on standard operations

Beating MKL for <100x100 is pretty doable. The BLAS framework has a decent amount of inherent overhead, so just exposing a better API (e.g. one that specifies the array types and sizes well) makes it pretty easy to improve things. For big sizes though, MKL is incredibly good.
If you are talking about non-small matrix multiplication in MKL, it is now open source as part of oneDNN. It has literally the same code as MKL (you can see this by inspecting constants or doing high-precision benchmarks).

For small matmul there is libxsmm. It may take tremendous effort to make something faster than oneDNN and libxsmm, because the JIT-based approach of https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/jit/g... is so flexible: if someone finds a better sequence, oneDNN can reuse it without a major change of design.

But MKL is not limited to matmul, as I understand it...

Compile times for one.

Eigen uses C++ templates to do most things, which explodes compile times.

AFAIK Blaze is also somewhat heavy on templates, but maybe it uses more modern metaprogramming techniques.
Compile times and binary sizes :(
Aaaand debug times. And profiling. I'd forgotten the joys of debugging/tracing heavily templated code before I jumped back into Eigen. Not that MKL was easier to debug, but nowadays most of oneAPI is open source, at least the parts I use?
Or cuBLAS. In practice, if I'm going through the trouble to rewrite math in C++, I'd rather just make GPU kernels.
I mean, that only works for a small subset of workloads where the data movement patterns fit, the bandwidth is more important than the latency, etc.

The reality is that almost all workloads aren't anywhere near saturating the AVX instruction max bandwidth on a CPU since Haswell.

> almost all workloads aren't anywhere near saturating the AVX instruction max bandwidth on a CPU since Haswell

That’s true, but GPUs aren’t only good at FLOPs; their memory bandwidth is also an order of magnitude higher than that of system memory.

In my previous computer, the numbers were 484 GB/second for the 1080 Ti and 50 GB/second for DDR4 system memory. In my current one, they are 672 GB/second for the 4070 Ti Super and 74 GB/second for DDR5 system memory.
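
As a quick sanity check on "an order of magnitude", here is the arithmetic on those four figures in Python. The helper names are made up for this sketch, and sweep_time_ms is only a lower bound that assumes a purely bandwidth-bound pass over the data.

```python
# Figures quoted above, in GB/s: (GPU memory, system memory)
systems = {
    "1080 Ti vs DDR4": (484, 50),
    "4070 Ti Super vs DDR5": (672, 74),
}

def bandwidth_ratio(gpu_gb_s, cpu_gb_s):
    """How many times more bandwidth the GPU memory offers."""
    return gpu_gb_s / cpu_gb_s

def sweep_time_ms(n_bytes, gb_per_s):
    """Lower bound on the time to stream n_bytes once at the given bandwidth."""
    return n_bytes / (gb_per_s * 1e9) * 1e3

for name, (gpu, cpu) in systems.items():
    print(f"{name}: {bandwidth_ratio(gpu, cpu):.1f}x")
# prints:
# 1080 Ti vs DDR4: 9.7x
# 4070 Ti Super vs DDR5: 9.1x
```

So both machines land just under 10x, and streaming 1 GB once takes at least 20 ms from the DDR4 versus roughly 2 ms from the 1080 Ti's memory.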

I'm by no means an expert in the topic, but to share my take anyway: It seems to me like there are just diminishing returns in SIMD approaches. If you're going to organize your data well for SIMD use then it's not a far reach to make it work well on a GPU, which will keep getting more cores.

I imagine we'll get to a point where CPUs are actually just pretty dumb drivers for issuing GPU commands.

As someone who worked on CUDA 15 years ago - it’s amazing to me that someone on the internet posted this statement.

Did GPUs win?

I don't think that there's a "win" here. It's just sort of which way you tilt your head, how much space do you have to cram a ton of cores connected to a really wide memory bus and how close can you get the storage while keeping everything from catching on fire, no? ("just sort of" is going to have to skip leg day because of the herculean lift it just did)

It's a fairly fractal pattern in distributed computing. Move the high-throughput heavy computation bits away from the low-latency responsive bits ("low latency" here is relative to the total computation). Use an event loop for the reactive bits. Eventually someone will invert the event loop to use coroutines so everything looks synchronous (Go, anyone? Python's gevent?).

After that, it seems to me that the only real question is whether it takes too long or costs too much to move the data to the storage location the heavy computation hardware uses. There's really not much of a conceptual difference between Airflow driving Snowflake and C++ running on a CPU driving CUDA kernels. It takes a certain scale to make going from an OLTP database to an OLAP database worth it, just like it takes a certain scale to make a GPU worth it over SIMD instructions on the local processor.
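
A back-of-the-envelope model of that "certain scale", as a sketch only. It assumes a purely bandwidth-bound workload that sweeps the same data several times, one host-to-device copy up front, and no overlap of copy and compute. The 50 and 484 GB/s figures are the DDR4 and 1080 Ti numbers quoted elsewhere in this thread; the 16 GB/s link and the 10 microsecond launch latency are assumptions picked for illustration, and every function name is made up.

```python
def cpu_time_s(n_bytes, passes, cpu_gb_s):
    """Time for the CPU to stream the data `passes` times from system memory."""
    return passes * n_bytes / (cpu_gb_s * 1e9)

def gpu_time_s(n_bytes, passes, gpu_gb_s, link_gb_s, latency_s):
    """One-time latency and host-to-device copy, then `passes` sweeps in GPU memory."""
    copy = n_bytes / (link_gb_s * 1e9)
    sweeps = passes * n_bytes / (gpu_gb_s * 1e9)
    return latency_s + copy + sweeps

def break_even_passes(n_bytes, cpu_gb_s, gpu_gb_s, link_gb_s, latency_s,
                      max_passes=10_000):
    """Smallest number of passes at which the GPU path wins, or None."""
    for passes in range(1, max_passes + 1):
        if (gpu_time_s(n_bytes, passes, gpu_gb_s, link_gb_s, latency_s)
                < cpu_time_s(n_bytes, passes, cpu_gb_s)):
            return passes
    return None

# 100 MB of data; link bandwidth and latency are assumed values.
print(break_even_passes(100e6, 50, 484, 16, 10e-6))
# 1 KB of data: the fixed latency dominates, so far more reuse is needed.
print(break_even_passes(1e3, 50, 484, 16, 10e-6))
```

Under these assumptions a single pass never pays for the copy, because the link is slower than system memory. The GPU only wins once the data is reused enough times on the device, which is the same shape of trade-off as moving data into an OLAP store.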

Yes and no. The compute density and memory bandwidth are unmatched. But the programming model is markedly worse, even for something like CUDA: you inherently have to think about parallelism, how to organize data, write your kernels in a special language, deal with wacky toolchains, and still get to deal with the CPU and operating system.

There is great power in the convenience of "with open('foo') as f:". Most workloads are still stitching together I/O bound APIs, not doing memory-bound or CPU-bound compute.

CUDA was always harder to program, even if you could get better perf.

It took a long time to find something that really took advantage of it, but we did eventually. CUDA enabled deep learning, which enabled LLMs. That's history.

What surprised me about the statement was that it implied that the model of Python driving optimized GPU kernels was broader than deep learning.

That was the original vision of CUDA: most of the computational work being done by massively parallel cores.

Win what? This person said they were inexperienced. SIMD is extremely valuable and the situations where it works well are not rare at all.
Not really, no.

GPUs are still very limited, even compared to the SIMD instruction set. You couldn't make a CUDAjson the same way the simdjson library is built, for example, because the GPU doesn't handle branching in a way that accommodates it.

Second, again, the latency issue. GPUs are only good if you have a pipeline of data to constantly feed them, so that the PCIe transfer latency issue is minimal.

With PCIe 4 and 5 the latency issues are not as much of a problem as they were, what with latency masking, GPUDirect/GPUDirect Storage, and busy-loop kernels (and hopefully soon scheduling libraries to make them easier to use) :-) And if you're really into real-time, compute time on NVIDIA GPUs has excellent jitter/stability, and they are used in the very tight control loops of adaptive optics (1 ms loop with mechanical actuators to drive).

The penalty for branching has come down in recent years, but yeah, it's still heavy. If you're OK with a bit of wasted compute, you can do some 'speculative' execution: run both branches in different warps and use only one result...
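
The "compute both, keep one" idea can be sketched on the CPU side in plain Python. This is only an analogue of the pattern, not CUDA and not how warps are actually scheduled; the function names and the two toy branch bodies are made up for illustration.

```python
import math

def branchy(xs):
    """Per-element branch: each element takes only one path."""
    out = []
    for x in xs:
        if x >= 0:
            out.append(math.sqrt(x))
        else:
            out.append(-x * 2.0)
    return out

def both_then_select(xs):
    """Evaluate both paths for every element, then keep one result per element."""
    # abs() keeps the 'then' path well-defined on elements that will discard it.
    then_vals = [math.sqrt(abs(x)) for x in xs]
    else_vals = [-x * 2.0 for x in xs]
    mask = [x >= 0 for x in xs]
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

data = [4.0, -3.0, 0.0, 9.0]
print(branchy(data))
print(both_then_select(data))
```

Half of the work is thrown away, which is the "bit of wasted compute" mentioned above. The catch is that both paths have to be safe to run on every input, hence the abs().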

But yes, you're still using an accelerator.

Depends on whether you measure workloads as "jobs" or "flops". If "flops", I would hazard that the bulk of computing on the planet right now is happening on GPUs.
Is Eigen still alive? There's been no release in 3 years, and no news about it: https://gitlab.com/libeigen/eigen/-/issues/2699
The master branch is active and people use Eigen today. The Discord has maintainers that are still active. Not sure how it could be considered "dead"?
The frontend developers who rose over the last 5 years learned that everything must be new.

That a math library of all things could be complete is several orders of thinking beyond their ability. I'm sure the gut reaction is to downvote this for the embarrassing criticism, but in all seriousness, this is the right answer.

I realize asking for a new 4.0 release is fair (and the GitLab issue does have a highly upvoted request for a release).

But you can't just call things "dead" for no reason; it's in poor taste. It's feature-complete, not dead!

Sure, code can be “feature complete”, but the reality is the rest of the world changes, so there will be more and more friction for your users over time. For example, someone in the issue mentions they need to use mainline to use Eigen with CUDA now.
Mathematics is a priori. It's beyond the world changing. You might be surprised to learn we still use Euclid's geometry despite it being thousands of years old.

What you're actually saying is you expect open source maintainers to add arbitrary functionality for free.

What? You mean I don't need to refactor and break API every 6 months?
I mean, it's not like linear algebra has changed that much in 4 years?
Randomized linear algebra and under-solving (mixed precision or fp32 instead of fp64) seem to be taking off more than in the past, mostly on GPU though (use of tensor cores, expensive fp64, memory bandwidth limits).
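
One common way mixed precision gets used is iterative refinement: solve cheaply in low precision, then correct using a residual computed in high precision. The sketch below is my own illustration of that technique, not something the comment above spells out. It emulates fp32 with the struct module on a 2x2 system, and every function name is made up.

```python
import struct

def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve2x2_fp32(a, b):
    """Cramer's rule with every intermediate rounded to fp32."""
    (a00, a01), (a10, a11) = [[f32(v) for v in row] for row in a]
    b0, b1 = f32(b[0]), f32(b[1])
    det = f32(f32(a00 * a11) - f32(a01 * a10))
    x0 = f32(f32(f32(b0 * a11) - f32(a01 * b1)) / det)
    x1 = f32(f32(f32(a00 * b1) - f32(b0 * a10)) / det)
    return [x0, x1]

def residual_fp64(a, x, b):
    """r = b - A x, computed in full fp64."""
    return [b[i] - (a[i][0] * x[0] + a[i][1] * x[1]) for i in range(2)]

def refine(a, b, steps=1):
    """fp32 solve followed by `steps` rounds of fp64-residual correction."""
    x = solve2x2_fp32(a, b)
    for _ in range(steps):
        r = residual_fp64(a, x, b)
        d = solve2x2_fp32(a, r)  # the correction is also solved in fp32
        x = [x[0] + d[0], x[1] + d[1]]
    return x

def max_error(x, exact):
    return max(abs(xi - ei) for xi, ei in zip(x, exact))

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
exact = [1.0 / 11.0, 7.0 / 11.0]
print("fp32 only:   ", max_error(solve2x2_fp32(A, b), exact))
print("one refine:  ", max_error(refine(A, b, steps=1), exact))
```

The expensive part (the solve) stays in fp32 the whole time; only the cheap residual needs fp64, which is why this fits hardware where fp64 is slow.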

And I wish Eigen had a larger spectrum of 'solvers' you can choose from, depending on what you want. But in general I agree with you, except there's always a cycle to eke out somewhere, right?

Too many people have had their brains rotted by the web dev world, where things are reinvented every other week.