| HN Mirror


by kouteiheika 301 days ago

> But it can't in general build torch-tensorrt or flash-attn because it has no way of knowing if Mercury was in retrograde when you ran pip.

This is a self-inflicted wound, since flash attention insists on building a native C++ extension, which is completely unnecessary in this case.

What you can do is the following:

1) Compile your CUDA kernels offline.
2) Include those compiled kernels in a package you push to PyPI.
3) Call into the kernels with pure Python, without going through a C++ extension.

I do this for the CUDA kernels I maintain and it works great.
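
For readers unfamiliar with step 3, here is a minimal sketch of what calling a precompiled kernel from pure Python could look like, using only `ctypes` and the CUDA driver API. It is not necessarily how the commenter's packages do it: the file-naming scheme (`cubin_filename`) and the helper names are hypothetical, it assumes a CUDA context is already current, and error handling is reduced to a single status check.

```python
import ctypes
from pathlib import Path


def cubin_filename(kernel: str, major: int, minor: int) -> str:
    """Hypothetical naming scheme for prebuilt kernels shipped inside the wheel."""
    return f"{kernel}.sm_{major}{minor}.cubin"


def load_cuda_driver():
    """Return a ctypes handle to the CUDA driver library, or None if it is absent."""
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None


def check(status: int, what: str) -> None:
    """Driver API calls return a CUresult; 0 means CUDA_SUCCESS."""
    if status != 0:
        raise RuntimeError(f"{what} failed with CUresult {status}")


def load_kernel(driver, cubin_path: Path, entry_point: str) -> ctypes.c_void_p:
    """Load a compiled module from disk and look up one kernel in it.

    Assumes a CUDA context is already current on this thread, for example
    because PyTorch has already initialized the device.
    """
    image = cubin_path.read_bytes()
    module = ctypes.c_void_p()
    check(driver.cuModuleLoadData(ctypes.byref(module), image), "cuModuleLoadData")
    function = ctypes.c_void_p()
    check(
        driver.cuModuleGetFunction(
            ctypes.byref(function), module, entry_point.encode()
        ),
        "cuModuleGetFunction",
    )
    return function
```

The returned handle would then be passed to `cuLaunchKernel` together with grid and block dimensions and an array of argument pointers. Nothing here is compiled against a particular Python or PyTorch ABI, which is the property the comment is relying on.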

Flash attention currently publishes 48 (!) different packages[1], for different combinations of PyTorch and C++ ABI. With this approach it would only have to publish one, and it would work for every combination of Python and PyTorch.

[1] - https://github.com/Dao-AILab/flash-attention/releases/tag/v2...


twothreeone 301 days ago

While shipping binary kernels may be a workaround for some users, it goes against what many people would consider "good etiquette" for various valid reasons, such as hackability, security, or providing free (as in liberty) software.


rkangel 301 days ago

Shipping binary artifacts isn't inherently a bad thing - that's what (most) Linux Distros do after all! The important distinction is how that binary package was arrived at. If it's a mystery and the source isn't available then that's bad. If it's all in the open source repo and part of the Python package build and is completely reproducible then that's great.


benreesman 301 days ago

GP's solution is the one that I (and I think most people) ultimately wind up using. But it's not a nice time in the usual "oh we'll just compile it" sense of a typical package.

flash-attn in particular has its build so badly misconfigured, and is so heavy, that it will lock up a modern Zen 5 machine with 128GB of DDR5 if you don't re-nice ninja (assuming of course that you remembered it just won't work without a pip-visible ninja). It can't build a wheel (at least not obviously) that will work correctly on both Ampere and Hopper. It also declares its dependencies incorrectly, so it will demand torch even if torch is in your pyproject.toml, and you end up breaking build isolation.
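
As a rough illustration of the re-nice workaround, the sketch below assembles a throttled build command. `MAX_JOBS` and `--no-build-isolation` are taken from the flash-attn README; the `build_command` helper, its default values, and the use of `nice` (POSIX only) are assumptions for illustration, not a recommended configuration.

```python
import os
import sys


def build_command(max_jobs: int = 4, niceness: int = 19):
    """Return (argv, env) for a throttled source build of flash-attn.

    The default values are assumptions; tune them to the machine's RAM.
    """
    env = dict(os.environ)
    env["MAX_JOBS"] = str(max_jobs)  # cap the number of parallel compile jobs
    argv = [
        "nice", "-n", str(niceness),  # lower the CPU priority of the whole build
        sys.executable, "-m", "pip", "install",
        "flash-attn",
        "--no-build-isolation",  # torch and ninja must already be installed
    ]
    return argv, env


# To actually start the (long) build:
#   import subprocess
#   argv, env = build_command()
#   subprocess.run(argv, env=env, check=True)
```

This only keeps the machine responsive during the build. It does nothing about the Ampere/Hopper or dependency-declaration problems described above.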

So now you've got your gigabytes of fragile wheel that won't run on half your cards, let's make a wheel registry. Oh, and machine learning everything needs it: half of diffusers crashes at runtime without it. Runtime.

The dirty little secret of these 50MM offers at AI companies is that way more people understand the math (which is actually pretty light compared to say graduate physics) than can build and run NVIDIA wheels at scale. The wizards who Zuckerberg will fellate are people who know some math and can run Torch on a mixed Hopper/Blackwell fleet.

And this (I think) is Astral's endgame. I think pyx is going to fix this at scale and they're going to abruptly become more troublesome to NVIDIA than George Hotz or GamersNexus.


agos 301 days ago

Dumb question from an outsider - why do you think this is so bad? Is it because so much of the ML-adjacent code is written by people with a background in academia and data science instead of software engineering? Or is it just Python being bad at this?


structural 301 days ago

1. If you want to advance the state of the art as quickly as possible (or have many, many experiments to run), being able to iterate quickly is the primary concern.

2. If you are publishing research, any time spent beyond what's necessary for the paper is a net loss, because you could be working on the next advancement instead of doing all the work of making the code more portable and reusable.

3. If you're trying to use that research in an internal effort, you'll take the next step to "make it work on my cloud", and any effort beyond that is also a net loss for your team.

4. If the amount of work necessary to prototype something that you can write a paper with is 1x, the amount of work necessary to scale that on one specific machine configuration is something like >= 10x, and the amount of work to make that a reusable library that can build and run anywhere is more like 100x.

So it really comes down to - who is going to do the majority of this work? How is it funded? How is it organized?

This can be compared to other human endeavours, too. Take the nuclear industry and its development as a parallel. The actual physics of nuclear fission is understood at this point and can be taught to lots of people. But to get from there to building a functional prototype reactor is a larger project (10x scale), and then scaling that to an entire powerplant complex that can be built in a variety of physical locations and run safely for decades is again orders of magnitude larger.


benreesman 301 days ago

TLDR: Broken builds are the default in everything; only exceptional effort and resources get you anything else. In Python, the people with those resources have unclear incentives for anything to improve.

I think it's a combination of historical factors and contemporary misaligned incentives both in the small and the large. There are also some technical reasons why Python is sort of an "attractive nuisance" for really problematic builds.

The easy one that shouldn't be too controversial is that it has a massive C/C++ (and increasingly Rust) native code library ecosystem. That's hard to do under the best of circumstances, but it's especially tough in Python (paradoxically because Python is so good at this: when wrapping the proven fast library is really easy, you do it all the time). In the absence of really organized central planning and real "SAT Solver Class" package managers (like `uv`, not like `pip`), a mess is more or less just nature taking its course. That's kinda how we got here (or how we got to 2016 maybe).
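
To make the "solver" point concrete, here is a toy sketch of dependency resolution as constraint satisfaction. The package names and the index are made up, and exhaustive search stands in for the much smarter algorithms real resolvers use; the only point is that always grabbing the newest version of each package can be wrong.

```python
from itertools import product

# Hypothetical index: package -> {version: {dependency: allowed versions}}
INDEX = {
    "app": {1: {"fastlib": {1}, "wrapper": {1, 2}}},
    "wrapper": {1: {"fastlib": {1}}, 2: {"fastlib": {2}}},
    "fastlib": {1: {}, 2: {}},
}


def resolve(index):
    """Return the first assignment (newest versions tried first) that
    satisfies every constraint, or None if no assignment exists."""
    names = sorted(index)
    # Candidate versions per package, newest first.
    choices = [sorted(index[name], reverse=True) for name in names]
    for combo in product(*choices):
        picked = dict(zip(names, combo))
        satisfied = all(
            picked[dep] in allowed
            for name, version in picked.items()
            for dep, allowed in index[name][version].items()
        )
        if satisfied:
            return picked
    return None
```

Here `resolve(INDEX)` has to back off from the newest `wrapper` and `fastlib`, because `app` pins `fastlib` to version 1 and `wrapper` 2 needs `fastlib` 2. An installer that picks versions one package at a time without revisiting earlier choices can end up with an inconsistent environment in exactly this situation.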

But lots of language ecosystems predate serious solvers and other modern versioning. Why is Python such a conspicuous mess on this in 2025? How can friggin JavaScript have it together about 100x better?

That's where the bad incentives kick in. In the small, there is a lingering prestige attached to "AI Researcher" that makes about zero sense in a world where we're tweaking the same dozen architectures and the whole game is scaling it, but that's the way the history went. So people who need it to work once to write a paper and then move on? `pip freeze` baby, works on my box. Docker amplifies this "socialize the costs" thing because now you can `pip freeze` your clanky shit, spin it up on 10k Hopper cards, and move on. So the highest paid, most regarded, most clout-having people don't directly experience the pain, it's an abstraction to them.

In the large? If this shit worked then (hopefully useful oversimplification alert) FLOPs would be FLOPs. The LAPACK primitives and even more modern GEMM instructions can be spelled some fast way on pretty much any vendor's stuff. NVIDIA is usually ahead a word-shrink or two, but ROCm in principle supports training at FP8: on CDNA (expensive) cards it does, while on RDNA (cheap) cards it says it does on the label but crashes under load, so you can't use it if your time is worth anything.

The big labs and FAANGs are kind of the dark horse here. In principle you'd assume Meta would want all their Torch code to run on AMD, but their incentives are complicated: they do a lot of really dumb shit that's presumably good for influential executives because it's bad for shareholders. It's also possible that they've just lost the ability to do that level of engineering; it's hard and can't be solved by numbers or money alone.


kouteiheika 301 days ago

It's not a workaround; it's the most sane way of shipping such software. As long as the builds are reproducible there's nothing wrong with shipping binaries by default, especially when those binaries require non-trivial dependencies (the whole CUDA toolchain) to build.

There's a reason why even among the most diehard Linux users very few run Gentoo and compile their whole system from scratch.


benreesman 301 days ago

I agree with you that binary distribution is a perfectly reasonable adjunct to source distribution and sometimes even the more sensible one (toolchain size, etc).

In this instance the build is way nastier than building the NVIDIA toolchain (which Nix can do with a single line of configuration in most cases), and the binary artifacts are almost as broken as the source artifact because of NVIDIA tensor core generation shenanigans.
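
A simplified sketch of the generation problem, assuming the general compatibility rules from NVIDIA's CUDA documentation: a compiled cubin runs only on devices with the same major compute capability and an equal or higher minor, while embedded PTX can be JIT-compiled for equal or newer capabilities. The `can_run` helper is hypothetical, and it deliberately ignores architecture-specific targets such as `sm_90a`, which are not forward compatible.

```python
AMPERE_A100 = (8, 0)  # compute capability 8.0
HOPPER_H100 = (9, 0)  # compute capability 9.0


def can_run(device_cc, cubin_archs, ptx_archs=()):
    """Return True if a device with compute capability `device_cc` can run a build.

    `device_cc` and every entry of `cubin_archs` / `ptx_archs` is a
    (major, minor) tuple.
    """
    major, minor = device_cc
    # A cubin needs the same major version and an equal or lower minor version.
    if any(m == major and n <= minor for m, n in cubin_archs):
        return True
    # PTX can be JIT-compiled by the driver for any equal or newer capability.
    return any((m, n) <= (major, minor) for m, n in ptx_archs)
```

Under these rules `can_run(HOPPER_H100, [(8, 0)])` is false, so a wheel containing only Ampere cubins fails on Hopper unless PTX is shipped as well. Even when the PTX path loads, that says nothing about whether the kernel uses the newer generation's tensor core instructions.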

The real answer here is to fucking fork flash-attn and fix it. And it's on my list, but I'm working my way down the major C++ packages that all that stuff links to first. `libmodern-cpp` should be ready for GitHub in two or three months. `hypermodern-ai` is still mostly a domain name and some scripts, but they're the scripts I use in production, so it's coming.


kouteiheika 301 days ago

I thought about fixing Flash Attention too so that I don't have to recompile it every time I update Python or PyTorch (it's the only special snowflake dependency that I need to handle manually), but at the end of the day it's not enough of a pain to justify the time investment.

If I'm going to invest time here then I'd rather just write my own attention kernels and also do other things which Flash Attention currently doesn't do (8-bit and 4-bit attention variants similar to Sage Attention, and focus on supporting/optimizing primarily for GeForce and RTX Pro GPUs instead of datacenter GPUs which are unobtanium for normal people).


benreesman 301 days ago

I usually think the same way, and I bet a lot of people do, which is why it's still broken. But I've finally decided it's never going away completely and it's time to just fix it.
