Hacker News new | ask | show | jobs
by WanderPanda 708 days ago
Compiler folks: Is there any chance compilers will be able to find optimizations like FlashAttention on their own? Seems like TVM and tinygrad are working in that direction but I find it hard to believe that that would be feasible
5 comments

In theory, yes, it's "just" some algebraic properties of the math used that allow for substantial reordering, and then you'd add fairly regular polyhedral loop tiling. Just expensive to do, so you'll have to cache the effort.

The area of e-graph optimizers seems well-suited to this, btw. It's not really deployed outside of some niche tooling though, as it's a big paradigm shift in optimizer pass handling (e.g., doesn't work well with chairs classic call graphs, so control flow needs to be massively revamped to deploy e-graphs outside/across basic blocks and for loops (break and return not supported!)).

Just discovered e-graph recently and I have a good understanding of compiler from taking compiler class at university.

I would like to understand why you say e-graph would need control-flow to be revamped.

Do you have anything I could read on it ?

RVSDG (and the like) is needed to cope with shared state like globally addressable memory; plain e-graphs are only suited for pure code.

Happy to chat more btw, feel free to hit me up on discord.

This strikes me as an extremely difficult but not intractable problem.

I'm not sure what the state of the art in compiler optimisation is with regard to data positioning and targeting maximum processor usage

There was a video on optimisation a while back that showed small optimisations caused increases in speed that were insignificant when compared to the speed variance induced by the memory layout that the optimisation (or even a random change) caused.

While that talk was more focused on getting a signal past the noise. That noise itself is an artifact of compilers being not particularly good at handling a much simpler form of the problem you describe.

CPU and memory architectures are complex when caches and access patterns impact upon speed.

When you add in GPU architectures to the mix I think you might be in fairly uncharted territory.

Maybe one day.

Of course since we are in the field of AI there is also the question of could a sufficiently smart AI do this. It depends on the value of sufficient.

I would like to think that an extremely high level test for an AI model could be to give it something like micrograd and tell it to produce something with the same interface that outperforms torch.

We're not even in the ballpark of being able to do that yet, but it will be interesting when and if that happens.

No. Think of it like a different algorithm. You just take the shape of the hardware into consideration when designing the algorithm instead of considering math only.

> Seems like TVM

Fair enough, though technically they are still about different things but it's indeed very close, but

> and tinygrad

?????? what gives you this impression?

What's the distinction between what TVM does and FlashAttention type optimizations?
There is more than layout / tile schedule in FA. For example, first, to be able to fuse all these together [0] at all, you need to "decompose" the softmax to make it combinable, which requires maintaining some extra statistics. Won't gonna repeat the math here as the original FA paper is already very clear.

[0] so you can avoid materializing intermediate matrices and still being able to compute in blocks.

Geo has explicitly stated he wants to be able to find FA in the search space of algos eventually. Actually achieving that is another matter.
Kinda tricky if you want to call higher level operators in a wrapped language like Python.