by rfoo 708 days ago
No. Think of it like a different algorithm. You just take the shape of the hardware into consideration when designing the algorithm instead of considering math only.

> Seems like TVM

Fair enough. Technically they are still about different things, though it's indeed very close. But

> and tinygrad

?????? what gives you this impression?

2 comments

What's the distinction between what TVM does and FlashAttention type optimizations?
There is more to FA than layout / tile scheduling. For example, to be able to fuse all of these together [0] at all, you first need to "decompose" the softmax to make it combinable across blocks, which requires maintaining some extra statistics. I'm not going to repeat the math here, as the original FA paper is already very clear.

[0] so you can avoid materializing intermediate matrices and still be able to compute in blocks.
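
For a rough idea of what "extra statistics" means in practice, here is a minimal sketch in plain Python. It is not the FA kernel: it handles a single query row with scalar values instead of matrix tiles, and the function names, `block_size`, and the `m` / `l` / `acc` variables are my own choices for illustration. It assumes a non-empty list of scores. The point is only that carrying a running max and a running sum lets you process scores block by block and still get the exact softmax-weighted result without ever holding the full softmax.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: materialize the full softmax, then take the weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def blockwise_softmax_weighted_sum(scores, values, block_size=2):
    """Same result, one block at a time, carrying only (m, l, acc)."""
    m = float("-inf")  # running max of all scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        # Rescale the old statistics so they are relative to the new max.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(m - m_new)
        l = l * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(
            math.exp(s - m_new) * v for s, v in zip(s_blk, v_blk)
        )
        m = m_new
    return acc / l


scores = [0.5, -1.0, 3.0, 2.0, 0.1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
full = softmax_weighted_sum(scores, values)
blocked = blockwise_softmax_weighted_sum(scores, values, block_size=2)
print(abs(full - blocked) < 1e-12)
```

The rescale step is the "combinable" part: any two partial results can be merged as long as each carries its own max and sum, which is what makes the fusion possible at all.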

Geo has explicitly stated he wants to be able to find FA in the search space of algos eventually. Actually achieving that is another matter.