There is more to FA than layout and tile scheduling. For example, to be able to fuse all of these together [0] at all, you first need to "decompose" the softmax so that partial results can be combined, which requires maintaining some extra statistics (a running max and a running sum of exponentials per row). I won't repeat the math here, as the original FA paper is already very clear.
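To make the "extra statistics" concrete, here is a rough sketch of the idea for a single query row, in plain Python. It is not the FA kernel: the function names, the scalar `values`, and `block_size` are all made up for illustration, and the scores are assumed to be finite.

```python
import math

def naive_attention_row(scores, values):
    """Reference: softmax(scores) dotted with values, computed all at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def blockwise_attention_row(scores, values, block_size=2):
    """Same result, but consuming the scores one block at a time.

    Keeps three running quantities: the max seen so far (m), the softmax
    denominator relative to that max (l), and the unnormalized output (acc).
    """
    m = float("-inf")
    l = 0.0
    acc = 0.0
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        # Rescale what was accumulated under the old max to the new max.
        scale = math.exp(m - m_new)
        p_blk = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p_blk)
        acc = acc * scale + sum(p * v for p, v in zip(p_blk, v_blk))
        m = m_new
    # Normalize once at the end, after every block has been seen.
    return acc / l

if __name__ == "__main__":
    scores = [0.5, -1.2, 3.0, 2.1, 0.0]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(naive_attention_row(scores, values))
    print(blockwise_attention_row(scores, values, block_size=2))
```

The two functions should agree up to floating-point rounding. In the real thing the values are vectors and each block is a tile of scores computed on the fly, but the rescale-and-accumulate step has the same shape, and the full row of scores never has to exist at once.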
[0] so you can avoid materializing intermediate matrices and still be able to compute in blocks.