
by vhiremath4 708 days ago

There are a bunch of good answers, but I wanted to succinctly say "practically, quite a bit". Here's a good little rabbit-hole example:

> https://github.com/karpathy/nanoGPT/blob/master/model.py#L45

Karpathy's nanoGPT enables flash attention by checking whether torch.nn.functional.scaled_dot_product_attention exists.
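For anyone who doesn't want to click through, here is a rough sketch of that feature-check pattern. It is not a copy of nanoGPT's code: the names HAS_SDPA, manual_causal_attention and causal_attention are made up for illustration, and the tensor shapes are arbitrary. Which backend scaled_dot_product_attention actually picks (flash, memory-efficient, or plain math) depends on your device, dtype and PyTorch build, so on a CPU this will not run FA2.

```python
import math

import torch
import torch.nn.functional as F

# Feature check: the fused attention entry point only exists in PyTorch >= 2.0.
HAS_SDPA = hasattr(F, "scaled_dot_product_attention")


def manual_causal_attention(q, k, v):
    """Fallback path: materializes the full (T, T) score matrix."""
    T = q.size(-2)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    # Lower-triangular mask: position i may only attend to positions <= i.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


def causal_attention(q, k, v):
    """Use the fused kernel when available, otherwise the manual path."""
    if HAS_SDPA:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return manual_causal_attention(q, k, v)


torch.manual_seed(0)
# Hypothetical shapes: (batch, heads, sequence length, head dim).
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
out = causal_attention(q, k, v)
```

Both paths should agree up to floating-point noise; the difference is in how much memory traffic it takes to get there.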

> https://pytorch.org/docs/stable/generated/torch.nn.functiona...

Looking at the docs, most of the time you want this to dispatch to FA2, which tunes the kernels to the device: it computes the softmax over the causal (triangular) attention matrix block by block, and it cuts down on shuttling floating-point data back and forth between the GPU's high-bandwidth memory (HBM) and its much faster on-chip SRAM.
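The block-wise softmax is the part that's easiest to see in code. Below is a toy, pure-Python sketch of the "online softmax" rescaling idea for a single query row with scalar values. It is only an illustration under those simplifying assumptions: the real kernels work on tiles of Q, K and V held in on-chip memory, and the function names and the block size here are my own.

```python
import math


def attention_row_naive(scores, values):
    """Softmax-weighted sum that needs the whole row of scores at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_tiled(scores, values, block=2):
    """Same result, but visiting the row one block at a time."""
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    running_out = 0.0            # unnormalized output, relative to running_max
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old max (exp(-inf) is 0.0).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * scale + sum(weights)
        running_out = running_out * scale + sum(
            w * v for w, v in zip(weights, v_blk)
        )
        running_max = new_max
    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
naive = attention_row_naive(scores, values)
tiled = attention_row_tiled(scores, values, block=2)
```

Because only a running max, a running sum and a running output are kept, the full row of scores never has to be written out and read back, which is where the memory-traffic savings come from.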

> https://arxiv.org/pdf/2307.08691

The FA2 paper frames almost all of its design decisions in terms of the hardware it runs on.