Hacker News new | ask | show | jobs
by pico_creator 537 days ago
lower compute cost especially over longer sequence length. Depending on context length, its 10x, 100x, or even 1000x+ cheaper. (quadratic vs linear cost difference)