| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by chillee 974 days ago
	I cover it a bit in the blog post, but unless you have a really long context length (like 32k+), your primary computational cost doesn't come from attention but rather from loading your weights from VRAM into registers. I mean, practically speaking, completions from say, ChatGPT or Claude take seconds to finish :)