by mfro 70 days ago
Strangely, it is super fast on my 16 Plus, but with longer messages it can slow down a LOT, and not because of thermal throttling. I wish I could see some diagnostic data.

Inference from an LLM is O(tokens^2)
Only in the naive implementations of attention
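For anyone who wants to see where the quadratic term comes from, here is a rough sketch that only counts query-key dot products while generating tokens one at a time. It assumes a single attention head in a single layer, causal attention, and an empty prompt; the function names are made up for illustration and do not come from any library. Under those assumptions, re-running attention over the whole prefix at every step grows roughly cubically in total, while reusing cached keys and values makes each new token cost linear in the context so far, which is still quadratic summed over the whole reply.

```python
def pairs_without_cache(n_tokens: int) -> int:
    """Count query-key dot products when each step recomputes
    causal attention for the entire prefix (hypothetical naive decoder)."""
    total = 0
    for t in range(1, n_tokens + 1):
        # All t queries are recomputed; query i attends to i keys,
        # so one step costs 1 + 2 + ... + t = t * (t + 1) / 2 pairs.
        total += t * (t + 1) // 2
    return total


def pairs_with_cache(n_tokens: int) -> int:
    """Count query-key dot products when earlier keys and values are cached,
    so only the newest token's query is evaluated at each step."""
    total = 0
    for t in range(1, n_tokens + 1):
        # One new query attends to the t keys seen so far (itself included).
        total += t
    return total


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, pairs_without_cache(n), pairs_with_cache(n))
```

This says nothing about what the app on the 16 Plus actually does. It only shows that per-token cost grows with context length even with a cache, which would be consistent with longer messages slowing down without any thermal throttling.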