by SkyPuncher
162 days ago
I've been seeing a bunch of LLM-adjacent articles recently that focus on being fast, and they leave me a bit stumped. While latency _can_ be a problem, reliability and accuracy are almost always my bottlenecks to delivering user value. That's especially true of chunking, which is generally a one-time process where users aren't latency-sensitive.

And this is a bit of a sliding scale. Of course users want the best possible answer. However, if they can get 80% (magic hand-wavey fakie number) of the best answer in one second instead of 20, that may be a worthwhile tradeoff.