by burakemir 28 days ago
Take this with a grain of salt, as I am new to this, but IMHO, for establishing the memory hierarchy once and for all, it would be more helpful to present some abstract theory that

* Explains prefill (time to first token, TTFT) vs. decode (time between tokens, TBT, aka 1/tps)

* Lays out the various ways to schedule the computation, and the roles of runtime vs. driver

* Covers the scenarios and choices, taking into account traffic patterns and whether you are an inference service or doing batch or claw whatnot.
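
To make the first bullet concrete, here is a minimal sketch of how TTFT, TBT, and tps relate, computed from per-token arrival timestamps. The function name `latency_metrics` and the timestamps are hypothetical, made up for illustration; TBT is taken here as the mean gap between consecutive tokens, which is one reasonable reading rather than a standard definition.

```python
from statistics import mean


def latency_metrics(request_time, token_times):
    """Return (ttft, tbt, tps) from timestamps in seconds.

    request_time: when the request was sent (hypothetical input).
    token_times:  arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    # Prefill cost shows up as the wait for the first token.
    ttft = token_times[0] - request_time

    # Decode cost shows up as the gaps between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt = mean(gaps) if gaps else float("nan")

    # Tokens per second is the reciprocal of the time between tokens.
    tps = 1.0 / tbt if gaps and tbt > 0 else float("nan")
    return ttft, tbt, tps


# Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
ttft, tbt, tps = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, tbt, tps)
```

With these made-up numbers, the 0.5 s wait is the prefill side and the 0.25 s gaps are the decode side, so the two can be tuned and reported separately.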