Hacker News new | ask | show | jobs
by dramlord 921 days ago
You can amortize memory loading with large continuous batching. I imagine more compute would help the problem for certain workloads like speculative decoding
1 comments

Batching helps throughput and anyone running in production will be doing batching.

But it's not free, and still comes at a cost of per-stream latency.

Speculative decoding seems less effective in practice than in theory.