by adiraja 544 days ago
We focused mainly on the scheduling side of things: we essentially prioritize prefills over decodes. To do this correctly, we had to monitor KV cache usage, and whenever the cache is close to running out of memory, we schedule more decodes again.

This means you end up either having many decodes wait for prefills to complete or scheduling decodes alongside prefills. Both scenarios result in slower decodes, which is why we're seeing an increase in inter-token latency (ITL). This is the main tradeoff we've made.
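A minimal sketch of the policy described above, for illustration only: prefer prefills while the KV cache has headroom, and fall back to decodes once usage nears capacity. The function name `pick_phase`, its parameters, and the 0.9 watermark are all hypothetical and are not taken from the actual scheduler.

```python
def pick_phase(kv_used, kv_total, num_waiting, num_running, high_watermark=0.9):
    """Decide what the next scheduling step should favor.

    kv_used / kv_total: KV cache blocks in use / available (assumed units).
    num_waiting: requests that still need a prefill.
    num_running: requests that are already generating tokens.
    high_watermark: assumed usage fraction above which prefills are paused.
    """
    usage = kv_used / kv_total

    # Prefills come first as long as the cache is not close to full.
    if num_waiting > 0 and usage < high_watermark:
        return "prefill"

    # Near capacity (or nothing waiting): run decodes so that requests
    # finish and release their KV cache blocks.
    if num_running > 0:
        return "decode"

    # Nothing is decoding; admit a prefill if one is waiting, else idle.
    return "prefill" if num_waiting > 0 else "idle"


if __name__ == "__main__":
    print(pick_phase(kv_used=10, kv_total=100, num_waiting=3, num_running=5))
    print(pick_phase(kv_used=95, kv_total=100, num_waiting=3, num_running=5))
```

A fixed watermark is the simplest possible trigger; a real scheduler would likely also account for how much cache each waiting prompt needs.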


So, while time to first token is lower, throughput might also be lower in most cases?
Per-user throughput might be lower at the moment, yes. We're working on GPU kernel-level optimizations now to fix that.

But across all users on our system, throughput is better, because running more prefills or a large number of grouped decodes makes better use of the GPU.

The idea is that this works for someone who wants to build a product whose initial response time is consistent across users and who can trade off some E2E latency for it. It ensures that no one waits a long time before getting the first response.

I don’t really get it. Prefill saturates compute and decode saturates memory bandwidth. Why are you not doing mixed batch?
You're totally right and we are doing a mixed batch. What we changed was the priority of performing prefills over decodes.

When looking at a variety of workloads, we realized that prioritizing finishing a query (prioritizing decodes) led to underutilization of the GPU. There tended not to be enough concurrently running requests (because prefill wasn't prioritized) to meaningfully utilize the memory bandwidth with the available decodes. This led to a system that was unfortunately neither compute-bound nor memory-bound.

By running mixed batches that prioritize prefills, we still compute some decode tokens in our spare capacity but ensure compute is as saturated as possible. This additionally leads to a buildup of decodes, so that when we are primarily computing decodes, we're pushing our memory bandwidth as much as we can.
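To make the "prefills first, decodes in spare capacity" idea concrete, here is a rough sketch of composing one mixed batch under a per-step token budget. It assumes prompts can be prefilled in chunks and that each decode costs one token per step; both are assumptions for illustration, and `build_mixed_batch` and its arguments are hypothetical names, not the real implementation.

```python
def build_mixed_batch(waiting_prompts, running_ids, token_budget):
    """Fill a batch with prefill tokens first, then decodes.

    waiting_prompts: list of (request_id, remaining_prompt_tokens).
    running_ids: ids of requests currently in the decode phase.
    token_budget: assumed cap on tokens processed in one step.
    Returns (prefill_chunks, decode_ids).
    """
    prefill_chunks = []
    budget = token_budget

    # Prefills get first claim on the budget (assumes chunked prefill).
    for request_id, remaining in waiting_prompts:
        if budget == 0:
            break
        take = min(remaining, budget)
        prefill_chunks.append((request_id, take))
        budget -= take

    # Whatever is left goes to decodes, one token per running request.
    decode_ids = running_ids[:budget]
    return prefill_chunks, decode_ids


if __name__ == "__main__":
    chunks, decodes = build_mixed_batch([(1, 100)], [10, 11, 12], 102)
    print(chunks, decodes)
```

When prompts are long, prefill tokens consume the whole budget and decodes wait, which matches the higher inter-token latency described in the original post.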

Of course, there are still plenty of improvements to be made on this front. From a scheduling perspective, the goal is to find a dynamic balance between prefill and decode that pushes both memory bandwidth and compute to their limits. A whole host of factors, such as the model architecture, input-token:output-token ratio, underlying hardware, KV-cache allocation (and many more), all play into the pressure placed on memory and compute, so there's definitely still exploration to be done!