
by mickeyp 16 days ago

Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat.

It's prefill; slow prefill kills agentic workloads dead.

If you have 100,000 tokens at ~150 tok/s per the OP, you're looking at:

    You have: 100000 / (150/s)

    You want: hms

     11 min + 6.6666667 sec

Which is quite a wait indeed.
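
The same arithmetic as a small Python sketch, in case you want to try other context sizes. The 150 tok/s prefill rate is the figure quoted from the OP; the function names are only illustrative.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time spent processing the whole prompt before the first output token."""
    return prompt_tokens / prefill_tok_per_s


def as_min_sec(seconds: float) -> str:
    """Format a duration as whole minutes plus remaining seconds."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)} min + {rest:.1f} sec"


if __name__ == "__main__":
    # Context sizes below are arbitrary examples, not measurements.
    for tokens in (10_000, 50_000, 100_000):
        print(tokens, as_min_sec(prefill_seconds(tokens, 150)))
```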

4 comments

Aurornis 16 days ago

Most people won’t be dumping 100K tokens into it at once, but I agree that all of the prefill time that adds up during a session becomes a lot to account for.

This is also a problem for all of the Mac local LLMs. Macs are a great way to get a lot of high bandwidth memory, but their compute is very far behind current gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time for it to get to the point of generating those tokens.


pyrolistical 15 days ago

If the prefix cache is working properly, 100k doesn’t prefill more than once.
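
To put rough numbers on that, here is a back-of-the-envelope sketch comparing a session with and without prefix caching. It assumes an idealized cache (every previously seen token is a hit and costs nothing to reuse), the OP's ~150 tok/s prefill rate, and a made-up session where the context grows by 5,000 tokens per turn. The turn sizes are hypothetical, not taken from the thread.

```python
PREFILL_TOK_PER_S = 150  # prefill rate quoted from the OP


def session_prefill_seconds(context_sizes, cached: bool) -> float:
    """Total prefill time over a session.

    context_sizes: total prompt length (tokens) at each turn, non-decreasing.
    cached: if True, assume an ideal prefix cache, so only the tokens added
            since the previous turn are prefilled.
    """
    total_tokens = 0
    previous = 0
    for size in context_sizes:
        total_tokens += (size - previous) if cached else size
        previous = size
    return total_tokens / PREFILL_TOK_PER_S


# Hypothetical session: 20 turns, context grows 5,000 tokens per turn to 100k.
turns = [5_000 * i for i in range(1, 21)]

no_cache = session_prefill_seconds(turns, cached=False)
with_cache = session_prefill_seconds(turns, cached=True)
print(f"no cache:   {no_cache / 60:.1f} min")
print(f"with cache: {with_cache / 60:.1f} min")
```

Under these assumptions the cached session pays for the 100k tokens exactly once, spread across the turns; without the cache the total prefill time is roughly ten times larger.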


Tepix 16 days ago

When you're using OpenCode it's easy to reach 100,000 tokens after a while.


HarHarVeryFunny 16 days ago

I wonder if this could be usefully mitigated with a combination of prompt (prefix) caching and an agent that lets you control what the prompt prefix consists of. The goal would be to incur that slow prefill once to build the prompt cache, then have subsequent prompts consist of mostly this fixed prefix plus specific instructions.

For a language like C++ where modules are split into definition (.h) and implementation (.cpp) parts, one choice of prefix would be all the header files for the project (which aren't likely to change much).

More generally the idea would be to have an agent that had cached-prefix reuse as its primary context management goal.

Another possibility, to support caching of files that have since changed, would be for the agent to build the context as a fixed prefix reflecting some or all of the codebase in its start-of-session state, then append any changes to that, with appropriate prompting to only use the latest definition of a function.

e.g.

Say file A initially contains functions X, Y and Z, then the prompt prefix is built to include X Y Z. If the user then modifies Y -> Y', then just add that to the context, so that the cached prefix is unchanged, giving X Y Z Y'.
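
A toy sketch of that append-only scheme, to show why it keeps the cache valid. Everything here is hypothetical: the helper names are made up, and each list entry stands in for a run of tokens (a real prefix cache matches at the token or block level, not per function).

```python
def build_prefix(functions: dict[str, str]) -> list[str]:
    """Start-of-session snapshot: one context entry per function, fixed order."""
    return [f"## {name}\n{body}" for name, body in functions.items()]


def append_change(context: list[str], name: str, new_body: str) -> list[str]:
    """Record an edit by appending; earlier entries are never touched."""
    note = f"## {name} (updated; supersedes the earlier definition)\n{new_body}"
    return context + [note]


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading entries two contexts share (what a cache could reuse)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


snapshot = {"X": "def X(): ...", "Y": "def Y(): ...", "Z": "def Z(): ..."}
prefix = build_prefix(snapshot)  # X Y Z

# Option 1: append the new Y, giving X Y Z Y'.
appended = append_change(prefix, "Y", "def Y(): return 1")
# Option 2: rebuild the context with Y edited in place, giving X Y' Z.
rewritten = build_prefix({**snapshot, "Y": "def Y(): return 1"})

print(shared_prefix_len(prefix, appended))   # 3: the whole original prefix is reusable
print(shared_prefix_len(prefix, rewritten))  # 1: the cache is only valid up to X
```

The trade-off is that the context now holds both Y and Y', which is why the prompt has to tell the model to use only the latest definition.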


anigbrowl 15 days ago

Can't you structure things like loading a codebase or priming with reference material to happen overnight or during meal breaks etc? I guess it's frustrating if you want to switch to a project and have the LLM begin co-working with you immediately, but even the best human collaborator would require a long period to get up to speed before being able to make meaningful contributions.


pastage 16 days ago

A quick search says that this is a standard feature: you cache the prefill and load it at PCIe bandwidth, so it should be about 0.2 s.
