by killcoder
30 days ago
I think 'actual parallelism' is a vastly easier and more fruitful way to get better performance out of these kinds of systems than pushing for faster single-threaded generation. Tool calls and responses are often embarrassingly parallel. Code generation tasks naturally have a dependency tree that can be unrolled into a fixed budget of parallelism. Tasks can be hierarchically decomposed into subtasks. It's the same asynchronous stream pattern we're used to dealing with in regular software engineering: we have a fixed thread pool and lots of work that can be scheduled concurrently. Since these are streams, we can do the compute incrementally to reduce the time-to-first-byte/token/response. Since so many tool calls are inherently asynchronous, and subagent task decomposition can be modelled as such, the IO streams can be oversubscribed, and incoming responses can be priority queued.

On the intelligence front, it's incredible how much better frontier models perform when you just interrupt them every so often and go 'is that the best you can do?', or re-iterate instructions, or repeat the overall goal. I find instruction following _so poor_, especially for 'presentation layer' aspects. Yet if I ask the model to rewrite its last response, it does so perfectly. Why can't the model do this 'internally' and save me having to say 'try again'? Just because the 'model' is autoregressive doesn't mean the system as a whole needs to present a single stream of immutable text.
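To make the 'dependency tree unrolled into a fixed budget of parallelism' idea concrete, here's a rough sketch using Python's asyncio. Everything in it is made up for illustration: `run_graph`, `fake_tool`, and the task names are hypothetical, and the sleep stands in for whatever tool call or model request you'd really be waiting on.

```python
import asyncio


async def run_graph(deps, work, budget=4):
    """Run each task once its dependencies finish, at most `budget` at a time.

    deps maps a task name to the names it depends on (assumed acyclic;
    a cycle would deadlock). work(name, inputs) is any async callable.
    """
    sem = asyncio.Semaphore(budget)  # the fixed parallelism budget
    tasks = {}

    async def run(name):
        # Waiting on dependencies happens outside the semaphore, so a
        # blocked task never occupies one of the `budget` slots.
        inputs = [await tasks[dep] for dep in deps[name]]
        async with sem:
            return await work(name, inputs)

    # Schedule everything up front; no coroutine body runs until we await,
    # so `tasks` is fully populated before any lookup happens.
    for name in deps:
        tasks[name] = asyncio.create_task(run(name))
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks, results))


async def fake_tool(name, inputs):
    # Hypothetical stand-in for an IO-bound tool call.
    await asyncio.sleep(0.01)
    return name + "(" + ",".join(inputs) + ")"


deps = {
    "parse": [],
    "lint": ["parse"],
    "test": ["parse"],
    "report": ["lint", "test"],
}
results = asyncio.run(run_graph(deps, fake_tool, budget=2))
```

Here `lint` and `test` run concurrently as soon as `parse` is done, and `report` waits for both. Swapping the semaphore for an `asyncio.PriorityQueue` of pending work would be one way to get the priority-queued responses, but that's left out to keep the sketch short.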
Works fine. Forcing LLMs to output parsable responses is a good workaround to get them to do what you want until they improve. It also lets you use the fast models (e.g. I spin up the Gemini 3.1 flash lite model for these tasks), so these tasks are done in seconds rather than minutes.
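A minimal sketch of what that can look like, assuming you ask for JSON and retry automatically when the reply doesn't parse, so the 'try again' happens inside the system rather than in the chat. `ask_for_json` and `call_model` are hypothetical names; `call_model` is any function that takes a prompt string and returns the model's text, not a real client API.

```python
import json


def ask_for_json(call_model, prompt, required_keys, max_attempts=3):
    """Ask until the reply parses as a JSON object with the required keys."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as err:
            # Feed the parse error back instead of making the user say "try again".
            feedback = f"\n\nYour last reply was not valid JSON ({err.msg}). Reply with JSON only."
            continue
        if not isinstance(data, dict):
            feedback = "\n\nYour last reply was not a JSON object. Reply with a JSON object only."
            continue
        missing = [key for key in required_keys if key not in data]
        if not missing:
            return data
        feedback = f"\n\nYour last reply was missing keys: {missing}. Reply with JSON only."
    raise ValueError("model never produced a parsable response")


# Hypothetical fake model that fails twice, then complies.
replies = iter([
    "sure! here you go",
    '{"title": "x"}',
    '{"title": "x", "tags": []}',
])
prompts_seen = []


def fake_model(prompt):
    prompts_seen.append(prompt)
    return next(replies)


result = ask_for_json(fake_model, "Summarise the thread.", ["title", "tags"])
```

Because each check is cheap and mechanical, the retries can go to a small fast model, which is where the seconds-rather-than-minutes saving would come from.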