by 1dom 24 days ago

I like work in this area, and this is really helpful, thanks. I actively avoid cloud based LLMs and mainly use 4b - 30a3b param local models. This means I don't really have a good grasp of SOTA LLM performance or accuracy, but I know what to expect when dealing with local models, and where the pain points are.

I've only skimmed the post and read the abstract and in some places you make a nod to how simple tweaks can make something 10x faster/slower, but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed.

Specifically for agentic workflows and local models, accuracy around function/tool calling hasn't been a problem for me now for about 6 - 12 months, personally, since around QwenCoder3. The main issue is context management and the impact on timing, since agents will often swap prompts and break prompt caching and similar timing improvements.
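To make the prompt-caching point concrete, here is a minimal sketch, assuming a prefix-style cache in which only the leading tokens shared with the previous request can be reused. The whitespace "tokenizer", the example prompts, and both function names are hypothetical stand-ins, not any particular runtime's API.

```python
def shared_prefix_len(prev_tokens, next_tokens):
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        n += 1
    return n


def reusable_fraction(prev_prompt, next_prompt):
    """Fraction of the next prompt a prefix cache could skip re-processing."""
    prev_tokens = prev_prompt.split()  # toy tokenizer, for illustration only
    next_tokens = next_prompt.split()
    if not next_tokens:
        return 0.0
    return shared_prefix_len(prev_tokens, next_tokens) / len(next_tokens)


system_a = "You are a coding agent. Use tools carefully."
system_b = "You are a planning agent. Use tools carefully."
history = "user: list files assistant: calling ls"
new_turn = "tool: a.py b.py"

previous = f"{system_a} {history}"
# Same agent, one more turn appended: almost everything is reusable.
same_agent = reusable_fraction(previous, f"{system_a} {history} {new_turn}")
# Agent swapped its system prompt: the shared prefix ends after a few tokens.
swapped = reusable_fraction(previous, f"{system_b} {history} {new_turn}")

print(f"same agent: {same_agent:.0%} reusable, swapped prompt: {swapped:.0%} reusable")
```

In this toy setup the conversation history is identical in both cases, but changing a word near the start of the prompt means everything after it has to be processed again, which is where the timing cost comes from.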

It looks like your work adds layers and wrappers like guard rails and retries. This would make my local model experience - specifically for agents - unusable because of the delays it would add.

I really appreciate and respect the work you've done, and apologies if you have already addressed this head on, but with so little talk about the impact on timing here, I feel like you're hiding something or overinflating the actual real world improvements here - what are your thoughts?

It's also mildly concerning to me that nobody else has raised this - am I doing something wrong here, or is everyone else just not actually using local models in real life?! Talk to me about your speed experiences!

3 comments

JKCalhoun 24 days ago

"I actively avoid cloud based LLMs… This means I don't really have a good grasp of SOTA LLM performance or accuracy…

…but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed."

I wonder, if you were to use cloud-based LLMs more often, you might find that accuracy (fidelity?) is indeed more lacking in your local models.

You can always just throw hardware at your speed problems after all.

1dom 24 days ago

I agree accuracy maybe isn't the best word here. I used it as it was used in the original post, mainly as a catchall for "everything but speed", so fidelity, perplexity, etc.

I also agree that if I spent more time using cloud based LLMs, I would very much find local LLMs less capable and useful. Comparison is the thief of joy though, and I'd rather feel blissful ignorance towards SOTA LLMs than a dependence on them.

Before taking a local focus approach, LLMs increasingly left me feeling a mixture of FOMO, sadness and futility towards the future of software and tech. I assume it's 100% a me problem, but it has its benefits :)

JKCalhoun 23 days ago

No, I'm a fan of local as well. For me though, there is just such a fascination that I can have something like this sitting on my own hard drive. It's okay that it's not a "frontier model".

zambelli 24 days ago

Hi! Latency is definitely a factor in any system, and the dashboard and paper do report elapsed time - but at the workflow level.

On a per-call basis, the wrappers are pure python ifs and such, measured in ms easily, and frankly negligible compared to the LLM call itself, which will be on the order of seconds.

Where timing gets interesting is that forge will slow down workflows because the retries mean you don't error right away. Bare runs were failing fast in my experience. But on a per-call basis there's very little overhead.

I haven't detailed it simply because the order of magnitude of a single LLM call is so much higher than all the overhead put together.
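As a rough illustration of the two effects described above (cheap per-call checks, but extra model calls when a retry fires), here is a minimal sketch. It is not forge's actual code: `call_with_retries`, `make_fake_llm`, `looks_like_tool_call`, the nudge text, and the simulated 20 ms "model" are all hypothetical, chosen only to show where the time goes.

```python
import time


def call_with_retries(llm_call, validate, prompt, max_retries=2):
    """Call the model, check the reply with plain Python, retry with a nudge.

    Returns (result, number_of_llm_calls, seconds_spent_in_wrapper_logic).
    """
    overhead = 0.0
    calls = 0
    current_prompt = prompt
    for _ in range(max_retries + 1):
        result = llm_call(current_prompt)  # the expensive part
        calls += 1

        start = time.perf_counter()
        ok = validate(result)  # the cheap "ifs and such"
        if not ok:
            current_prompt = prompt + "\nYour last reply was invalid. Return a tool call."
        overhead += time.perf_counter() - start

        if ok:
            return result, calls, overhead
    return None, calls, overhead  # gave up after max_retries nudges


def make_fake_llm(replies, delay=0.02):
    """Stand-in for a model: sleeps to simulate latency, then returns canned replies."""
    it = iter(replies)

    def fake_llm(prompt):
        time.sleep(delay)
        return next(it)

    return fake_llm


def looks_like_tool_call(text):
    """Toy validity check; a real check would parse and validate the call."""
    return text.startswith("{") and '"tool"' in text


result, calls, overhead = call_with_retries(
    make_fake_llm(["sure, I will do that", '{"tool": "ls"}']),
    looks_like_tool_call,
    "List the files.",
)
print(f"result={result!r} calls={calls} wrapper_overhead={overhead * 1000:.3f} ms")
```

In this sketch a bare run would have failed after one call, while the wrapped run succeeds after two. The workflow takes longer, but the time spent in the wrapper's own logic stays tiny next to the simulated model latency.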

1dom 24 days ago

Hi! Thanks for the response. Like I mentioned, I only skimmed, and it sounds like there's more to it than I understand, so I'll take a deeper look and see how it feels in practice.

> Where timing gets interesting is that forge will slow down workflows because the retries mean you don't error right away. Bare runs were failing fast in my experience. But on a per-call basis there's very little overhead.

> I haven't detailed it simply because the order of magnitude of a single LLM call is so much higher than all the overhead put together.

Yeah, that makes sense and seems fair. Those sorts of delays are almost an inevitability: you're not trying to improve speed, but by improving reliability, it can obviously increase overall throughput.

Having watched the demo video too now, automating retries etc would be helpful for me. It's impressive to see how quick the models run on better hardware, and the performance improvements are impressive, even if the overall run takes longer sometimes because it does more correct things. Thanks again!

anentropic 24 days ago

> On a per-call basis, the wrappers are pure python ifs and such, measured in ms easily

Ah that's good to know

when I first saw this posted yesterday I was wondering that, kind of assumed maybe it was doing extra LLM calls to make judgements

zambelli 24 days ago

Retry nudges do generate an extra LLM call, and the average time impact of those extra calls is captured in the eval data.

But that's the difference between the call failing and succeeding (eventually).

On successful calls the presence of forge should be unnoticeable.

NooneAtAll3 24 days ago

what does "30a3b" mean?

1dom 24 days ago

Yup, confirming what pamcake said, 30b with 3b active.

I have a laptop with a broken screen and an RTX2060 at my disposal. I can run 12b - 14b dense usably, just, although I think 4b - 8b dense models give me the best tradeoff of speed and usefulness.

Larger MoE models with more parameters (20b+) but fewer active (2 - 3b) are sometimes a little bit slower, but are often far more capable.
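A back-of-the-envelope sketch of that tradeoff, under two loudly stated assumptions that are not from this thread: weights are 4-bit quantized (about 0.5 bytes per parameter), and a forward pass costs roughly 2 FLOPs per active parameter per token. The function names and the model table are illustrative only.

```python
def weight_memory_gb(total_params_b, bytes_per_param=0.5):
    """Approximate weight size in GB; 0.5 bytes/param assumes 4-bit quantization."""
    return total_params_b * bytes_per_param


def gflops_per_token(active_params_b):
    """Rough forward-pass cost: about 2 FLOPs per active parameter per token."""
    return 2 * active_params_b


# name -> (total parameters, active parameters), both in billions
models = {
    "8b dense": (8, 8),
    "14b dense": (14, 14),
    "30a3b MoE": (30, 3),
}

for name, (total, active) in models.items():
    print(
        f"{name}: ~{weight_memory_gb(total):.1f} GB of weights, "
        f"~{gflops_per_token(active):.0f} GFLOPs per token"
    )
```

Under these assumptions a 30a3b model needs the memory of a 30b model but only about the per-token compute of a 3b one. The estimate ignores memory bandwidth, KV cache size, and any offloading of weights out of VRAM, which may be why such models can still feel a little slower on a small GPU.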

pamcake 24 days ago

Guess: 30B MoE with 3B active
