| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by zambelli 23 days ago

I was surprised as well. I did go with an extreme (but true) example in the post. In this case, native function-calling template likely is in play.

However, that doesn't explain the Lamaserver prompt vs llamafile at ~ +4pts, or vs Ollama (at ~ +30ish pts) that sits almost perfectly between llamaserver native and llamafile.

The backend affects almost all model families, and was just something I've never seen really talked about.

1 comments

eob 23 days ago

Do you have any suspicion about what is different between the backends?

That's an absolutely bonkers statistic: it would mean spurious differences in hosting container overwhelm the performance differences between models.

link

zambelli 23 days ago

I genuinely don't, sadly. I'm a mathematician originally, evolved organically into ML then AI - but I never really was a SWE.

I feel like there's some backend decoding or chat template thing going on at a much lower level than what I'm best at. Maybe it's injecting headers or something that eventually compounds to model confusion? I really have no idea.

I really hope folks better than me at backend stuff take a look and dive into it though because it's definitely under-reported and super consistent across model families and backends ranging from ollama, lama.cpp native, prompt, llamafile, and even vLLM that I didn't formally benchmark in the repo.

link

kosolam 23 days ago

Hey, this is most probably related to the chat template or the reasoning parser or the tool call parser or also things like kv cache quantization and possibly other params that affect results like the regular top k top p and all of that, the backend often sets its own defaults or the lack of them. It’s best to have all these under control if possible. I wonder regarding this project have you been testing it on real world projects? I’m working on an agentic loop as well also using a local model.

link

zambelli 22 days ago

Yes I've now used it "in the wild" for a handful of use-cases. I still run into the backend thing even when declaring params though, which is odd to me. But there might be params not typically passed in with the model that backends are setting. Again, really not my area of expertise.

As for consumers, I've done a home assistant, an agentic coding harness, and an autonomous engineering project (still in flight).

link