| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by seemaze 23 days ago

> One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode.

I thought Llamafile was just a model and llama.cpp bundled in to a single binary - is this the difference between Llamafile injecting a default sysmtem prompt vs hitting the raw llama-server endpoint with no harness?

That seems like comparing apples to apple pie, there's some ingredients missing.

2 comments

zambelli 23 days ago

I was surprised as well. I did go with an extreme (but true) example in the post. In this case, native function-calling template likely is in play.

However, that doesn't explain the Lamaserver prompt vs llamafile at ~ +4pts, or vs Ollama (at ~ +30ish pts) that sits almost perfectly between llamaserver native and llamafile.

The backend affects almost all model families, and was just something I've never seen really talked about.

link

eob 23 days ago

Do you have any suspicion about what is different between the backends?

That's an absolutely bonkers statistic: it would mean spurious differences in hosting container overwhelm the performance differences between models.

link

zambelli 23 days ago

I genuinely don't, sadly. I'm a mathematician originally, evolved organically into ML then AI - but I never really was a SWE.

I feel like there's some backend decoding or chat template thing going on at a much lower level than what I'm best at. Maybe it's injecting headers or something that eventually compounds to model confusion? I really have no idea.

I really hope folks better than me at backend stuff take a look and dive into it though because it's definitely under-reported and super consistent across model families and backends ranging from ollama, lama.cpp native, prompt, llamafile, and even vLLM that I didn't formally benchmark in the repo.

link

kosolam 22 days ago

Hey, this is most probably related to the chat template or the reasoning parser or the tool call parser or also things like kv cache quantization and possibly other params that affect results like the regular top k top p and all of that, the backend often sets its own defaults or the lack of them. It’s best to have all these under control if possible. I wonder regarding this project have you been testing it on real world projects? I’m working on an agentic loop as well also using a local model.

link

zambelli 22 days ago

Yes I've now used it "in the wild" for a handful of use-cases. I still run into the backend thing even when declaring params though, which is odd to me. But there might be params not typically passed in with the model that backends are setting. Again, really not my area of expertise.

As for consumers, I've done a home assistant, an agentic coding harness, and an autonomous engineering project (still in flight).

link

imachine1980_ 23 days ago

I wouldn't expect such difference

link