Hacker News
by evilduck 666 days ago
Question I wrote:

> I encountered the typo "anablibg" in the sentence "I wonder how much help they had by asahi doing a lot of the kernel and ecosystem work anablibg 16k pages." What did they actually mean?

GPT-4o and Sonnet 3.5 understood it perfectly. This isn't really a problem for the large models.

For local small models:

* Gemma2 9b did not get it and thought it meant "analyzing".

* Codestral (22b) did not get it and thought it meant "allocating".

* Phi3 Mini failed spectacularly.

* Phi3 14b and Qwen2 did not get it and thought it was "annotating".

* Mistral-nemo thought it was a portmanteau "anabling" as a combination of "an" and "enabling". Partial credit for being close and some creativity?

* Llama3.1 got it perfectly.
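For a rough sense of how far off each guess is at the character level, here's a minimal sketch that computes plain Levenshtein edit distance between the typo and each candidate word. It assumes the intended word was "enabling", which the results above imply but don't state outright. The candidate list is just the guesses reported above, and the function and variable names are made up for illustration. It says nothing about how any of these models actually work internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


TYPO = "anablibg"
# "enabling" is the assumed intended word; the rest are the models' guesses.
CANDIDATES = ["enabling", "analyzing", "allocating", "annotating"]

distances = {word: levenshtein(TYPO, word) for word in CANDIDATES}
best = min(distances, key=distances.get)

for word, dist in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{word}: {dist}")
```

"enabling" is two substitutions away (a for e, b for n), and the other guesses are all further off, so the models that missed were presumably leaning on something other than spelling.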

5 comments

I wonder if they'd do better if they had the context that it's in a thread titled "Adding 16 kb page size to Android"? The "analyzing" interpretation is plausible if you don't know what 16k pages, kernels, Asahi, etc. are.
Seems like there is a bit of a roll of the dice there. The ones that got it right may have just been lucky.
Ran it a few times in new sessions, 0 failures so far.
I wonder how much of a test this is for the LLM vs whatever tokenizer/preprocessing they're doing.
Is there any task Gemma is better at compared to others?
Local LLM topics are a treadmill of “what’s best and what is preferred” changing basically weekly to monthly; it’s a rapidly evolving field. Right now, though, I tend to gravitate to Gemma2 9b for coding assistance on TypeScript work and for general question-and-answer stuff. Its embedded knowledge and speed on the computers that I have (32GB M2 Max, 16GB M1 Air, 4080 gaming desktop) make for a good balance while I’m also using the computer for other stuff. Bigger models limit what else I can run simultaneously and are slower than my reading speed; smaller models have less utility, and the speed increase is pointless if they’re dumb.
fwiw I failed to figure it out as a human, I had to check the replies.