by gessha 492 days ago
Are they though? They’ve been shown to generalize poorly to tasks where you switch up some of the content.
2 comments

Here's an example. I'm interested in obscure conlangs like Volapük. I can feed an LLM (which had no idea what Volapük was) an English-language grammar of Volapük and suddenly it can translate to and from the language. That couldn't work with a chess program. I couldn't give it a rule book of Shogi and have it play that.
That's not true
They don’t generalize well on logic puzzles

https://huggingface.co/blog/yuchenlin/zebra-logic

Apologies, I was a bit curt because this is a well-worn interaction pattern.

I don't mean anything by the following either, other than that the goalposts have moved:

- This doesn't say anything about generalization, nor does it claim to.

- The occurrences of the prefix general* refer to "Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?"

- This specific suggestion was accomplished publicly to some acclaim in September

- To wit, the benchmark the article is centered around hasn't been updated since September, because the preview of the large model accomplishing that blew it out of the water, 33% on all at the time, 71%: https://huggingface.co/spaces/allenai/ZebraLogic

- These aren't supposed to be easy; they're constraint satisfaction problems, which they point out are used on the LSAT

- The other major form of this argument is the Apple paper, which shows a 5 point drop from 87% to 82% on a home-cooked model