| Apologies, I was a bit curt because this is a well-worn interaction pattern. I don't mean anything by the following either, other than that the goalposts have moved: - This doesn't say anything about generalization, nor does it claim to. - The occurrences of the prefix general* refer to the question "Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?" - This specific suggestion was accomplished publicly, to some acclaim, in September. - To wit, the benchmark the article is centered around hasn't been updated since September, because the preview of the large model that accomplished it blew the benchmark out of the water (33% on all at the time, versus 71%): https://huggingface.co/spaces/allenai/ZebraLogic - These aren't supposed to be easy; they're constraint satisfaction problems, which they point out are used on the LSAT. - The other major form of this argument is the Apple paper, which shows a 5-point drop, from 87% to 82%, on a home-cooked model. |