by appplication 914 days ago
It isn’t too surprising that smaller models do better. I imagine transformers are as prone to overfitting as any statistical model. There is also probably some selection bias: bigger models are more expensive, and there are simply fewer people training and iterating on them.

There are orders of magnitude fewer people playing with large (>40B parameter) models than with the small ones, which means even fewer people finetuning those models.

I can’t imagine this is anything but selection bias.
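
To make the selection-bias argument concrete, here is a minimal sketch. It assumes every finetune's score comes from the same distribution regardless of model size, so any gap in the "best" score is caused only by how many people are trying. The pool sizes (1000 vs. 10) and the normal distribution are illustrative assumptions, not real leaderboard data.

```python
import random
import statistics

def best_of(n_attempts, rng, mean=0.0, sd=1.0):
    """Best score among n_attempts finetunes drawn from one shared distribution."""
    return max(rng.gauss(mean, sd) for _ in range(n_attempts))

def mean_best(n_attempts, trials=500, seed=0):
    """Average 'top of the leaderboard' score over many simulated leaderboards."""
    rng = random.Random(seed)
    return statistics.fmean(best_of(n_attempts, rng) for _ in range(trials))

# Hypothetical pool sizes: two orders of magnitude more small-model finetunes.
small_pool = mean_best(n_attempts=1000)  # many people finetuning small models
large_pool = mean_best(n_attempts=10)    # far fewer finetuning >40B models

print(f"mean best score, 1000 attempts: {small_pool:.2f}")
print(f"mean best score,   10 attempts: {large_pool:.2f}")
```

The larger pool's best entry comes out ahead on average even though no individual attempt is better. That shows only that the selection effect exists; it says nothing about how large it is on a real leaderboard.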

> which means even fewer people finetuning those models.

Finetunes rarely led to "Top 5 performance" for the small ones. Previously the top 10+ were all 70B, with maybe a few 30B in there. There were almost no 13Bs, let alone 7Bs.

Zephyr-7b-β was one of the best 7B Mistral 0.1 finetunes of the past month and a half, and it didn't beat most 70Bs.

Even at 7B there are few foundation models, since even those take a relatively large amount of money to train. The only decent one for months has been Mistral 7B, which again didn't come that close to 70B performance.