by amilios
990 days ago
Can anyone corroborate this anecdotally? I.e., has anyone actually looked at the output of the two models side by side on common tasks? There's a lot of talk these days about academic benchmarks being pretty "broken" for modern LMs and not really showcasing the differences between models. I wonder if that's the case here or if the model is genuinely better.
- Huggingface is less likely to "cheat" by training on tests than other orgs, I think.
- Some finetunes are really good at a particular test (like XWin). This isn't necessarily a bad thing if they are good at a specific niche.