by nabakin
930 days ago
More or less. The automated benchmarks themselves can be useful once you weed out the models that are overfitting to them. That said, anyone claiming a 7B LLM is better than a well-trained 70B LLM like Llama 2 70B chat in the general case doesn't know what they are talking about. Will it be possible in the future? Absolutely, but today we have no architecture or training methodology that would allow it. You can rank models yourself with a private automated benchmark that models haven't had a chance to overfit to, or with a good human evaluation study. Edit: also, I guess OP is talking about Mistral finetunes (ones overfitting to the benchmarks) beating out 70B models on the leaderboard, because Mistral 7B itself is ranked lower than Llama 2 70B chat.
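To make the private-benchmark idea concrete, here is a minimal sketch of what ranking models on your own held-out question set could look like. Everything in it is an assumption for illustration: the `rank_models` helper, the convention that a model is just a function from prompt to answer, and the exact-match scoring are all hypothetical stand-ins, not how any leaderboard actually works.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical convention: a "model" is any function mapping a prompt to an answer string.
Model = Callable[[str], str]


def rank_models(
    models: Dict[str, Model],
    benchmark: List[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Rank models by exact-match accuracy on a private (question, answer) set.

    Exact match is a deliberately crude metric, assumed here only to keep
    the sketch short; real evaluations usually need more careful scoring.
    """
    if not benchmark:
        raise ValueError("benchmark must contain at least one item")

    scores: Dict[str, float] = {}
    for name, answer_fn in models.items():
        correct = sum(
            answer_fn(question).strip().lower() == expected.strip().lower()
            for question, expected in benchmark
        )
        scores[name] = correct / len(benchmark)

    # Highest accuracy first; ties keep the order the models were given in.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for real models, purely for demonstration.
    private_set = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
    toy_models = {
        "always-4": lambda prompt: "4",
        "lookup": lambda prompt: dict(private_set).get(prompt, ""),
    }
    for name, accuracy in rank_models(toy_models, private_set):
        print(f"{name}: {accuracy:.2f}")
```

The part that matters is not the code but keeping the question set private: once the items leak into training data, the ranking stops telling you anything about the general case.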
We clearly see that Mistral-7B is, in some important and representative respects (e.g. coding), superior to Falcon-180B, and superior across the board to stuff like OPT-175B or Bloom-175B.
"Well trained" is relative. Models are, overwhelmingly, functions of their data, not just of scale and architecture. Better data allows for yet-unknown performance jumps, and data curation techniques are closely guarded secrets. I have no doubt that a 7B beating our best 60-70Bs is possible already, e.g. using something like the Phi methods for data and a more powerful architecture such as some variation of the universal transformer.