With the exception of task specialization: fine-tuning a small model such as Mistral 7B on a specific set of tasks can outperform GPT-4 on those tasks, with cheaper and faster inference.
Not on the leaderboards mentioned here. That’s my point: you can overfit to specific tasks, but you can’t beat them on multi-task leaderboards without training on the test data.