I haven't played with the model just yet, but just eyeballing its performance, it's significantly worse. I'm surprised they don't have Pythia on there, as that's what they're based on, from my understanding.
At their performance level, the most important comparison is to GPT-NeoX, and I do appreciate that they aren't making the "95% of GPT4" claims that some fine-tuned LLaMA models are.
EDIT: For Databricks people: I'd love to see this compared with Pythia, LLaMA, Alpaca, and Vicuna/GPT4All if possible.
Out of curiosity: what's an example of a metric that you would use to evaluate the ability of the model? For example, just looking qualitatively, giving Pythia a prompt like "How do I tie a tie?" produces content that doesn't even reasonably respond to it. And yet many benchmarks have no problem with that.