| HN Mirror

by thewataccount 1205 days ago

I haven't played with the model yet, but just eyeballing its performance, it's significantly worse. I'm surprised they don't have Pythia on there, since that's what they're based on, from my understanding.

At their performance level, the most important comparison is to GPT-NeoX, and I do appreciate that they aren't making the "95% of GPT-4" claims that some fine-tuned LLaMA models are.

EDIT: For Databricks people: I'd love to see this compared with Pythia, LLaMA, Alpaca, and Vicuna/GPT4All if possible.

1 comment

ankitmathur 1205 days ago

Out of curiosity: what's an example of a metric that you would use to evaluate the ability of the model? For example, just looking qualitatively, giving Pythia a prompt like "How do I tie a tie?" produces content that doesn't even reasonably respond to it. And yet many benchmarks have no problem with that.
