I haven't done any real benchmarking, but LAION's OpenAssistant fine-tuning has been done on it. It worked reasonably well for something local, but it definitely felt nowhere near as complete or advanced as any of the ChatGPT stuff. I imagine this Databricks setup is more complete there, but I personally wouldn't expect much more than GPT-3 level performance. That said, if this dataset is open (I haven't really looked at the article much yet), you could quite easily use it to tune LLaMA, just like the Stanford Alpaca models, which might be a better combo. The result wouldn't be licensed for commercial use, though, given LLaMA's underlying license.
Do we have any quantitative way of benchmarking the quality of these models at all? Like, I don’t care if a model takes one minute per token on my laptop as long as it’s “GPT-4 quality”, and I don’t care if it does 100 tokens per second if it’s straight crap. But every comparison I see people make regarding quality seems to come from “I asked it a couple of my favorite questions and it did… uh, only a little worse than GPT imo”
Perplexity scores are pretty common. As I understand it, you take a text corpus like Wikipedia and measure how well the model predicts each next word (token) of it; lower perplexity means better predictions.
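For what it's worth, here's a minimal sketch of the arithmetic behind a perplexity score, assuming you already have the probability the model assigned to each actual next token. The function name `perplexity` and the two probability lists are made up for illustration; a real evaluation would pull these probabilities out of the model over a full corpus.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood over the sequence
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical numbers: probability each model gave to the token
# that actually came next in the text.
confident_model = [0.5, 0.25, 0.5, 0.25]
guessing_model = [0.1, 0.05, 0.1, 0.05]

print("confident:", perplexity(confident_model))
print("guessing: ", perplexity(guessing_model))
```

One way to read the number: a model that always spreads its probability evenly over 4 candidate tokens has a perplexity of 4, so it's roughly "how many options the model is torn between" at each step. It only measures next-token prediction on that corpus, not whether the answers feel GPT-4 quality in a chat.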