Hacker News
by ekelsen 1023 days ago
Llama2 chat performs worse and wasn't included for that reason.

The numbers are different because the measurement is different. The blog post explains that we sample from the models and expect answers rather than relying on perplexity measurements.
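To make the difference concrete, here is a rough sketch of the two scoring styles on a multiple-choice question. It is not the code from the blog post. The data, the function names, and the answer-parsing rule are all made up for illustration, and ranking options by log-probability is only a stand-in for perplexity-style scoring.

```python
import re

def pick_by_likelihood(option_logprobs):
    """Likelihood-style scoring: take the option the model assigns the
    highest log-probability, without generating any text."""
    return max(option_logprobs, key=option_logprobs.get)

def pick_by_sampling(generated_text, options):
    """Generation-style scoring: read the answer out of text the model
    actually produced. The parsing rule (first standalone A-D letter)
    is a hypothetical choice for this sketch."""
    match = re.search(r"\b([A-D])\b", generated_text)
    if match and match.group(1) in options:
        return match.group(1)
    return None  # an unparseable response is scored as wrong

def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical model outputs for three questions (not real results).
examples = [
    {"gold": "B", "logprobs": {"A": -2.1, "B": -0.4, "C": -3.0, "D": -2.7},
     "sample": "The answer is B."},
    {"gold": "C", "logprobs": {"A": -1.9, "B": -2.2, "C": -0.8, "D": -2.5},
     "sample": "I am not sure, it depends."},
    {"gold": "A", "logprobs": {"A": -0.9, "B": -1.2, "C": -2.0, "D": -2.4},
     "sample": "A"},
]

gold = [ex["gold"] for ex in examples]
likelihood_preds = [pick_by_likelihood(ex["logprobs"]) for ex in examples]
sampling_preds = [pick_by_sampling(ex["sample"], ex["logprobs"]) for ex in examples]

likelihood_acc = accuracy(likelihood_preds, gold)
sampling_acc = accuracy(sampling_preds, gold)
print(f"likelihood-ranked accuracy: {likelihood_acc:.2f}")
print(f"sampled-answer accuracy:    {sampling_acc:.2f}")
```

In this toy data the second question is where the two diverge: the correct option has the highest log-probability, but the sampled response never commits to an answer, so it only counts under the first method.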

1 comment

Could you share the results using the standard way of benchmarking (accuracy of the top selection)? The approach you took is reasonable, but it would be more informative to see how much better or worse the model does on the standard benchmark.