Hacker News new | ask | show | jobs
by imjonse 1023 days ago
Congrats on the release! Two questions.

1) In the results table, Llama2 base is being compared to Persimmon base and finetuned, and only the latter performs better. Would a comparison to Llama2-chat be possible/fair?

2) The Llama-2 numbers for MMLU in that table seem different from those in the HF leaderboard and the Llama-2 webpage presentation. Is it the 1-shot variant that is different or are these measurements not 100% standard and reproducible?

1 comments

Llama2 chat performs worse and wasn't included for that reason.

The numbers are different because the measurement is different. The blog post explains that we sample from the models and expect answers rather than relying on perplexity measurements.

Could you share the results with standard way of benchmarking(accuracy of top selection). While the approach you guys took is reasonable, but it would be more informative to see to see how much better/worse it is with standard benchmark.