by hipmanbro 1129 days ago
I noticed a few reasoning benchmarks where I think they omitted a direct comparison because PaLM 2 wasn't as competitive with GPT-4, and instead opted to show only comparisons against other versions of PaLM or other language models:

HellaSwag: GPT-4: 95.3%, PaLM 2-L: 86.8%

MMLU: GPT-4: 86.4%, Flan-PaLM 2-L: 81.2%

ARC: GPT-4: 96.3%, PaLM 2-L: 89.7%

(from: GPT-4 paper: https://arxiv.org/pdf/2303.08774.pdf)
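
To put a number on the gaps, here is a quick Python sketch. The scores are copied from the lines above; the `scores` dict and the `gap` helper are just names I made up for illustration, and this only subtracts the reported percentages without accounting for differences in evaluation setup (shot count, prompting) between the two reports.

```python
# Reported accuracies (percent), copied from the comparison above.
# Each entry: benchmark -> (GPT-4 score, PaLM 2 variant score)
scores = {
    "HellaSwag": (95.3, 86.8),  # PaLM 2-L
    "MMLU": (86.4, 81.2),       # Flan-PaLM 2-L
    "ARC": (96.3, 89.7),        # PaLM 2-L
}


def gap(gpt4: float, palm2: float) -> float:
    """Return the GPT-4 lead in percentage points, rounded to one decimal."""
    return round(gpt4 - palm2, 1)


# Lead of GPT-4 over the PaLM 2 variant on each benchmark.
gaps = {name: gap(g, p) for name, (g, p) in scores.items()}

for name, points in gaps.items():
    print(f"{name}: GPT-4 ahead by {points} points")
```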