by hipmanbro
1129 days ago
I noticed a few reasoning benchmarks where I think they omitted a direct comparison because PaLM 2 wasn't as competitive with GPT-4, and instead opted to show only benchmarks comparing it to other versions of PaLM or other language models:

HellaSwag: GPT-4: 95.3%, PaLM 2-L: 86.8%

MMLU: GPT-4: 86.4%, Flan-PaLM 2-L: 81.2%

ARC: GPT-4: 96.3%, PaLM 2-L: 89.7%

(from the GPT-4 paper: https://arxiv.org/pdf/2303.08774.pdf)