by charlieyu1 310 days ago
Even the maths benchmarks only checked the final numerical answer against the ground truth, which means an LLM can output a lot of nonsense, guess the correct answer, and still pass.
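
To make that concrete, here is a rough sketch of what an answer-only grader might look like. This is an assumption for illustration, not the code of any particular benchmark; `extract_final_number` and `grade_answer_only` are hypothetical names, and real harnesses parse answers in their own ways.

```python
import re
from typing import Optional


def extract_final_number(output: str) -> Optional[str]:
    """Return the last number appearing in the model output, if any (hypothetical helper)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", output)
    return matches[-1] if matches else None


def grade_answer_only(output: str, ground_truth: str) -> bool:
    """Pass when the final number equals the ground truth; the reasoning is never inspected."""
    final = extract_final_number(output)
    return final is not None and float(final) == float(ground_truth)


# Two made-up outputs for "solve 2x + 3 = 11": one valid derivation, one nonsense.
sound = "2x + 3 = 11, so 2x = 8 and x = 4"
nonsense = "Add 11 and 3, take the square root, call it 4"

print(grade_answer_only(sound, "4"))
print(grade_answer_only(nonsense, "4"))
```

Both outputs get the same grade here, because the grader only ever looks at the last number and not at the steps that led to it.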