Good catch on the numbers. 29/33 vs 33/33 is the kind of gap that could easily be noise with that sample size. You'd need hundreds of runs to draw any meaningful conclusion about a 4-run difference (roughly 12 percentage points), especially given how non-deterministic these models are.
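
For a rough sense of scale, here is a minimal sketch of a two-sided Fisher exact test on those counts, using only the Python standard library. It assumes 29/33 and 33/33 are pass counts from independent runs, which the article may not guarantee. `fisher_exact_two_sided` is my own illustrative helper, not something from the article.

```python
from math import comb


def fisher_exact_two_sided(a_pass, a_fail, b_pass, b_fail):
    """Two-sided Fisher exact p-value for a 2x2 table of pass/fail counts.

    Hypothetical helper for illustration; assumes independent runs.
    """
    n_a = a_pass + a_fail
    n_b = b_pass + b_fail
    total_fail = a_fail + b_fail
    denom = comb(n_a + n_b, total_fail)

    def prob(k):
        # Probability that exactly k of the failures land in group A,
        # given the row and column totals (hypergeometric distribution).
        return comb(n_a, k) * comb(n_b, total_fail - k) / denom

    observed = prob(a_fail)
    lo = max(0, total_fail - n_b)
    hi = min(n_a, total_fail)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9)
    )


# Counts from the article: 29 of 33 passes vs 33 of 33 passes.
p_value = fisher_exact_two_sided(29, 4, 33, 0)
print(f"p = {p_value:.3f}")  # prints "p = 0.114"
```

That p-value is above the usual 0.05 threshold, so the gap is consistent with noise, though it doesn't show the two setups perform the same either.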

This is a recurring problem with LLM benchmarking: small sample sizes presented with high confidence. The underlying finding (always-in-context > lazy-loaded) is probably directionally correct, but the specific numbers don't support the strength of the claims in the article.