by cpard
60 days ago
Benchmarks/evals are really hard, and they become harder when there's a huge incentive to game them at an industry scale.

ELT-Bench is another recent example. It was the first serious attempt at a benchmark for data engineering workloads, published about a year ago. A few days ago, a follow-up paper from a group that includes one of the original authors audited the benchmark itself. The team found that the benchmark has structural issues that biased the results. Here's the paper: https://arxiv.org/abs/2603.29399

None of this is new, though. The industry has gone through all of it before, just at a smaller scale, and there's a lot to learn from that. Here's a post I wrote on the parallels between what we see today and the benchmarketing wars of the database systems: https://www.typedef.ai/blog/from-benchmarketing-to-benchmaxx...
|
You need new datasets perpetually.