Hacker News new | ask | show | jobs
by kaydub 15 days ago
I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

I flip between models all the time. Makes little difference. Sometimes one model is faster or better than another but there's no rhyme or reason why.

2 comments

> I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

We benchmark non-deterministic things all the time and it's frankly not even that unusual or hard. You yourself indicate that one model outperforms another one in your experience on various facets, and that is itself a benchmark.

The more relevant question is probably how well does a given benchmark translate to improvement on a specific desired outcome or task. The military uses the ASVAB testing battery to benchmark potential new recruits for suitability in various career specialties, but the actual outcome the benchmark is meant to correlate with is later success in the training pipeline.

So every so often the various military branches have to do and compare ASVAB results against training results and make sure that they still have a predictive relationship.

And this is benchmarking real flesh-and-blood human beings where you get on the order of magnitude of a million data points or so per year. You can benchmark AIs much more efficiently than that, as non-deterministic as they are, and as long as the benchmark itself is reasonably predictive of outcome it's going to be useful information.

My anecdotal experience isn't a benchmark. Just because I feel like something is better or different doesn't mean it actually is.
> Just because I feel like something is better or different doesn't mean it actually is.

Of course, but it is a data point, and multiple such data points can be aggregated. This is true even if all you can do is compare two things.

The shape of that data will reveal something more about the thing you want to measure than the null hypothesis you'd otherwise have.

All tools are non-deterministic on some reasonably specified input set.