
by igouy 53 days ago

> it's just not good comparison of language speeds

It's not that the benchmarks game is not a good benchmark suite, it isn't a benchmark suite.

It's not that the benchmarks game is not a good comparison of language speeds, it's that comparison of "language speeds" is so under-specified as-to-be wishful thinking.

> Java was designed to…

"… build software for the next generation of consumer electronics – think smart toasters, interactive TVs, and other futuristic gadgets." Things change.

>… the very things that low-level languages have always been good at…

Which is why some people find it somewhat surprising that those kinds of Java programs are in any way comparable.


pron 53 days ago

> It's not that the benchmarks game is not a good benchmark suite, it isn't a benchmark suite.

OK, but I was responding to someone who did consider it to be a benchmark suite. As long as we agree it's not a good benchmark suite whatever it considers itself to be, we're in agreement.

> It's not that the benchmarks game is not a good comparison of language speeds, it's that comparison of "language speeds" is so under-specified as-to-be wishful thinking.

With that I completely agree. But if you group results by language, that's exactly what you're inviting, and if your suite of benchmarks (or whatever you want to call it) covered a wider range of problems, that point could be more easily seen. Let's say that the combination of grouping results by language and covering only a very narrow (and niche) set of problems, one that also happens to be the sweet spot of some languages with significant performance failings in other use cases, doesn't exactly help people get the right impression.


igouy 53 days ago

> As long as we agree…

Close enough.

> … help people get the right impression.

The target audience wonder "Which programming language is fastest?"

A table or chart sorted by elapsed time is the answer they expect.

The target audience have various (perhaps un-examined) ideas about the question.

The sources and measurements can be a way to examine and discuss some of those ideas.


pron 52 days ago

OK, but if the measurements were wider in scope they could at least offer a more interesting, well-rounded, and perhaps even relevant basis for discussion (even if the other flaws, which are harder to fix, remained).


igouy 52 days ago

Once upon a time, I might have imagined that would be so. Now it seems more like squeezing a lemon: there's hardly any more after the first squeeze.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


pron 51 days ago

I think you're referring to the less important point I made. Correcting for apples-to-apples is harder and less valuable. Having more domain coverage is easier and more valuable (especially since the current coverage is so narrow and largely irrelevant to most software).

BTW, what we do is compare our suite of microbenchmarks to our (much smaller) suite of macrobenchmarks. This way we get at least some sense of how relevant the microbenchmarks are (i.e. we're looking at the correlation of the deltas). Some microbenchmarks are more correlated with the macrobenchmarks than others. If an optimisation helps some microbenchmarks that we think are not representative of many programs and doesn't help with any macrobenchmark, we take it out.

Just to give an example, we may want to measure some optimisation that helps some allocation pattern. Sometimes it turns out that if that pattern is diluted by other allocation patterns the program does for other tasks, the advantage is completely erased. Some optimisations in free-list allocators are particularly susceptible to this: if your program allocates only in this specific way, it will be super fast. If, in addition, there are some sporadic allocations that follow a different pattern, then after an hour you'll see performance start to drop.
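
(Editor's sketch, not pron's actual tooling: the "correlation of the deltas" idea above could look roughly like the following. It assumes each candidate optimisation yields one relative speed-up per microbenchmark and one averaged speed-up across the macrobenchmarks. The names `pearson`, `rank_microbenchmarks`, `alloc_burst`, and `mixed_workload`, and all the numbers, are hypothetical.)

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        # A constant series carries no signal; treat it as uncorrelated.
        return 0.0
    return cov / sqrt(vx * vy)


def rank_microbenchmarks(micro_deltas, macro_deltas):
    """Rank microbenchmarks by how closely their deltas track the macro deltas.

    micro_deltas: {name: [relative change per candidate optimisation]}
    macro_deltas: [mean relative change across macrobenchmarks, same order]
    """
    scores = {
        name: pearson(deltas, macro_deltas)
        for name, deltas in micro_deltas.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: fractional speed-up for four candidate optimisations.
macro = [0.02, 0.00, 0.05, -0.01]
micro = {
    # Improves a lot every time, whether or not the macrobenchmarks move.
    "alloc_burst": [0.30, 0.25, 0.28, 0.31],
    # Moves roughly in step with the macrobenchmarks.
    "mixed_workload": [0.03, 0.00, 0.06, -0.02],
}

ranking = rank_microbenchmarks(micro, macro)
for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

With this made-up data, `mixed_workload` ranks above `alloc_burst`: the latter always looks good in isolation, but its gains do not track anything the macrobenchmarks see, which is the kind of microbenchmark the comment suggests discounting.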


igouy 50 days ago

> apples-to-apples

Hopefully, some of the target audience might try to confirm that programs are what they think of as "comparable".

> Having more domain coverage is easier and more valuable…

So where are the examples of that being done? (It's been decades.)
