

by pron 42 days ago

I think you're referring to the less important point I made. Correcting for apples-to-apples is harder and less valuable. Having more domain coverage is easier and more valuable (especially since the current coverage is so narrow and largely irrelevant to most software).

BTW, what we do is compare our suite of microbenchmarks to our (much smaller) suite of macrobenchmarks. This gives us at least some sense of how relevant the microbenchmarks are (i.e. we look at how well the deltas correlate). Some microbenchmarks are more correlated with the macrobenchmarks than others. If an optimisation helps some microbenchmarks that we think are not representative of many programs and doesn't help any macrobenchmark, we take it out.
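
A minimal sketch of that delta-correlation check, not OpenJDK's actual tooling. The benchmark names and every number are invented for illustration, and using Pearson correlation is an assumption about what "correlation" means here.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: % change in each benchmark's score for the same five
# candidate optimisations. Names and numbers are made up.
micro_deltas = {
    "alloc_small_objects": [4.0, 0.5, 6.0, -1.0, 3.0],
    "string_concat_loop": [9.0, -2.0, 0.0, 7.0, -3.0],
}
# Hypothetical % change of the macrobenchmark suite for the same optimisations.
macro_deltas = [2.0, 0.2, 3.1, -0.4, 1.4]

# How closely each microbenchmark's deltas track the macro deltas.
scores = {name: pearson(d, macro_deltas) for name, d in micro_deltas.items()}

for name, r in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: r = {r:+.2f}")
```

A microbenchmark whose deltas barely track the macro suite (the second one, in this made-up data) is the kind you would treat as weak evidence on its own.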

Just to give an example, we may want to measure some optimisation that helps some allocation pattern. Sometimes it turns out that if that pattern is diluted by other allocation patterns the program does for other tasks, the advantage is completely erased. Some optimisations in free-list allocators are particularly susceptible to this: if your program allocates only in this specific way, it will be super fast. If, in addition, there are some sporadic allocations that follow a different pattern, then after an hour you'll see performance start to drop.
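
To make the dilution effect concrete, here is a toy model, assuming a first-fit free list with no coalescing (a deliberate simplification, not any real runtime's allocator). The class, the size mix, and the heap size are all hypothetical. It counts how many free-list nodes each allocation has to inspect, early in the run and late in the run.

```python
import random

class FirstFitFreeList:
    """Toy first-fit allocator without coalescing; counts nodes inspected."""

    def __init__(self, heap_size):
        self.free = [(0, heap_size)]  # (offset, size), most recently freed first
        self.steps = 0                # total free-list nodes inspected

    def alloc(self, size):
        for i, (offset, block_size) in enumerate(self.free):
            self.steps += 1
            if block_size >= size:
                if block_size == size:
                    self.free.pop(i)  # exact fit: node disappears
                else:
                    # split: the remainder stays in place as a smaller fragment
                    self.free[i] = (offset + size, block_size - size)
                return (offset, size)
        raise MemoryError("toy heap exhausted")

    def release(self, block):
        self.free.insert(0, block)  # LIFO reuse of freed blocks


def run(odd_fraction, iterations=20_000, live=1_000, window=2_000, seed=1):
    """Return average nodes inspected per allocation (early window, late window)."""
    rng = random.Random(seed)
    heap = FirstFitFreeList(heap_size=10**7)  # large enough to never run out here

    def next_size():
        if rng.random() < odd_fraction:
            return rng.choice([16, 32, 48, 96, 128, 192])  # sporadic odd sizes
        return 64  # the pattern the allocator handles well

    blocks = [heap.alloc(next_size()) for _ in range(live)]
    per_alloc = []
    for _ in range(iterations):
        victim = rng.randrange(live)      # free a random live object...
        heap.release(blocks[victim])
        before = heap.steps
        blocks[victim] = heap.alloc(next_size())  # ...and allocate a new one
        per_alloc.append(heap.steps - before)
    return sum(per_alloc[:window]) / window, sum(per_alloc[-window:]) / window


uniform_early, uniform_late = run(odd_fraction=0.0)
mixed_early, mixed_late = run(odd_fraction=0.1)
print("uniform:", uniform_early, uniform_late)
print("mixed:  ", mixed_early, mixed_late)
```

With only one size, every allocation reuses the block that was just freed, so the cost stays flat. With a small share of odd sizes, fragments pile up in the list and the later allocations scan further than the early ones, which is the "fine at first, slower after a while" shape described above.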


igouy 42 days ago

> apples-to-apples

Hopefully, some of the target audience might try to confirm that programs are what they think of as "comparable".

> Having more domain coverage is easier and more valuable…

So where are the examples of that being done? (It's been decades.)


pron 40 days ago

> So where are the examples of that being done?

Whenever people want to get valuable information. As I said, we in OpenJDK have a couple hundred benchmarks, some macro, many micro, which are meant to give decent coverage of the things that affect performance.

If a website wants to group results by languages, it should think about performance from the perspective of how languages work (which includes compilers, linkers, and runtimes).

For example, what compiler/linker optimisations are done can depend a lot on whether the program is in a single compilation unit or multiple (and in the case of C and C++ - it does).

On the runtime front, think about memory management. These mechanisms often have different behaviour depending on whether the objects are of similar size or not, whether they're allocated and freed by multiple threads or a single one, and whether the heap is "young" and unfragmented or old and fragmented.

Another area in runtimes is data structures. Are they single-threaded or concurrent, and if concurrent, how do they behave under low and high contention?

Some mechanisms, at all of these levels, have great performance under some conditions and not so great performance under others, and sometimes the condition where they perform great is one that is encountered less often in real programs.
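
One rough way to put a number on coverage is to treat the conditions listed above as axes and count how many combinations a suite actually exercises. This is only a sketch: the axis labels, the `coverage` helper, and the example suite are all hypothetical, and real conditions are not this cleanly separable.

```python
from itertools import product

# Axes taken from the conditions discussed above; the labels are illustrative.
DIMENSIONS = {
    "compilation_units": ["single", "multiple"],
    "object_sizes": ["similar", "varied"],
    "allocating_threads": ["one", "many"],
    "heap_state": ["young", "fragmented"],
    "contention": ["none", "low", "high"],
}

def coverage(suite):
    """Return (fraction of condition combinations covered, set of missing ones)."""
    all_cells = set(product(*DIMENSIONS.values()))
    covered = {tuple(bench[d] for d in DIMENSIONS) for bench in suite}
    return len(covered & all_cells) / len(all_cells), all_cells - covered

# Hypothetical suite: three small single-file programs that differ as problems
# but exercise the same conditions, plus one contended multi-threaded program.
small_program = {
    "compilation_units": "single", "object_sizes": "similar",
    "allocating_threads": "one", "heap_state": "young", "contention": "none",
}
suite = [
    dict(small_program),
    dict(small_program),
    dict(small_program),
    {**small_program, "allocating_threads": "many", "contention": "high"},
]

fraction, missing = coverage(suite)
print(f"covered {fraction:.1%} of combinations, {len(missing)} missing")
```

The three look-alike programs count once, so adding more of them does not move the number. Only a benchmark that lands in a new combination does.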

If you're asking what multi-lingual benchmark suites offer good coverage - I don't know. But that we don't have good information doesn't mean that it's good to offer bad information. Imagine that in American presidential elections there were no national polls and no polls in most states. Would having a poll only in Alabama or only in California offer good insight into who's likely to win? Probably not, because such a poll offers a very partial view of the situation. Is it better than nothing? Maybe, but not by much, because the outcome in Alabama and California is easy to predict without any polls, so it's only helpful in the most extreme cases.

My point is that bad information is bad information, and if people don't understand how different languages behave under different conditions (e.g. that the optimisations the compiler does can differ depending on whether the program is in a single file or not) then they can get the wrong impression. Imagine that someone has no idea about the regional polarisation in the US, and you tell them, well, there are 50 states, but since we don't have polls for all of them, here's the poll for Alabama. Is that information helpful at all?

In any event, any increase in the coverage makes the information a little better, and because the audience may not know whether multiple benchmarks exercise the same or different behaviour in the language, it's the role of the website to pick problems that trigger the different codepaths in the languages' infrastructure. Otherwise, it gives a false impression of variety, like saying we poll not only in Alabama but also in Mississippi. Or it's like testing the structure of a bridge by driving a car across it, and then doing it with ten different car models. Testing a bridge does require variety, but the different car models are not what triggers different conditions for the bridge.


igouy 39 days ago

> If you're asking what multi-lingual benchmark suites offer good coverage - I don't know.

In which case, given that the target audience wonders "Which programming language is fastest?", there doesn't seem to be support for your claim that: "Having more domain coverage is easier and more valuable…".

> Is it better than nothing? Maybe, but not by much…

The benchmarks game: provisional and modest.


pron 39 days ago

> There doesn't seem to be support for your claim that: "Having more domain coverage is easier and more valuable…".

Since programming languages specifically optimise differently for the different conditions I listed above, the "support" for my "claim" is that it's obviously true. No one who implements languages or runtimes will dispute it.

But I don't understand the logical implication. The existence or nonexistence of good information says nothing about the value of the information we do have. If all you know is how much cash I have in my wallet, the fact that no one has ever published how much money I have in my bank account doesn't make the information you have more relevant as an estimate of my wealth. That information is irrelevant regardless of whether or not you have access to the relevant information. That information being available is not what's needed to "support" my "claim" that what you know is irrelevant. All you need to know is how people keep their money.

> The benchmarks game: provisional and modest.

I would say it's more like a website comparing US presidential candidates through polls only in Alabama. A more appropriate description than "provisional and modest" would be that it doesn't actually give us valuable information about the candidates' chances.

If people know how US elections work, such information could be put in context, but I don't know how many programmers understand how languages and runtimes optimise performance. Merely saying it's partial/provisional/modest is insufficient to give people the appropriate context.


igouy 39 days ago

Sorry, I'm just not interested in wading through your analogies.


pron 33 days ago

My point is that the website offers very low coverage of the codepaths that affect language performance, and all it says is that the coverage is incomplete, without saying just how incomplete it is. Some readers don't know that, for example, optimisation in some languages can vary greatly depending on whether there's one compilation unit or several (most or all of the examples have one compilation unit, while virtually all programs have many). Nor do they know that memory management depends heavily on the variety of object sizes and may degrade over time (the examples all run for a short time and have a low variety of object lifetimes and sizes, while most real programs are different). Those readers can come to very wrong conclusions.

Sorry for yet another analogy, but it's just like me telling my boss that my test suite passes but is incomplete, without telling him that I've tested only 20% of the program's functionality.

The website should at least explain just how skewed the results are in favour of conditions that arise in small, short-lived programs without (competitive) concurrency and that larger and/or longer-lived and/or concurrent programs can exhibit very different behaviour. These conditions aren't a small matter. They're among the primary motivations for the huge investment over the last few decades in moving GCs and JIT compilers.
