Hacker News new | ask | show | jobs
by saurik 3397 days ago
> Some language implementations have hash tables built-in; some provide a hash table as part of a collections library; some use a third-party hash table library. (For example, use either khash or CK_HT for C language k-nucleotide programs.) The hash table algorithm implemented is likely to be different in different libraries.

> Please don't implement your own custom "hash table" - it will not be accepted.

> The work is to use the built-in or library hash table implementation to accumulate count values - lookup the count for a key and update the count in the hash table.

The C++ implementation is thereby testing an old version of a non-standard extension of libstdc++ that I had never heard of and which was likely contributed once by IBM and never really looked at again (by either maintainers or users ;P), while the C implementation is testing the specified khash library, which is apparently something a number of people actively contribute to and attempt to optimize, giving it some notoriety.

If I were to do this in C++, and I wasn't allowed to use my hash table, I would almost certainly not be using __gnu_pbds::cc_hash_table<>. If I were to just want to use something "included", I would use the C++11 std::unordered_map<> (note that this code is compiled already as C++11). But we all know the STL is designed for flexibility and predictability, not performance, and the culture of C++ is "that's OK, as if you actually care about performance no off-the-shelf data structure is going to be correct". If I decided I wanted speed, I know I'd want to check out Folly, and I might even end up using khash.

Reading other comments, what happened here is the Rust version is now using some "experimental" hash table based on ongoing work to optimize the Python 3000 dict implementation. This is just not a useful benchmark. What we are benchmarking is "how maintained is the implementation's built in hash table and is it tunable for this particular workload".

That's why you should not be surprised to see Java doing so well: the code actually being written here is just some glue... your programming language has to be incompetent to do poorly at this benchmark (especially as many commenters here are using a "within a power of 2" rule of thumb). There are even multiple listings for the Java one, and the one that is faster is using it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap?!?

What we really should be asking here is: why is any language doing "poorly" in this benchmark? It just isn't surprising that Rust is competitive with C/C++, nor is it surprising that Java is also; what is surprising is that Swift, Go, Haskell, and C# are not, and so I bet the issue is something (such as "is allocating memory for a thing which is not required") that can be trivially fixed for each (though by the rules of engagement, it might... or might not :/ as Java "cheated", right? ;P... require a minor fix upstream).

I mean, since the "work" explicitly is not "write a hash table using nothing but primitives from this language", there is no particular reason why Perl and Python (which is using a non-destructive array .replace, which is likely brutal... again: I bet this is almost always a benchmark of ancillary memory allocations) should be doing as poorly as they are: if we all made "optimize for this benchmark" a top priority for a weekend hackathon, I bet we could get every open source language to nail this under 25s. But do we care?

3 comments

> If I were to do this in C++ … I would use the C++11 std::unordered_map<> …

http://benchmarksgame.alioth.debian.org/play.html#contribute

The thesis of my comment was that we should be surprised that any language does poorly on this benchmark, particularly ones that have similar kinds of targeting, and that if we cared about this benchmark (and I claim we don't), we should all pitch in, possibly upstream to fix various languages and their standard libraries to nail this benchmark. However, I also believe the rules of this benchmark are awkward and even flawed, and that it isn't clear to me that it is worth anyone's time to do that.

I am not lamenting that someone should spend more time on this: I am lamenting that tons of people seem to care about it at all, it is not a "microbenchmark" (as some are calling it), and I think the main lesson we can learn from it is "there is something subtely wrong, either with the implementation that was contributed for this benchmark, the language's runtime, or it's standard library", as given these rules we really should expect every language to be similarly in performance.

And so, if we all cared about this benchmark, I bet we could figure out what is going on and get every open source language down under 25s. Past that point, I think the rules are such that this isn't even a fun game much less a useful metric of anything worth measuring, and you are probably wasting your time contributing. I guess, to make this subthread go somewhere: why do you disagree?

> it is not a "microbenchmark"

Home page -- "Will your toy benchmark program be faster if you write it in a different programming language? It depends how you write it!"

http://benchmarksgame.alioth.debian.org/

Whatever the motivation for your comments, they did remind me that I'd intended to have background information URLs on other pages (not just the home page).

So thanks for that!

> it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap?!?

That's just a library that provides data structures without boxing of int,double... in Java. This eliminates a lot of overhead.

Oh, I absolutely understand why it is interesting and fast: what I don't understand is how it satisfies the rules, as it is effectively "some random project with a hashtable". Is this particular one so famous that it should be allowed instead of java.util.HashMap? Can I just publish my C++ hashtable and then rely on it, in which case the rule makes no sense? That's why I tack on "?!".
> instead of java.util.HashMap

Not "instead of" as-well-as.

This entire subthread is clearly a digression ;p, but why do you say that? The rules were about using a built-in or standard collection: this Java implementation does not use HashMap and instead uses a random project with a better data structure for this use case. This is a loophole to bypass the restriction on writing your own hashtable: you just have to publish it first? ;P
Those are different implementations; I know those exist, and which is why I had specifically said there were multiple implementations and pointed at the "faster one", which is obviously the one most people are going to be paying attention to, and which is the one that is also most relevant as we could and probably should expect the implementations for other languages to use some crazy one-off library: it demonstrates and undermines the limitations placed on the various languages seem arbitrary enough that you can't read into these results as being indicative of the languages themselves, but simultaneously that we shouldn't even care as the fast Java entry effectively "cheated".

I am going to take a step back: you seem irked by my comments, due to the knee-jerk link to have me contribute a different implementation for C++ as well as arguing with the short reply what I really maintain is a nearly-offtopic point about the Java library in question here (both of which I am reading as slightly aggressive towards me or my comment).

Do you really like this benchmark? Are you super excited that Rust is slightly faster right now than C/C++/Java? What is going through your head when you read my comments? Do you disagree that we should expect all native-ish languages to do equally-enough well for most people given these constraints? Is it that my comment is coming off to you as "no fun allowed", as I am trying to point out that this is a meaningless competition that mostly teaches us about different things than "who is winning"?

In fact, looking at your recent comment history, I am realizing you are being super aggressive about this with everyone, and are nigh unto spamming the play.html link to everyone who tries to comment about the high-level idea of "what are we doing here". What's up?