Hacker News new | ask | show | jobs
by blueplanet 3998 days ago
Can someone explain to me why using a cached value for hashCode in an immutable string is a bad thing?
4 comments

I don't think he refers to the caching being a bad thing, but that it probably does not affect performance that much, since most Strings are likely to be small (and the number of times hashcode is called is also likely to be small).

There's also a catch : if the hashcode of the string is 0, the hashcode will be recalculated every time (since the code assumes it has not been cached yet).

>There's also a catch : if the hashcode of the string is 0, the hashcode will be recalculated every time (since the code assumes it has not been cached yet).

At least that part should be easy to fix by defining a hash function that returns numbers != 0 - even the article says they did it for JVM7 but it's gone in JVM8 - with no explanation ?

The hash32 implementation in Java 7 was not intended to fix the case where the actual hash value was zero, nor did it have an impact on that case as the hash32 value was stored in a separate field.

It was made to decrease the number of hash value collisions in large data structures (hash maps and such). Its replacement is described here: http://openjdk.java.net/jeps/180 . Given that most Strings never end up in a large collection, allocating an extra 4 bytes for every one of them was a waste.

This post has been edited once for factual correctness.

It's hard to imagine a situation where the hash function would be so slow as to justify adding 4 bytes to every string.
If your strings are large, then calculating the hash will take time AND the extra 4 bytes for hash storage will be minimal extra overhead.

You can easily find good & bad cases for all of these string implementations. They all have tradeoffs.

Can just bit-or 1.
this halves the number of values. doing if (hashcode == 0) hashcode = 1; removes only one value.
I assume it's because most of the standard Java collections (HashMap, HashSet) actually cache the hash codes themselves. So caching it again inside the object is a bit redundant.

However other data structures (notably the ones in guava), don't do this caching - which does hurt performance for any sort of heavy writes and/or complex object.

It's certainly not a bad thing, why do you think it might be?
It can be a bad thing if all programmers assume that the caching works, and it doesn't. That's where the bit in the article mentions strings that produce a hash code of zero. Zero is used as a sentinel for "hash code not yet calculated", so for those particular strings the code is recalculated every single time.
That's not worse than not caching at al. Hashing implementation usually is a black box. Actually simple if (h == 0) h = 0xdeadbeef would probably prevent that issue from happening (if that should be considered a real issue at all).
The only negative was a small memory overhead.