Hacker News
by ultrablack 1152 days ago
Multiplication is bad. Knuth actually also describes a hash function using random numbers and XOR. It's 10x faster than the modulo, and I believe Mikkel Thorup proved it optimal.

The idea is roughly:

Say you have a hashtable of size 1024. You then create x uint arrays of size 256. These arrays you fill up with random numbers 0-1023.

To get your hash value, you take your input and, for i=0..x-1, take byte k=input[i] and look up the value in array[i][k]. These lookup values are then XORed together, giving a final random value between 0-1023 ready for indexing into the hash array.

No modulos. No multiplications. You only have to redo the random tables when the size changes from say 1024 to 2048. Easy peasy. Superfast.
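
For anyone who wants to see the scheme spelled out, here is a minimal Python sketch of what the comment describes. The names (make_tables, tabulation_hash), the 4-byte key length and the seed are assumptions made for illustration, not something from Knuth or Thorup. Since the table size is a power of two, XORing values in 0-1023 always stays in 0-1023.

```python
import random

TABLE_BITS = 10   # hash table of size 1024, as in the comment
NUM_BYTES = 4     # "x" in the comment; 4-byte keys are an assumption

def make_tables(num_bytes=NUM_BYTES, bits=TABLE_BITS, seed=None):
    """Build one table of 256 random values (0 .. 2**bits - 1) per key byte."""
    rng = random.Random(seed)
    return [[rng.randrange(1 << bits) for _ in range(256)]
            for _ in range(num_bytes)]

def tabulation_hash(key: bytes, tables) -> int:
    """XOR together one table entry per key byte.

    The key must not be longer than the number of tables.
    """
    h = 0
    for i, k in enumerate(key):
        h ^= tables[i][k]
    return h

tables = make_tables(seed=1234)   # hypothetical seed, fixed so runs repeat
key = (0xDEADBEEF).to_bytes(NUM_BYTES, "little")
slot = tabulation_hash(key, tables)   # index into the 1024-entry hash array
```

Growing the hash array from 1024 to 2048 would mean calling make_tables again with bits=11, which is the "redo the random tables" step.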

8 comments

“Superfast”, until you blow through your L1 cache, which happens pretty early on if you need 1 kB of table per byte in your key.
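
The 1 kB figure is easy to check. A rough back-of-the-envelope sketch, assuming 32-bit table entries and a 32 KiB L1 data cache (the cache size is an assumption and varies by CPU):

```python
ENTRIES_PER_TABLE = 256     # one entry per possible byte value
BYTES_PER_ENTRY = 4         # assumes 32-bit uint entries
L1D_BYTES = 32 * 1024       # assumed L1 data cache size, varies by CPU

# Table memory needed for each byte of the key.
table_bytes = ENTRIES_PER_TABLE * BYTES_PER_ENTRY

# Key length at which the tables alone would occupy the whole L1 data cache.
key_bytes_to_fill_l1 = L1D_BYTES // table_bytes

print(table_bytes, key_bytes_to_fill_l1)
```

Under those assumptions the tables for a 32-byte key already take up the entire L1 data cache, before any other data is touched.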

Even in the L1 cache, it's hard to beat the mul: A multiplication (which can hash multiple bytes in the case of Fibonacci hashing) has 3 cycles latency on modern x86. A single load, even from L1, is 5, I believe.
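
For comparison, a minimal sketch of Fibonacci hashing as it is usually described: multiply the key by 2^64 divided by the golden ratio, let the product wrap at 64 bits, and keep the top bits. The function name and the 10-bit default are assumptions for illustration. Python integers do not overflow, so the mask stands in for the wraparound a 64-bit register gives for free.

```python
FIB_MULT_64 = 0x9E3779B97F4A7C15   # 2**64 / golden ratio, rounded to an odd integer
MASK_64 = (1 << 64) - 1

def fibonacci_hash(key: int, bits: int = 10) -> int:
    """Map a 64-bit key to a slot in a table of size 2**bits."""
    product = (key * FIB_MULT_64) & MASK_64   # emulate 64-bit wraparound
    return product >> (64 - bits)             # keep the top `bits` bits

slots = [fibonacci_hash(k) for k in range(8)]
```

One multiplication and one shift hash all 8 bytes of the key at once, which is the point the comment makes about latency.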

How much of a problem 3 cycles of latency is depends on what else your processor is doing. It might not be any problem at all.
Word. It depends on access patterns of course but yeah, L1 is a valuable resource.
Modern CPU cores can perform a multiplication and addition every clock tick. Heck, I'd expect a modern Zen4 core to be able to do like 4 parallel 64-bit multiplications per clock tick on its integer pipelines, and maybe 32x parallel 32-bit multiplications per clock tick on its vector pipelines.

Multiplications were bad 40 years ago, but the year 2020 called and FMAC is incredibly optimized today.

You should still avoid integer division (floating point division is commonly optimized as reciprocal and then multiply). But multiplications are really really fast at least as far back as 2008 or so.

-------

I'm pretty sure multiplication's latency is only 5 clocks, but with all the out of order processing that occurs on modern cores, latency of just 5 ticks is rarely the bottleneck. (A DDR4 memory load is like 200+ cycles of latency. You shouldn't even worry about 5 cycles like multiplication, especially because those out of order cores will find some work to parallelize in that time).

-----

> you lookup the value in array[i][k]

You know a L1 cache lookup these days is like 4 cycles of latency right? And I'm pretty sure you have fewer load/store units than multiplication units. So a load/store, even to L1 cache, might use more resources than the multiply.

Might, I'd have to benchmark to be sure.

Indeed, division just doesn't have a parallel algorithm, unlike mul and add. So it's bound to be 'slow'. About 2008: Intel Core 2 (2006) had a 3-cycle mul. Edit: Pentium Pro (1995)'s imul was 4 cycles. 386's imul was slow, though.
This sounds like Zobrist hashing, or something related: https://en.wikipedia.org/wiki/Zobrist_hashing

" Zobrist hashing is the first known instance of the generally useful underlying technique called tabulation hashing. "

so to https://en.wikipedia.org/wiki/Tabulation_hashing

"

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; [...]

Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is 3-independent: [...]

Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including hopscotch hashing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections.

"

further

" Method: The basic idea is as follows:

First, divide the key to be hashed into smaller "blocks" of a chosen length. Then, create a set of lookup tables, one for each block, and fill them with random values. Finally, use the tables to compute a hash value for each block, and combine all of these hashes into a final hash value using the bitwise exclusive or operation.[1]

"

How come mul is bad? It has low latency: Skylake had 3 cycles per imul, and mul r32 a single cycle. Div is bad but mul is great.

edit: Memory access (along with div) is pretty much the only slow operation in modern CPUs. Extra pressure on L1 just to get a random number is not smart at any rate. Heck, Marsaglia's xorshift (random) is likely cheaper than accessing L1, and very likely all of its latency would be hidden behind the memory access.

Multiplication was bad on decades-old CPUs.
Others have hinted at this, but to be clear: This algorithm is slow, even in the optimal case where the tables are in cache. On new X86 CPUs it is theoretically limited to less than 2 bytes per cycle. Probably somewhere around 1.5 for an implementation that loads 8 bytes of input at once and shifts through them in order to limit pressure on the load ports.

Even without getting into SIMD algorithms you could load 8 bytes at a time and pretty easily go faster than that, possibly while using the multiplication instruction for mixing.

This of course ignores that we are not looking up values from a hash table in a vacuum. Other code will also be competing for the cache, and that generally means that everything runs slower because of more cache misses.

How do you fill the array though? Wouldn't filling it with random numbers give you a different hash each time you rebuild the hash function? I can see it being useful for a short-lived data structure, but you wouldn't be able to use it as a shared deterministic hash function?
For in-memory tables you rarely need determinism across instances, let alone runs.
Why couldn't you fill it deterministically?
I guess that was my question indeed. In the sense of how do you do it in practice? I suppose there are pseudorandom algorithms that can be easily applied.
Pseudorandom bit sequences (PRBS) are deterministic and very easy to implement (just linear feedback shift registers).
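
A minimal sketch of filling the lookup tables deterministically from an LFSR, assuming the textbook 16-bit Fibonacci LFSR with taps 16, 14, 13 and 11. The function names, the seed and the table dimensions are hypothetical. This only shows that the fill can be reproducible; LFSR output is linear, so nothing here says the resulting hash keeps the guarantees proved for truly random tables.

```python
def lfsr16_bits(state=0xACE1):
    """Yield bits from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11).

    The state must be nonzero, otherwise the register stays at zero forever.
    """
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield bit

def make_tables_deterministic(num_bytes=4, bits=10, seed=0xACE1):
    """Fill num_bytes tables of 256 entries, each built from `bits` LFSR bits."""
    stream = lfsr16_bits(seed)

    def next_value():
        v = 0
        for _ in range(bits):
            v = (v << 1) | next(stream)
        return v

    return [[next_value() for _ in range(256)] for _ in range(num_bytes)]

tables_a = make_tables_deterministic()
tables_b = make_tables_deterministic()   # same seed gives the same tables
```

Two processes that agree on the seed end up with identical tables, so the hash function can be shared without shipping the tables themselves.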

Something similar is actually done in communication systems, with scrambling, to prevent long strings of transmitted ones or zeros (which cause issues for some of the hardware components). Essentially you just add or multiply the data with the PRBS sequence. At the receiver you just do the reverse operation.
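
A small sketch of the additive variant, where "add" means addition modulo 2, i.e. XOR. The LFSR taps, the seed and the function names are assumptions for illustration and are not taken from any particular communication standard.

```python
def prbs_bytes(n, state=0xACE1):
    """Return n bytes from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    out = bytearray()
    for _ in range(n):
        byte = 0
        for _ in range(8):
            bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

def scramble(data: bytes, seed=0xACE1) -> bytes:
    """Additive scrambler: XOR the data with the PRBS.

    XOR is its own inverse, so the receiver calls the same function.
    """
    return bytes(d ^ p for d, p in zip(data, prbs_bytes(len(data), seed)))

payload = bytes(16)          # a long run of zeros
sent = scramble(payload)     # what goes over the wire
received = scramble(sent)    # the receiver undoes it with the same PRBS
```

The all-zero payload goes out as the PRBS itself, so the long run of zeros never reaches the hardware.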

The replies to this comment are top HackerNews: every commenter is a bigger expert than Donald Knuth but nobody quotes any actual benchmark results to go with their theories.
It isn't fair to assume the same rigor on a comment.

But, you don't have to be a bigger expert than Knuth to dismiss an optimization done for hardware say 40 years ago (the circumstances around this particular case I don't know).

Even in that case though it might still be relevant for embedded CPUs.

So it’s just some unfounded handwaving? You can just dismiss one of the greatest minds in computing science by just blathering in a comment, because then it’s ‘unfair to assume rigor’?

If it is all so clear and all the armchair experts here have ample experience in the field like they pretend, why is it so hard to run a few benchmarks?

It's not just handwaving. It's knowing the context. As late as the mid 80's it was not unusual for multiplication to take tens of cycles if your CPU even had a built in multiplication instruction (e.g. on the M68000 in the Amiga, a MULU - unsigned 16 bit -> 32 bit multiplication - took from 38 to 70 cycles plus any cycles required for memory access; if you needed to multiply numbers larger than 16 bit to depend on the overflow like in the case of this article, you'd need multiple instructions anyway), far worse if it didn't, while memory loads were often cheap in comparison (on the M68000 a memory read indirect via an address register could be down to 8 cycles), and so until years after that it did make sense to do all kinds of things to reduce the need for multiplication. But it doesn't any more.

While it'd be worthwhile doing tests to confirm a specific case, the default assumptions have changed: Today memory is slow and multiplication fast (in terms of cycles; in absolute terms both are of course far faster).

You certainly should not today pick a more complex hashing scheme to try to avoid a multiplication without carefully measuring it just because it was discussed even by someone as smart as Knuth in a context where the relative instruction costs were entirely different.

If you're actually using the function as the primary hash function, then the distribution of the output might well make up for significant performance difference, so this is not to suggest that tabulation hashing isn't a worthwhile consideration.

How is it unfounded? How is it unreasonable to question the relevance of an optimization from a time where the relevant parts of computing were completely different?

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.", Donald Knuth

Maybe doing benchmarks for a comment isn't worthwhile. I guarantee you have to do different benchmarks for different contexts anyway so I can't blindly trust the benchmark anyway. Not to say it wouldn't be interesting.

I wouldn't mind someone plotting the cost of instructions over time and how that affects choice of algorithms. But to expect that from a comment?

So here’s a paper where the author has run the experiment:

https://arxiv.org/pdf/1011.5200.pdf

And the answer is that the speed is comparable to other functions that however produce worse results.

A couple of issues with that interpretation:

1. Dated hardware, so the hashing algo speed comparison is no longer relevant without redoing it, but even on hardware that old a 2.2x-2.8x speed advantage for mul+shift is substantial.

2. No tests with contention for the cache; no tests with different table sizes; no code given. As a result it's impossible to tell if the performance numbers are relevant and realistic.

3. If they could demonstrate substantially better distribution, it might still be very worthwhile despite how much slower it is, but they test the hashing algorithms with runs of 100 random constants. We don't know if any of those constants are any good because they've not given them, but odds are highly against 100 random constants even approaching good. As such the comparisons of tabulation hashing with the other hashing methods are meaningless in terms of performance (but see below) - it's trivial to find constants for multiplication + shift that produce pathologically bad outcomes.
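
To make the point about bad constants concrete, here is a minimal sketch of a 64-bit multiply-shift hash with two different multipliers. The function name, the key range and the choice of 1 as the deliberately bad constant are assumptions for illustration; they are not the constants used in the paper, which does not list them.

```python
MASK_64 = (1 << 64) - 1

def mul_shift(key: int, multiplier: int, bits: int = 10) -> int:
    """Multiply, wrap at 64 bits, keep the top `bits` bits."""
    return ((key * multiplier) & MASK_64) >> (64 - bits)

keys = range(1024)

# Multiplier 1 leaves small keys unchanged, so their top bits are all zero
# and every key lands in slot 0.
bad_slots = {mul_shift(k, 1) for k in keys}

# 2**64 / golden ratio spreads the same keys over many slots.
good_slots = {mul_shift(k, 0x9E3779B97F4A7C15) for k in keys}

print(len(bad_slots), len(good_slots))
```

With the bad multiplier all 1024 keys collide in one slot, so how well a multiply-shift hash does in a comparison depends heavily on which constants were drawn.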

What the paper does appear to show is that tabulation hashing might have more predictable runtime given the result on the specific set of structured input they test with, and that might well be a good reason to use it for some applications.

But that is tainted by the lack of transparency in what they've actually compared against.

(This is also mostly relevant if you are considering using a multiplication-shift based hash function, which is also not what the original article is advocating you use Fibonacci hash for)

It is funny, the article claims to test the 64 bit code on "Dual-core Intel Xeon 2.6 GHz 64-bit processor with 4096KB cache". That is a really poor description, as it does not tell us what architecture the processor is. But one can go through a list of all Intel Xeon processors to find the ones that match the description. Turns out that there are none.

If we broaden the search to 2.66 GHz processors there are 4: 5030, 5150, 3070 and 3075. All released in 2006 and 2007. This means it is either one of the last "NetBurst" CPUs or one of the first "Core" CPUs. Assuming "Core" the relevant operation has a 5 clock latency, as best I can tell. This is down to 3 clocks on pretty much all modern X86 CPUs. Modern CPUs also get an extra load port, so I doubt the relative difference is much different on modern CPUs.

Overall it looks like a pretty bad benchmark, thrown into a paper on collision likelihood, which itself looks like an academic exercise with no relevance for the real world.

Embedded CPUs would have much worse memory access latency, and a lot less memory to spare - so if anything wasting memory on tables is likely to perform worse as well.
How do you mean? Measured in cycles embedded devices typically have less latency to SRAM.
You're right, I guess. On devices where the memory (SRAM) tends not to have its own clock (and there is no OOO), it can effectively be one CPU cycle. PIC and ESP-32 come to mind. If there is 'extra' off-chip memory, it'd be way worse, of course.
Why? Eg https://news.ycombinator.com/item?id=35760954 quotes some (vague) numbers.