| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by tinus_hn 1152 days ago
	The replies to this command are top HackerNews: every commenter is a bigger expert than Donald Knuth but nobody quotes any actual benchmark results to go with their theories.

2 comments

tjoff 1152 days ago

It isn't fair to assume the same rigor on a comment.

But, you don't have to be a bigger expert than Knuth to dismiss an optimization done for hardware say 40 years ago (the circumstances around this particular case I don't know).

Even in that case though it might still be relevant for embedded CPUs.

link

tinus_hn 1152 days ago

So it’s just some unfounded handwaving? You can just dismiss one of the greatest minds in computing science by just blathering in a comment, because then it’s ‘unfair to assume rigor’?

If it is all so clear and all the armchair experts here have ample experience in the field like they pretend, why is it so hard to run a few benchmarks?

link

vidarh 1151 days ago

It's not just handwaving. It's knowing the context. As late as the mid 80's it was not unusual for multiplication to take tens of cycles if your CPU even had a built in multiplication instruction (e.g. on the M68000 in the Amiga, a MULU - unsigned 16 bit -> 32 bit multiplication - took from 38 to 70 cycles plus any cycles required for memory access; if you needed to multiply numbers larger than 16 bit to depend on the overflow like in the case of this article, you'd need multiple instructions anyway), far worse if it didn't, while memory loads were often cheap in comparison (on the M68000 a memory read indirect via an address register could be down to 8 cycles), and so until years after that it did make sense to do all kinds of things to reduce the need for multiplication. But it doesn't any more.

While it'd be worthwhile doing tests to confirm a specific case, the default assumptions have changed: Today memory is slow and multiplication fast (in terms of cycles; in absolute terms both are of course far faster).

You certainly should not today pick a more complex hashing scheme to try to avoid a multiplication without carefully measuring it just because it was discussed even by someone as smart as Knuth in a context where the relative instruction costs where entirely different.

If you're actually using the function as the primary hash function, then the distribution of the output might well make up for significant performance difference, so this is not to suggest that tabulation hashing isn't a worthwhile consideration.

link

tjoff 1152 days ago

How is it unfounded? How is it unreasonable to question the relevance of an optimization from a time where the relevant parts of computing were completely different?

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.", Donald Knuth

Maybe doing benchmarks for a comment isn't worthwhile. I guarantee you have to do different benchmarks for different contexts anyway so I can't blindly trust the benchmark anyway. Not to say it wouldn't be interesting.

I wouldn't mind someone plotting the cost of instructions over time and how that affects choice of algorithms. But to expect that from a comment?

link

tinus_hn 1151 days ago

So here’s a paper where the author has run the experiment:

https://arxiv.org/pdf/1011.5200.pdf

And the answer is that the speed is comparable to other functions that however produce worse results.

link

vidarh 1151 days ago

A couple of issues with that interpretation:

1. Dated hardware, so the hashing algo speed comparison is no longer relevant without redoing it, but even on hardware that old a 2.2x-2.8x speed advantage for mul+shift is substantial.

2. No tests with contention for the cache; no tests with different table sizes; no code given. As a result it's impossible to tell if the performance numbers are relevant and realistic.

3. If they could demonstrate substantially better distribution, it might still be very worthwhile despite how much slower it is, but they test the hashing algorithms with runs of 100 random constants. We don't know if any of those constants are any good because they've not given them, but odds are highly against 100 random constants even approaching good. As such the comparisons of tabulation hashing with the other hashing method is meaningless in terms of performance (but see below) - it's trivial to find constants for multiplication + shift that produces pathologically bad outcomes.

What the paper does appear to show is that tabulation hashing might have more predictable runtime given the result on the specific set of structured input they test with, and that might well be a good reason to use it for some applications.

But that is tainted by the lack of transparency in what they've actually compared against.

(This is also mostly relevant if you considering using a multiplication-shift based hash function, which is also not what the original article is advocating you use Fibonacci hash for)

link

tinus_hn 1150 days ago

The hardware is not that dated, they specifically target then modern hardware that has the fast multiplication everyone is on about. And while multiplication has gotten faster, of course the size of the l1 cache has also grown. It is per-core so there is not going to be a whole lot of contention going on. A modern budget cpu has about 32kb per core so it’s not going to be a squeeze.

I am no expert in math but the algorithm is claimed to be better than another one because it is in a class that is better than the class the other is in. It’s not because a run of 100 shows some distribution.

You have a lot of demands for exhaustive testing but when you are asked to provide the same, it’s all too much to ask. ‘I wouldn’t mind someone plotting graphs’ yeah thanks, I wouldn’t mind someone else doing the work.

Then again someone else probably has done the work more recently or more in line with what you want to see. I found this paper in a few minutes or websearching, I’m sure you can spare the time to find a better one.

link

NohatCoder 1150 days ago

It is funny, the article claims to test the 64 bit code on "Dual-core Intel Xeon 2.6 GHz 64-bit processor with 4096KB cache". That is a really poor description, as it does not tell us what architecture the processor is. But one can go through a list of all Intel Xeon processors to find the ones that match the description. Turns out that there are none.

If we broaden the search to 2.66 GHz processors there are 4: 5030, 5150, 3070 and 3075. All released in 2006 and 2007. This means it is either one of the last "NetBurst" CPUs or one of the first "Core" CPUs. Assuming "Core" the relevant operation has a 5 clock latency, as best I can tell. This is down to 3 clocks on pretty much all modern X86 CPUs. Modern CPUs also get an extra load port, so I doubt the relative difference is much different on modern CPUs.

Overall it looks like a pretty bad benchmark, thrown into a paper on collision likelihood, which itself looks like an academic exercise with no relevance for the real world.

link

xxs 1152 days ago

Embedded CPUs would have much worse memory access latency, and a lot less memory to spare - so if anything wasting memory on tables is likely to perform worse as well.

link

tjoff 1152 days ago

How do you mean? Measured in cycles embedded devices typically have less latency to SRAM.

link

xxs 1151 days ago

You're right, I guess. On devices where the memory (sram) tends not to have its own clock (and there is no OOO), it can effectively be one cpu cycle. PIC and ESP-32 comes to mind. If there is 'extra' not on chip memory, it'd be way worse, of course.

link

eru 1152 days ago

Why? Eg https://news.ycombinator.com/item?id=35760954 quotes some (vague) numbers.

link