Hacker News new | ask | show | jobs
by throwaway2037 375 days ago

    > where linear search beats constant time maps
Can you give an example? You said lots of good things in your post, but I struggling to believe this one. Also, it would help to see some wall clock times or real world impact.
2 comments

Pick any compiled language and test it. Pick an algorithm making heavy use of a small (<10, maybe up to a hundred elements) hashset, and try using a linear structure instead. The difference will be most apparent with complicated keys, but even strings of more than a few characters should work.

Some example workloads include:

1. Tokenization (checking if a word is a keyword)

2. Interpretation (mapping an instruction name to its action)

3. ML (encoding low-cardinality string features in something like catboost)

4. JSON parsers (usually key count is low, so parse into a linear-scan hashmap rather than a general-purpose hashmap)

Details vary in the exact workload, the hardware you're using, what other instructions you're mixing in, etc. It's a well-known phenomenon though, and when you're doing a microoptimization pass it's definitely something to consider. 2x speedups are common. 10x or more happen from time to time.

It's similar to (but not _quite_ the same as) the reason real-world binary search uses linear scans for small element counts.

When you go to really optimize the system, you'll also find that the linear scan solution is often more amenable to performance improvements from batching.

As to how much it matters for your composite program? Even at a microoptimization level I think it's much more important to pay attention to memory access patterns. When we wrote our protobuf parser that's all we really had to care about to improve performance (33% less execution time for the entire workload, proto parsing being much better than that). You're much more likely to be able to write sane code that way (contrasted with focusing on instructions and whatnot first), and it's easier to layer CPU improvements on top of a good memory access pattern than to go the other way around.

You've got to keep in mind that computers aren't the 1-instruction-at-a-time purely sequential machines anymore.

Let's say you've got a linear array of bytes, and you want to see if it contains a specific value. What would a modern CPU need to execute? Well, we can actually compare 64 values at a time with _mm512_mask_cmp_epu8_mask! You still need a little bit of setup and a final "did any of them match" check, of course. Want to compare 512 values? You can probably do that in less than 10 clock cycles with modern machines

Doing the same with a hash set? Better make sure that hash algorithm is fast. Sure it's O(1), but if calculating the hash takes 20 cycles it doesn't matter.

This is a good point.

A string search algorithm that uses SIMD to do quickly discard a majority of 16, 32 or 64 attempts in parallel, and then verify the surviving ones quadratically (again 16, 32 or 64 bytes at a time) can go a very long way against a sublinear algorithm that understands needle structure, but necessarily needs to process the haystack one byte at a time.