Hacker News new | ask | show | jobs
by esprehn 375 days ago
Modern computers are pretty great at scanning small blocks of memory repeatedly, so n^2 can be faster than the alternative using a map in cases for small N.

I spent a lot of time fixing n^2 in blink, but there were some fun surprises:

https://source.chromium.org/chromium/chromium/src/+/main:thi...

For large N without a cache :nth-child matching would be very slow doing n^2 scans of the siblings to compute the index. On the other hand for small sibling counts it turned out the cache overhead was noticably worse than just doing an n^2. (See the end of the linked comment).

This turns out to be true in a lot of surprising places, both where linear search beats constant time maps, and where n^2 is better than fancy algorithms to compensate.

Memory latency and instruction timing is the gotcha of many fun algorithms in the real world.

3 comments

This is true. But unless you are sure that current and future inputs will always be small I find it is better to start with the algorithm that scales better. Then you can add a special case for small sizes if it turns up in a hot path.

This is because performance is typically less important for the fast/small case and it is generally acceptable for processing twice as much to be twice (or slightly more than twice) as slow, but users are far more likely to hit and really burned by n^2 algorithms in things you thought would almost always be small and you never tested large enough sizes in testing to notice.

I wrote more on this topic here https://kevincox.ca/2023/05/09/less-than-quadratic/

A lot of computations are really higher complexity order functions with stair steps at certain intervals based on hardware trying to pretend they are constant time. All operations cost the same amount until n doubles again and then it’s slower. If you zoom out toward infinity, the stair steps smooth out into a logarithmic curve. It becomes logarithmic in the former case and square root in the latter. Even dividing two numbers or doing a memory address lookup stops being C, which is part of why prime factoring worked for RSA for so long.

If anyone had made clockless logic work you would see that adding 1 + 1 is in fact faster than adding 2^63 + 1.

If you put enough data into a hash table the key length has to increase logarithmically to the table size in order to have distinct keys per record. Even Knuth points out that hash tables are really nlogn - something I’m pretty sure my CS professors left out. In multiple classes. Man, did I get tired of hash tables, but I see now why they harped on them. Case on point, this article.

    > where linear search beats constant time maps
Can you give an example? You said lots of good things in your post, but I struggling to believe this one. Also, it would help to see some wall clock times or real world impact.
Pick any compiled language and test it. Pick an algorithm making heavy use of a small (<10, maybe up to a hundred elements) hashset, and try using a linear structure instead. The difference will be most apparent with complicated keys, but even strings of more than a few characters should work.

Some example workloads include:

1. Tokenization (checking if a word is a keyword)

2. Interpretation (mapping an instruction name to its action)

3. ML (encoding low-cardinality string features in something like catboost)

4. JSON parsers (usually key count is low, so parse into a linear-scan hashmap rather than a general-purpose hashmap)

Details vary in the exact workload, the hardware you're using, what other instructions you're mixing in, etc. It's a well-known phenomenon though, and when you're doing a microoptimization pass it's definitely something to consider. 2x speedups are common. 10x or more happen from time to time.

It's similar to (but not _quite_ the same as) the reason real-world binary search uses linear scans for small element counts.

When you go to really optimize the system, you'll also find that the linear scan solution is often more amenable to performance improvements from batching.

As to how much it matters for your composite program? Even at a microoptimization level I think it's much more important to pay attention to memory access patterns. When we wrote our protobuf parser that's all we really had to care about to improve performance (33% less execution time for the entire workload, proto parsing being much better than that). You're much more likely to be able to write sane code that way (contrasted with focusing on instructions and whatnot first), and it's easier to layer CPU improvements on top of a good memory access pattern than to go the other way around.

You've got to keep in mind that computers aren't the 1-instruction-at-a-time purely sequential machines anymore.

Let's say you've got a linear array of bytes, and you want to see if it contains a specific value. What would a modern CPU need to execute? Well, we can actually compare 64 values at a time with _mm512_mask_cmp_epu8_mask! You still need a little bit of setup and a final "did any of them match" check, of course. Want to compare 512 values? You can probably do that in less than 10 clock cycles with modern machines

Doing the same with a hash set? Better make sure that hash algorithm is fast. Sure it's O(1), but if calculating the hash takes 20 cycles it doesn't matter.

This is a good point.

A string search algorithm that uses SIMD to do quickly discard a majority of 16, 32 or 64 attempts in parallel, and then verify the surviving ones quadratically (again 16, 32 or 64 bytes at a time) can go a very long way against a sublinear algorithm that understands needle structure, but necessarily needs to process the haystack one byte at a time.