Hacker News new | ask | show | jobs
by LgWoodenBadger 376 days ago
IME, it has always turned out to be the correct decision to eliminate any n^2 operation in anything I’ve written.

I don’t write exotic algorithms, but it’s always astounding how small n needs to be to become observably problematic.

7 comments

Bruce Dawson says: I like to call this Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.

https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...

The second law is that O(n * log n) is for practical intents and purposes O(n).
Skiena has a great table in his algorithms book mapping time complexity to hypothetical times for different input sizes.

For n of 10^9, where lgn takes 0.03 us and n takes 1 s, nlgn takes 29.9 s and n^2 takes 31.7 years.

more from table please?
I would rather have the table and related content. Name of the book?
It's probably The Algorithm Design Manual 2ed by Steven S. Skiena, figure 2.4

The second table on this [1] page is pretty similar, though not the same.

[1] https://a1120.cs.aalto.fi/notes/round-efficiency--bigoh.html

To be clear though, that isn't his second law, at least as of two months ago, according to https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
Fair, but `n log n` definitely is the historical "good enough to actually sleep at night" in my head, every time I see it I think of the prof who taught my first CSC course and our data structures course due to how often it came up.

Also, the wise statement that 'memory is fairly cheap compared to CPU for scaling'. It's insane to see how often folks would rather manually open and scan a 'static-on-deploy' 20-100MB Json file for each request vs just parsing it into structures in memory (where, for most cases, the in memory usage is a fraction of the json itself) and just caching the parsed structure for the length of the application.

Not often but occasionally I will chose the nlogn algorithm which obviously has no bugs over the O(n) algorithm with no obvious bugs.

Less brittleness is worth paying a few percent. Especially if it unmuddies the waters enough for someone to spot other accidental (time) complexity.

Considerably more than a few percent, IMHO. :)

But I also don't dabble in this area nearly enough to know whether there's years of tears and toil finding out repeatedly that O(n) is ~impossible to implement and verify :)

  | n   | n log n  |
  | 5   | 8.0472   |
  | 10  | 23.0259  |
  | 25  | 80.4719  |
  | 50  | 195.6012 |
  | 100 | 460.5170 |
It's also often in the range where constant factors can make a big difference over a wide range of n
Yes, that isn't actually Dawson's second law.
But sometimes a big enough C can flip which solution helps you hit your margins.
In my mind, that's always been the point in dropping log factors. The algorithms are comparable enough that the actual implementation starts to matter, which is all we're really looking for in a Big-O analysis.
I made the “mistake” in an interview of equating two super-quadratic solutions in an interview. What I meant was what Dawson meant. It doesn’t matter because they’re both too ridiculous to even discuss.
They’re too ridiculous… unless a more optimal solution does not exist
Absolutely not.

If the cost of doing something goes above quadratic, you shouldn't do it at all. Because essentially every customer interaction costs you more than the one before. You will never be able to come up with ways to cover that cost faster than it ramps. You are digging a hole, filling it with cash and lighting it on fire.

If you can't do something well you should consider not doing it at all. If you can only do it badly with no hope of ever correcting it, you should outsource it.

Chess engines faced worse than quadratic scaling and came out the other side…

Software operates in a crazy number of different domains with wildly different constraints.

I believe hinkley was commenting on things that are quadratic in the number of users. It doesn't sound like a chess engine would have that property.
All of modern Neural Network AI is based on GEMM which are O(n^2) algorithms. There are sub-cubic alternatives, but it's my understanding that the cache behavior of those variants mean they aren't practically faster when memory bound.

n is only rarely related to "customers". As long as n doesn't grow, the asymptotic complexity doesn't actually matter.

The GEMM is O(n^3) actually. Transformers are quadratic in the size of their context window.
That only matters when the constants are nontrivial and N has a potential to get big.

Not every app is a B2C product intending to grow to billions of users. If the costs start out as near-zero and are going to grow to still be negligible at 100% market share, who cares that it's _technically_ suboptimal? Sure, you could spend expensive developer-hours trying to find a better way of doing it, but YAGNI.

I just exited a B2B that discovered they invested in luxury features and the market tightened their belts by going with cheaper and simpler competitors. Their n wasn’t really that high but they sure tried their damnedest to make it cubic complexity. “Power” and “flexibility” outnumbered, “straightforward” and even “robust” but at least three to one in conversations. A lot of my favorite people saw there was no winning that conversation and noped out long before I did.

The devs voted with their feet and the customers with their wallets.

There are two many obvious exceptions to even start taking this seriously. If we all followed this advice, we would never even multiply matrices.
> no hope of ever correcting it

That's a pretty bold assumption.

Almost every startup that has succeeded was utterly unscalable at first in tons of technical and business ways. Then they fixed it as they scaled. Over-optimizing early has probably killed far more projects and companies than the opposite.

> That’s a pretty bold assumption.

That’s not a bold assumption it’s the predicate for this entire sidebar. The commenter at the top said some things can’t be done in quadratic time and have to be done anyway, and I took exception.

>> unless a more optimal solution does not exist

Dropping into the middle of a conversation and ignoring the context so you can treat the participants like they are confused or stupid is very bad manners. I’m not grumpy at you I’m grumpy that this is the eleventeenth time this has happened.

> Almost every startup

Almost every startup fails. Do you model your behavior on people who fail >90% of the time? Maybe you, and perhaps by extension we, need to reflect on that.

> Then we fixed it as we scaled

Yes, because you picked a problem that can be architected to run in reasonable time. You elected to do it later. You trusted that you could delay it and turned out to be right.

>> unless a more optimal solution does not exist

When the devs discover the entire premise is unsustainable or nobody knows how to make it sustainable after banging their heads against it, they quickly find someplace else to be and everyone wonders what went wrong. There was a table of ex employees who knew exactly what went wrong but it was impolitic to say. Don’t want the VCs to wake up.

Not all n's grow unbounded with the number of customers. If anything, having a reasonable upper bound for how high a n you have to support is the more common case - and you're going to need that with O(n) as well.
I feel this is too hardline and e.g. eliminates the useful things people do with SAT solvers.
The first SAT solver case that comes to mind is circuit layout, and then you have a k vs n problem. Because you don’t SAT solve per chip, you SAT solve per model and then amortize that cost across the first couple years’ sales. And they’re also “cheating” by copy pasting cores, which means the SAT problem is growing much more slowly than the number of gates per chip. Probably more like n^1/2 these days.

If SAT solvers suddenly got inordinately more expensive you’d use a human because they used to do this but the solver was better/cheaper.

Edit: checking my math, looks like in a 15 year period from around 2005 to 2020, AMD increased the number of cores by about 30x and the transistors per core by about 10x.

Gaussian elimination (for square matrices) is O(n^3) arithmetic operations and it's one of the most important algorithms in any scientific domain.
I’ll allow that perhaps I should have said “cubic” instead of “quadratic” - there are much worse orders in the menagerie than n^3. But it’s a constraint we bang into over and over again. We use these systems because they’re cheaper than humans, yes? People are still trying to shave off hundredths of the exponent in matrix multiplication for instance. It makes the front page of HN every time someone makes a “breakthrough”.
So, how would you write a solver for tower of Hanoi then? Are you saying you wouldn't?
As a business? Would you try to sell a product that behaved like tower of Hanoi or walk away?
Good call. O(N^2) is the worst time complexity because it's fast enough to be instantaneous in all your testing, but slow enough to explode in prod.

I've seen it several times before, and it's exactly what happened here.

We just had this exact problem. Tests ran great, production slowed to a crawl.
I was just helping out with the network at an event. Worked great in testing, but it failed in production due to unicast flooding the network core. Turns out that some of the PoE Ethernet switches had an insufficiently sized CAM for the deployment combined with STP topology changes reducing the effective size of the CAM by a factor of 10 on the larger switches. Gotta love when packet forwarding goes from O(1) to O(n) and O(n^2)! Debugging that in production is non-trivial as the needle is in such a large haystack of packets so as to be nearly impossible to find in the output of tcpdump and wireshark. The horror... The horror...
First big project I worked on a couple of us sped up the db initialization scripts so we could use a less trivial set of test data to stop this sort of shenanigans.

Things like inserting the test data first and turning on constraints and possibly indexes afterward.

Modern computers are pretty great at scanning small blocks of memory repeatedly, so n^2 can be faster than the alternative using a map in cases for small N.

I spent a lot of time fixing n^2 in blink, but there were some fun surprises:

https://source.chromium.org/chromium/chromium/src/+/main:thi...

For large N without a cache :nth-child matching would be very slow doing n^2 scans of the siblings to compute the index. On the other hand for small sibling counts it turned out the cache overhead was noticably worse than just doing an n^2. (See the end of the linked comment).

This turns out to be true in a lot of surprising places, both where linear search beats constant time maps, and where n^2 is better than fancy algorithms to compensate.

Memory latency and instruction timing is the gotcha of many fun algorithms in the real world.

This is true. But unless you are sure that current and future inputs will always be small I find it is better to start with the algorithm that scales better. Then you can add a special case for small sizes if it turns up in a hot path.

This is because performance is typically less important for the fast/small case and it is generally acceptable for processing twice as much to be twice (or slightly more than twice) as slow, but users are far more likely to hit and really burned by n^2 algorithms in things you thought would almost always be small and you never tested large enough sizes in testing to notice.

I wrote more on this topic here https://kevincox.ca/2023/05/09/less-than-quadratic/

A lot of computations are really higher complexity order functions with stair steps at certain intervals based on hardware trying to pretend they are constant time. All operations cost the same amount until n doubles again and then it’s slower. If you zoom out toward infinity, the stair steps smooth out into a logarithmic curve. It becomes logarithmic in the former case and square root in the latter. Even dividing two numbers or doing a memory address lookup stops being C, which is part of why prime factoring worked for RSA for so long.

If anyone had made clockless logic work you would see that adding 1 + 1 is in fact faster than adding 2^63 + 1.

If you put enough data into a hash table the key length has to increase logarithmically to the table size in order to have distinct keys per record. Even Knuth points out that hash tables are really nlogn - something I’m pretty sure my CS professors left out. In multiple classes. Man, did I get tired of hash tables, but I see now why they harped on them. Case on point, this article.

    > where linear search beats constant time maps
Can you give an example? You said lots of good things in your post, but I struggling to believe this one. Also, it would help to see some wall clock times or real world impact.
Pick any compiled language and test it. Pick an algorithm making heavy use of a small (<10, maybe up to a hundred elements) hashset, and try using a linear structure instead. The difference will be most apparent with complicated keys, but even strings of more than a few characters should work.

Some example workloads include:

1. Tokenization (checking if a word is a keyword)

2. Interpretation (mapping an instruction name to its action)

3. ML (encoding low-cardinality string features in something like catboost)

4. JSON parsers (usually key count is low, so parse into a linear-scan hashmap rather than a general-purpose hashmap)

Details vary in the exact workload, the hardware you're using, what other instructions you're mixing in, etc. It's a well-known phenomenon though, and when you're doing a microoptimization pass it's definitely something to consider. 2x speedups are common. 10x or more happen from time to time.

It's similar to (but not _quite_ the same as) the reason real-world binary search uses linear scans for small element counts.

When you go to really optimize the system, you'll also find that the linear scan solution is often more amenable to performance improvements from batching.

As to how much it matters for your composite program? Even at a microoptimization level I think it's much more important to pay attention to memory access patterns. When we wrote our protobuf parser that's all we really had to care about to improve performance (33% less execution time for the entire workload, proto parsing being much better than that). You're much more likely to be able to write sane code that way (contrasted with focusing on instructions and whatnot first), and it's easier to layer CPU improvements on top of a good memory access pattern than to go the other way around.

You've got to keep in mind that computers aren't the 1-instruction-at-a-time purely sequential machines anymore.

Let's say you've got a linear array of bytes, and you want to see if it contains a specific value. What would a modern CPU need to execute? Well, we can actually compare 64 values at a time with _mm512_mask_cmp_epu8_mask! You still need a little bit of setup and a final "did any of them match" check, of course. Want to compare 512 values? You can probably do that in less than 10 clock cycles with modern machines

Doing the same with a hash set? Better make sure that hash algorithm is fast. Sure it's O(1), but if calculating the hash takes 20 cycles it doesn't matter.

This is a good point.

A string search algorithm that uses SIMD to do quickly discard a majority of 16, 32 or 64 attempts in parallel, and then verify the surviving ones quadratically (again 16, 32 or 64 bytes at a time) can go a very long way against a sublinear algorithm that understands needle structure, but necessarily needs to process the haystack one byte at a time.

My rule of thumb for 80%-90% of the problems is, if you need complicated algorithm, it means your data model isn't right. Sure, you do need complicated algorithms for compilers, db internals, route planning et all, but all things considered, those are minority of the use cases.
This is not a complicated algorithm. A hash map (dictionary) or a hash set is how you would always do deduplication in Python, because it is easiest to write / least keystrokes anyway. That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.

    > That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
I am confused. There are plenty of open source, fast hash map impls in C.
Yes, the problem is getting them into your project.
That's only a problem if you have never done any C development.
This isn't knapsack. This is a dict lookup.
I wrote a funny algorithm to group together words that end the same way to write them once in my binary wordlist file, since there is an array that points to the start character and a \0 to end the word. My initial solution was O(n²) but it was too slow on a real wordlist so I had to come up with something better. In the end I just sort the list with quicksort, but revert the order of the words and then the groupable ones end up nearby each other.
I'd say the exception is when `n` is under about 10, and is counting some sort of hardware constrained thing (e.g. some operation over all CAN interfaces pesent on an OBDII connector can be O(n^(2)) since n will always be between 1 and 4). If you wouldn't have to physically replace hardware for `n` to increase, you really need to avoid n^2 operations. And even then consider them carefully, perhaps explicitly failing if `n` gets too big to allow for noticing rework is needed before new hardware hits the field.
> perhaps explicitly failing if `n` gets too big

That's the problem. A lot of these quadratic time algorithms don't set limits.

Even 'n!' is fine for small 'n'. Real production use cases don't have small 'n'.

> Real production use cases don't have small 'n'.

Real production use cases absolutely have small n. You don't hear about them, because it's very hard for them to cause issues. Unless the use case changes and now the n is not small anymore and nobody noticed the trap.

Or, phrased differently, if n has an upper limit, the algorithm is O(1).
As long as you have tests that regularly exercise your algorithm at n=max where you would notice if they were exceptionally slow
I have an app that's been running an O n^2 algorithm in "production" (free open source app used by various communities) for about half a year now.

It's been fine because "n" is "number of aircraft flying in this flight simulator" - and the simulator's engine starts to fail above around 2000 anyway. So even in the worst case it's still going to run within milliseconds.

For those of us without a formal comp sci background, is there a way for the IDE to detect and warn about these automatically? Or any other easy telltale signs to look for?

As a self taught dev, when I encounter nested loops, I have to mentally go through them and try to see if they iterate through each item more than once. But that's not a very foolproof method.

Too much domain knowledge for an IDE to catch. I'm self taught as well, and it comes down to spending more time thinking about the code than writing the code.

It's a fairly simple thought experiment to ask yourself what if there was 10x items in this array? 100x? That is essentially what the O(n) notation is trying to quantify. You just don't need to do it that formally.

I have an n^3 operation that's currently a huge bottleneck at only 10k elements. Not sure how to fix it.
I once looked into tree diffing algorithms (the literature is all about diffing XML even though it's really trees). The obvious dumb algorithm (which seems to be what everyone uses) is n^4.