Hacker News new | ask | show | jobs
by ncw33 3589 days ago
In no so many words, the article is pointing out that O(n) + O(1) is not necessarily quicker than O(n) + O(n): both add to O(n) (since only the highest-order factor matters), and the constant factor is unsurprisingly better for array lists.

He's not benchmarking an O(1) operation against an O(n), or anything surprising, or even pointing out a "hidden" O(n) operation, it's simply a demonstration that O(n) + O(1) = O(n).

2 comments

Of course this was a single example.

The main point is that the big O notation ignores the cache misses.

For example:

- try writing a quicksort as a template to avoid the comparison function calls

- in the sorting loop, switch to an insertion sort when buckets have <= 32 elements

You will see speed improvements of 2 to 3 times faster than a vanilla implementation of quicksort.

Although this approach is slightly more complex on the abstract level, it's faster because it reduces lots cache misses.

Surely the main point is that

    O(n) + O(n) = O(2n) = O(n) = O(n + 1) = O(n) + O(1)
and the constant factors left out of O() often dominate execution time.

Cache performance is just one example of a constant factor, right?

Cache performance is not constant, if it was, we wouldn't even consider it when optimizing.

The point is to use data that was recently fetched into the (faster) cache memory as much as possible instead of incuring the penalty of a cache miss

Cache performance really depends on memory access patterns.

Anyways :) I'm being pedantic here, I should probably go back to work.

Memory access time is a constant multiplier, whether it hits or misses the cache. There can be a big difference in that constant depending on the access pattern, but it is still considered a constant factor.

Yes, random access (missing the cache every time) may be 100x slower than sequential (hitting the cache almost always), but if you're iterating through an array twice as large, it will still be 100x slower.

I agree, on the same machine, the slow-downs are constant for each level of cache and for ram access.
One of my favorite ACM papers was an analysis on how the asymptotic runtime behaviors of various sort algorithms didn't tell the full story, and that the faster algorithms took a pretty big data set before they even broke even with some of the other sort algorithms. In fact I think at even 10,000 records it was still about 30% slower than a 'worse' algorithm and you had to get up to sorting a considerable amount of data before it was 'best'.

I don't know about you, but I don't sort 100k records in a single batch, and if I am it's because I messed up. But I might sort different batches of 100 records 1000 times a minute.

Not necessarily. You might want to create an algorithm that iterates in a sequential fashion through memory, since your memory controller will see the iteration and prefetch the next block of memory. This is a huge performance win on larger data structures.
Also, Big-O notation is a theoretical construct for thinking about complexity that works well for academic proofs but glosses over many real world considerations. Those ignored constant factors can be large and the datasets are not always large enough to amortize their cost.
Furthermore amortization isn't always accounted for in Big-O. Inserting into into the head of a linked list is O(1), and for a vector it is O(n). But on modern hardware vector inserts are much faster.

If your software engineering begins and ends with Big-O notation, your likely doing a terrible job. Measure, and test. Always.

I would say, "If your software engineering begins and ends with [jargon] you're likely doing a terrible job. Measure, and test. Always."

As a field, do we measure and test the effect of using various programming paradigms, patterns, and coding standards? I don't think most companies do this. Most developer decisions are made by "gut." As a field, we are still more like the "before" of Moneyball than the after.

There's a third option: analysis and understanding.

This seems to be entirely forgotten by the "measure, test" crowd - the scientific method requires a hypothesis to test for measurements to be meaningful, and separately not all things that are produced are "tested" in every aspect, since many things are well established, and the application of that existing theoretical framework makes measurement and testing a waste of time.

Having a theoretical understanding of the thing you care about is far more important than the "measure, test" step - which is certainly important, but by itself of really limited power.

The programming field isn't lacking in typical programmers with hypotheses. What it's lacking are typical programmers who are checking their hypotheses.
If you take the word hypothesis at its meanest, and ignore everything else I said, sure. But a hypothesis that is grounded in an absence of analysis or understanding of the situation, or an appreciation of the existing body of literature related to a topic, or any other informing principle, is of limited value.

i.e., it's not about checking your hypothesis, it's about checking the method by which you formed your hypothesis.

Checking your hypothesis is fine, but of value only to reject, not to actually find a hypothesis you can accept.

> the scientific method requires a hypothesis to test for measurements to be meaningful

That's the simplistic grade school version. In practice, characterization and measurement is part of a feedback cycle that is used to develop and refine hypotheses. Experiments need a hypothesis to test, but measurements do not require experiments. Where the "measure, test" crowd usually goes off the rails is in not entering that scientific feedback cycle after the initial measurements and instead jumping to conclusions based on gut feel interpretation of them.

I deleted from my comment an explicit calling out of this feedback loop, as it was implied, I thought, by the statement that the measurement and testing were both important, but of limited value by themselves.

However I disagree that measurements have any meaning unless they are conducted as experiments, or potentially as exploratory work looking for an experiment to conduct. Even such exploratory work needs to be driven by at least the loosest of hypotheses, since measuring everything is infeasible. So this is just as much an experiment, just one you don't have a high expectation of predicting the outcome of.

Either way, this recognition of the necessity of at least a minimal hypothesis in all of these scenarios is typically lacking is my fundamental point.

The cycle of revision of that hypothesis is clearly the next step and, as I say, I thought clearly implied.

I am not buying your distinction between measurements and experiments - experiments are measurements with a purpose.

You don't have to start with measurements. You can often do some analysis before you have anything to test, and that may save you a lot of time measuring, testing and tinkering with something that doesn't have a prayer of working. If you are measuring the environment that the system will interact with, you have already done some analysis in determining what to measure, and developed some theories about what is going to matter; that is hard to do without some causal reasoning. If your tests show you have a problem, you need to think about what is going on in order to find a solution.

There's a curious anti-intellectualism in software development that is opposed to any suggestion that thinking about problems is useful. It seems to go hand-in-hand with a simplistic world view that only sees dichotomies, together with the assumption that there can only be one answer.

Hell, let's go a step farther. If your software engineering begins and ends with jargon, you're likely doing a terrible job.

(the business people always pick the worse solution that sounds better unless you can explain why they should care, in english, meant for normal human beings, without sounding condescending about it)

What's worse, the field is constantly changing. What was true on one architecture may not be true on another, and even different generations of a single architecture can change a lot.