Hacker News new | ask | show | jobs
by JeanPierre 4250 days ago
I understand your concern, but I don't really think the problem is within Clojure people: When I started out explaining the structure to other people and that most operations have O(log n) runtime, people dismissed Clojure as being too slow for practical purposes. Part of this is because programmers, not Clojure devs in particular, connotate O(log n) with "much slower than O(1)". However, this data structure is incredibly much faster than the "O(log n)" people are generally used to.

And this is the problem. As you yourself say, most programmers usually associate/assume "Big-O notation is intended to describe how the time required for an operation grows as the number of items being handled gets large".

But formally, this is not the case at all. O(g(x)) tells you the that the worst case input to the algorithm will require less operations (usually operations, sometimes memory or other things) than the function g multiplied by some positive constant. g is usually determined by the size of the input. (This definition is also simplified somewhat, but it illustrates the problem)

We can say that ArrayLists.add have O(n) runtime and that quicksort have O(n²) runtime, but they both also have O(n³) and O(n!) runtime.

In addition, Big-O does not describe how the algorithm runs in the average case, or how frequent the worst case is. In and of itself, the Big-O notation tells us almost nothing about an algorithm – you barely conclude anything from it. I guess that is part of why it has become this very simplified term with a very wibbly-wobbly definition from time to time.

So when you're saying that calling it "practically O(1)" is lying, I agree if this was in an academic context. But for the average programmer (regardless of language!), O(g(x)) is usually thought of as the "worst case runtime for an algorithm". Since the actual runtime is – well – effectively constant, I considered it okay.

---

That being said, I have formally explained the persistent vector if you're interested in the academic version. The background chapter in http://hypirion.com/thesis.pdf should cover it.

3 comments

I understand that log-time algorithms have a marketing problem. I think the correct way to address that is the one I already suggested: give benchmark numbers for lookup, relative to ArrayList, for a typical collection size (or maybe for a few sample sizes -- 32, 1024, and 32768, perhaps). These will clearly show both that lookup is fast on small seqs and that it slows down logarithmically.

I would be more sympathetic to your argument that O(log n) is "practically" O(1) if you would put it in context. It's certainly true that how much the difference matters depends on the application. For someone doing real-time work (HFT, for example) it could be critical. For the average web app, it matters extremely little; the programmer productivity gains offered by functional collections far outweigh the small cost in CPU cycles. I totally agree with that argument, and have made it myself many times, but I still wouldn't say that log time is "practically" constant time for all applications.

Agreed on the benchmarks – that's the next (and last) part on the list I'm going to do with this series.

And yes, of course, at some point, constant factors break into applications. Would it be better if add in a short comment on "practically O(1) for non-HPC applications", or would you phrase it differently to give it better context?

Ah, good -- numbers will clarify the situation considerably.

Personally my preference would be to argue that for applications that aren't terribly performance-sensitive, log time in general just isn't that bad. Even binary trees are quite usable for many purposes. (I believe the Linux kernel uses red-black trees for various things.) For a general-purpose sequence data structure that will be used in a variety of ways, a very good case can be made that it's worth accepting log-time lookup in order to get log-time insertion, subsequence, and concatenation -- operations that all take linear time on an ArrayList (and linear time will kill you far sooner than log time!).

On top of that, Hickey, following Bagwell, has done a great job at keeping the constant factors down, as (I presume) the benchmark numbers will show.

Another point I would make is that true random access into a seq is not the only way elements are retrieved. Often this is done via an iterator. I presume the Clojure seq iterator takes amortized constant time per element (it should); if so, I think it would be worth pointing that out.

All that said, I guess I could live with "practically O(1) for non-HPC applications".

Sorry I was so combative. I had not looked closely at Clojure seqs, so I do appreciate this series of posts you have written.

you should probably try "practically O(1) for the average web app", if you want to keep that phrasing.

Better to skip the "practically O(1)" etc bits and just say O(log n).

Cool you seem to be the author.

Instead of telling people that it's O(log n) you should just show benchmarks instead.

Hitting L1 takes 1ns but main memory takes 100ns. So although you can say accessing an element in an array is always O(1) = O(100), it is hiding a factor of a 100 depending on how it is accessed. Having a log in the big Oh usually means it is jumping around a lot and is no longer a cache friendly algorithm, so that worries me more than whether it is log_2(x) or log_2(x)/5.

> However, this data structure is incredibly much faster than the "O(log n)" people are generally used to.

Is it? I'll bet it's actually much slower than familiar O(log n) operations like binary search, in particular due to the memory allocations.

If you're comparing a binary search to a lookup on a Clojure hashmap, there should be no memory allocations necessary.