Hacker News
by ot 782 days ago
Radix trees are not O(1), they're O(log n). Data structures that support constant-time predecessor lookup do not exist, there is a super-constant lower bound, even in the RAM model [1].

Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.

Asymptotic bounds must be presented and understood in context.

[1] https://en.wikipedia.org/wiki/Predecessor_problem#Mathematic...

3 comments

This function is O(1): https://github.com/scotts/streamflow/blob/master/streamflow..... All other tree operations are also constant, because the fact that it is a 3-level tree is hardcoded.

Asymptotic bounds are useful when your input can grow arbitrarily large. When your input is fixed, the bounds become less useful and you should focus on the particulars of your case. In the case I'm presenting, the work done traversing the tree will be the same every time; it is O(1). That doesn't necessarily mean it's fast enough! It still may do too much work for the use case. For instance, I can imagine someone saying that the function above does too much pointer chasing for their use case.
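
To make the fixed-depth lookup concrete, here is a minimal sketch in C. It is not the streamflow code linked above: the 8-bits-per-level split, the 24-bit key, and the names `node`, `radix_insert`, and `radix_find` are all assumptions made up for illustration. The point is only that a lookup touches at most three nodes regardless of how many entries are stored.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: 3 levels, 8 key bits per level (24-bit keys). */
#define BITS   8
#define FANOUT (1u << BITS)
#define MASK   (FANOUT - 1u)

typedef struct node {
    void *slot[FANOUT]; /* child nodes on levels 1-2, stored values on level 3 */
} node;

/* Store value under key, allocating interior nodes on demand.
 * Returns 0 on success, -1 if allocation fails.
 * Assumes calloc's all-zero memory reads back as NULL pointers. */
static int radix_insert(node *root, uint32_t key, void *value) {
    node *cur = root;
    for (int shift = 2 * BITS; shift > 0; shift -= BITS) {
        size_t idx = (key >> shift) & MASK;
        if (cur->slot[idx] == NULL) {
            cur->slot[idx] = calloc(1, sizeof(node));
            if (cur->slot[idx] == NULL) return -1;
        }
        cur = cur->slot[idx];
    }
    cur->slot[key & MASK] = value;
    return 0;
}

/* Look up key: exactly three array indexings in the worst case,
 * independent of the number of stored entries. */
static void *radix_find(const node *root, uint32_t key) {
    const node *l1 = root->slot[(key >> (2 * BITS)) & MASK];
    if (l1 == NULL) return NULL;
    const node *l2 = l1->slot[(key >> BITS) & MASK];
    if (l2 == NULL) return NULL;
    return l2->slot[key & MASK];
}
```

The depth is a property of the key width, not of the number of entries, which is the sense in which the work is "the same every time".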

> This function is O(1)

I think I addressed that in my comment, but to be more explicit, this function is O(1) too:

  size_t find_sorted(void* object, void** list, size_t size) {
    for (size_t i = 0; i < SIZE_MAX; ++i) {
      if (i < size && list[i] > object) return i;
    }
    return SIZE_MAX;
  }

If O(1) cannot distinguish your function from this function, what is its informational value?

> Asymptotic bounds are useful when your input can grow arbitrarily large

But your inputs can't grow arbitrarily large, that's why you can hardcode 3 levels. O(1) is an asymptotic bound, and my point is that it is not very informative here.

I'm really not sure what your overall point here is. Yes, I agree with you. Your function is O(2^64), which is technically O(1). Your point, which I agree with, is that this is completely useless information. It does not help our understanding of performance at all. Calling such a function O(1) is technically true, but both misleading and uninformative. What I'm not clear on is how that relates to the discussion we're having here.

The original poster said they wanted an O(1) solution to a problem. I presented one. That solution happens to be based on a data structure whose algorithms are, in the general case, O(log n). But we're not dealing with a general case, we're dealing with a specific case. And because of that specific case, we can write algorithms that are O(1). Unlike your example, these algorithms have a very small n; 3, to be exact. That is meaningful to describe as O(1) in this case because we can reduce the work down to a small constant.

My overall point is that neither your function nor my function is actually O(1). Whenever you see the notation O(...), there is an implicit context "As input size n grows arbitrarily, ...". You can check the formal definition on Wikipedia.
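
For reference, the usual textbook statement of that definition, written out (the symbols $f$, $g$, $c$, and $n_0$ are the conventional names, not anything specific to this thread):

```latex
f(n) = O(g(n))
\iff
\exists\, c > 0,\ \exists\, n_0,\ \forall\, n \ge n_0 :\ |f(n)| \le c \cdot g(n)
```

The quantifier "for all $n \ge n_0$" is the part that presupposes $n$ can grow without bound.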

The cost function for both our functions is not defined for arbitrary n, because they both stop working when input size crosses a threshold. So the O(1) notation is not well-defined in this case.

Now you could come up with a different formal definition for O(1) for bounded input sizes, which is fine, but I don't think you can find one that makes your function O(1) and my function non-O(1). So it would not be a meaningful definition in this case.

Ultimately, you're using O(1) colloquially. In your words, calling my function O(1) is misleading while it is fine for yours because the constant is "small". "Small" is a subjective term, while O(1) is a formal term.

If your definition hinges on a subjective characterization, why not just say "it's fast", instead of incorrectly using a technical term?

(If we really want to be pedantic, there is no such thing as "constant-time" when accessing memory: a TLB miss, for example, will make the CPU traverse a tree, and a page fault can execute arbitrary code.)

Ah, in my context, N is the number of live allocated objects that the memory allocator knows about. If you use a data structure like a red-black tree to track the metadata, the work you do traversing and maintaining the tree will grow log N with the number of live allocated objects you're tracking. The radix tree specialization I presented is constant with respect to the number of live allocated objects.
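
A rough sketch of that comparison in C, under stated assumptions: it uses the standard upper bound of 2*log2(N+1) on the height of a red-black tree with N nodes, rounded up to a convenient integer, and a hypothetical fixed depth of 3 for the radix tree. The function names are made up for illustration, and this counts nodes visited, not actual time.

```c
#include <stdint.h>

/* floor(log2(x)) for x >= 1, computed by repeated shifting. */
static unsigned floor_log2(uint64_t x) {
    unsigned r = 0;
    while (x >>= 1) ++r;
    return r;
}

/* Integer upper bound on red-black tree height for n >= 1 nodes.
 * Uses height <= 2*log2(n+1) and log2(n+1) <= floor(log2(n)) + 1. */
static unsigned rb_height_bound(uint64_t n) {
    return 2 * (floor_log2(n) + 1);
}

/* Nodes visited by a lookup in a 3-level radix tree: does not depend on n. */
static unsigned radix_depth(uint64_t n) {
    (void)n; /* deliberately unused */
    return 3;
}
```

With these definitions, `rb_height_bound` grows by 2 each time N doubles, while `radix_depth` stays at 3 for every N.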

This is O(n) because you're still doing the i < size comparison, even though you've moved it out of the for loop's condition.

For almost all n (size), the function runs for SIZE_MAX steps, since almost all numbers are greater than SIZE_MAX. And it never runs for more than SIZE_MAX steps.

Given that pointer size is fixed, maybe asymptotic bounds aren’t what we should care about? Maybe it would be better to ask how the cost increases just going up to 2^64 bytes of address space. The graph after that point is irrelevant.

An O(n^2) algorithm will blow up well before that.

> Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.

It's not meaningless. We're designing for a real machine here, and the actual goal is to bound the runtime to a small known constant.

You're right that big O is not great notation for this. Hell, even O(1) isn't good enough because the constant factor is unspecified.

But the misuse of notation does not invalidate the problem. Taking the worst case for a log leaves you with a feasible runtime. Taking the worst case for n or n^2 or "any bounded algorithm" overwhelmingly does not.
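
A back-of-the-envelope version of that claim in C. The 1 ns per step figure is purely an assumption to get an order of magnitude, and the function names are made up for illustration.

```c
#include <stdint.h>

#define NS_PER_STEP      1.0                      /* assumed cost of one step */
#define SECONDS_PER_YEAR (365.25 * 24.0 * 3600.0)

/* Worst-case step count for a log2(n) algorithm when n is capped at 2^64. */
static double log_steps(void) { return 64.0; }

/* Worst-case step count for a linear algorithm under the same cap.
 * UINT64_MAX rounds to exactly 2^64 when converted to double. */
static double linear_steps(void) { return (double)UINT64_MAX; }

/* Convert a step count to seconds under the assumed per-step cost. */
static double worst_case_seconds(double steps) {
    return steps * NS_PER_STEP * 1e-9;
}
```

Under that assumption the logarithmic worst case is 64 ns, while the linear worst case comes out to roughly 584 years, and n^2 is far worse still.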

> Asymptotic bounds must be presented and understood in context.

Yes, context is exactly what keeps it meaningful!