Hacker News new | ask | show | jobs
by exDM69 782 days ago
I wrote almost the exact same thing after seeing Sebastian Aaltonen's original offset allocator repo which inspired me to write my own clone in Rust.

It's possible to further improve the fragmentation characteristics if the maximum size of the memory to be allocated is known ahead of time. The original used a 5.3 "floating point" scheme for the free bins, but given the maximum size you can opt for 4.4 or 6.2 or whatever can cover the whole region, giving more dense (or sparse) allocation granularity as needed.

This same idea can also be extended to allocating texture atlas regions when the regions have power of two size. 256 bins can cover all possible texture sizes that GPUs can handle, so you can get hard O(1) texture packing.

I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README). For that it would be necessary to make a reverse lookup from pointer address to allocation handle, which would require something like a red-black tree covering the address space, which would no longer be 0(1). If anyone has ideas on this front, I would be happy to hear them.

This algorithm is so clever and kind of a game changer. Allocating and freeing are fast real time operations (only dozens of CPU cycles and a few memory writes). It kind of makes "bump allocators" obsolete because this allocator is almost as fast but supports trimming and freeing allocations as well. This algorithm requires more memory but the amount is quite low.

3 comments

> For that it would be necessary to make a reverse lookup from pointer address to allocation handle, which would require something like a red-black tree covering the address space, which would no longer be 0(1). If anyone has ideas on this front, I would be happy to hear them.

A radix tree can solve this: https://en.wikipedia.org/wiki/Radix_tree

I used one way, way back to do exactly the same thing: upon a free, I needed to look up all of the metadata for an address. For a 32-bit address space, you can just allocate a giant array up front, and use the address as an index. For a 64-bit address space, it's obviously way too big to statically allocate it up front. A radix tree neatly solves the problem. An arbitrarily sized radix tree is not constant lookup, but for reasons I honestly cannot remember, for a 64-bit address space, it's guaranteed to only be 3 levels deep.

See my implementation, which (I believe) I borrowed from tcmalloc: https://github.com/scotts/streamflow/blob/master/streamflow....

Radix trees are not O(1), they're O(log n). Data structures that support constant-time predecessor lookup do not exist, there is a super-constant lower bound, even in the RAM model [1].

Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.

Asymptotic bounds must be presented and understood in context.

[1] https://en.wikipedia.org/wiki/Predecessor_problem#Mathematic...

This function is O(1): https://github.com/scotts/streamflow/blob/master/streamflow..... All other tree operations are also constant, because the fact that it is a 3-level tree is hardcoded.

Asymptotic bounds are useful when your input can grow arbitrarily large. When your input is fixed, the bounds become less useful and you should focus on the particulars of your case. In the case I'm presenting, the work done traversing the tree will be the same every time; it is O(1). That doesn't necessarily mean it's fast enough! It still may do too much work for the use case. For instance, I can imagine someone saying that the function above does too much pointer chasing for their use case.

> This function is O(1)

I think I addressed that in my comment, but to be more explicit, this function is O(1) too:

  size_t find_sorted(void* object, void** list, size_t size) {
    for (size_t i = 0; i < SIZE_MAX; ++i) {
      if (i < size && list[i] > object) return i;
    }
    return SIZE_MAX;
  }
If O(1) cannot distinguish your function from this function, what is its informational value?

> Asymptotic bounds are useful when your input can grow arbitrarily large

But your inputs can't grow arbitrarily large, that's why you can hardcode 3 levels. O(1) is an asymptotic bound, and my point is that it is not very informative here.

I'm really not sure what you're overall point here is. Yes, I agree with you. Your function is O(2^64) which is technically O(1). Your point, which I agree with, is that's completely useless information. It does not help our understanding of performance at all. Calling such a function O(1) is technically true, but both misleading and not informative. What I'm not clear on is how that relates to the discussion we're having here.

The original poster said they wanted a O(1) solution to a problem. I presented one. That solution happens to be based on a data structure whose algorithms are, in the general case, O(log n). But we're not dealing with a general case, we're dealing with a specific case. And because of that specific case, we can write algorithms that are O(1). Unlike your example, these algorithms have a very small n; 3, to be exact. That is meaningful to describe as O(1) in this case because we can reduce the work down to a small constant.

My overall point is that neither your function or my function are actually O(1). Whenever you see the notation O(...), there is an implicit context "As input size n grows arbitrarily, ...". You can check the formal definition on Wikipedia.

The cost function for both our functions is not defined for arbitrary n, because they both stop working when input size crosses a threshold. So the O(1) notation is not well-defined in this case.

Now you could come up with a different formal definition for O(1) for bounded input sizes, which is fine, but I don't think you can find one that makes your function O(1) and my function non-O(1). So it would be not be a meaningful definition in this case.

Ultimately, you're using O(1) colloquially. In your words, calling my function O(1) is misleading while it is fine for yours because the constant is "small". "Small" is a subjective term, while O(1) is a formal term.

If your definition hinges on a subjective characterization, why not just say "it's fast", instead of incorrectly using a technical term?

(If we really want to be pedantic, there is really no such thing as "constant-time" when accessing memory, a TLB miss for example will make the CPU traverse a tree; a page fault can execute arbitrary code).

This is O(n) because you're still doing the i < size comparison, even though you've moved it out of the for loop.
For almost all n (size), the function runs for MAX_SIZE steps, since almost all numbers are greater than MAX_SIZE. And it never runs for more than MAX_SIZE steps.
Given that pointer size is fixed, maybe asymptotic bounds aren’t what we should care about? Maybe it would be better to ask how the cost increases just going up to 2^64 bytes of address space. The graph after that point is irrelevant.

An O(n^2) algorithm will blow up well before that.

> Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.

It's not meaningless. We're designing for a real machine here, and the actual goal is to bound the runtime to a small known constant.

You're right that big O is not great notation for this. Hell, even O(1) isn't good enough because the constant factor is unspecified.

But the misuse of notation does not invalidate the problem. Taking the worst case for a log leaves you with a feasible runtime. Taking the worst case for n or n^2 or "any bounded algorithm" overwhelmingly does not.

> Asymptotic bounds must be presented and understood in context.

Yes, context is exactly what keeps it meaningful!

I did this too! Except it was based on the original TLSF paper (on which this work is based) as I wasn't aware of that offset allocator.

> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).

In a general purpose allocator, you can just store allocation metadata next to a payload so the lookup from a pointer address becomes O(1). The original TLSF algorithm (as proposed by the paper) was designed this way.

Months later, I also wrote a general purpose implementation of TLSF for Rust `GlobalAlloc`, which, with some rigorous optimizations, turned out to be faster and smaller than every allocator for WebAssembly and embedded environments I tested, and it's still my go-to choice. It didn't fare too bad even when compared against modern thread-caching allocators in normal use cases. Internal fragmentation is probably worse, though.

> In a general purpose allocator, you can just store allocation metadata next to a payload so the lookup from a pointer address becomes O(1).

Yes, but the downside is that now allocator metadata pollutes the cache. It's super efficient for the allocator, but it may harm efficiency of the actual use of that memory. I believe most production allocators don't use object headers for this reason.

> I believe most production allocators don't use object headers for this reason.

Isn’t the original reason hardening against buffer overflows? Overwriting allocator metadata can be used to attack a system.

Yes. I believe that the standard workaround for this problem is to place the allocator metadata for an entire block at some fixed address boundary, so you can just mask off the bottom K bits of an address to find the allocator metadata for the block, and from there find the metadata of the allocation you're concerned with.
Have you published your code to GH, by any chance?
Offset allocator: https://github.com/yvt/xalloc-rs/blob/master/src/tlsf.rs Don't mind the mess, it's one of my earliest works in Rust.

General-purpose allocator: https://github.com/yvt/rlsf

> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).

Besides the reverse mapping, you'd also have to get rid of the Vec use in the allocator, but that's fairly straightforward. Additionally, you'd need some sort of logic about requesting slabs of memory from the OS (and returning them when free), which is some extra work.