Hacker News new | ask | show | jobs
by lionkor 783 days ago
> Please note that offset-allocator isn't a Rust allocator conforming to the GlobalAlloc trait

Aw. Why not?

> find the next available bin using 2x LZCNT instructions to make all operations O(1)

Isn't that sort of implementation defined? The LZCNT operation itself is a loop over all bits -- the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64. But if it was 16 or 32 bit, it may be faster, which means its not O(1), or rather, big o notation really breaks down here.

6 comments

> It just makes it O(n) where n := 64

That's O(1).

> The LZCNT operation itself is a loop over all bits

That's not how circuits work. You can easily make a circuit of O(log n) depth that returns base-2 encoded index of the first 1 bit in a n-bit register.

But since n is just 64 here, you're looking at a circuit of AND/OR depth ~8, so it's no surprise that modern CPUs can compute the LZCNT / TZCNT of a 64-bit integer in a single CPU cycle.

Note that rustc will not emit this instruction by default, to support older CPUs. Without `target-cpu=haswell` or similar, the operation will be quite slow, but still O(1) of course.
It's not that bad, all x86 processors have the BSR instruction which can be mapped to LZCNT semantics with just a few extra instructions.

Left to right - modern x86-64, baseline x86-64, baseline i686:

https://rust.godbolt.org/z/nGsM35TEo

Maybe you're thinking of POPCNT, which hurts a lot more if it doesn't compile down to the native instruction:

https://rust.godbolt.org/z/xcxG3v4Mn

> That's O(1).

A loop over 2^64 elements is also O(1), but I don't think people are ok to label every program that can run on a 64 bit pc as O(1).

You should really refresh yourself on Big-O notation.
Whether you call that O(1) or O(n) depends on the level of abstraction you want to talk about, and what you want to analyse.

For example, in some sense, anything you can do on a finite computer is O(1) (or alternatively, it's an infinite loop). But that is a very boring sense, and seldom useful for analysis.

There's a formal definition for big-O notation, but most of the parameters to that formal definition are only implied in casual discussions.

> Aw. Why not?

Because the global allocator API is defined as free(pointer address) and this used free(buffer handle).

It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

> the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64

O(n) when n is constant is equal to O(1).

But lzcnt uses a fixed number of clock cycles so it's not really O(n) either.

GlobalAlloc is also required to be thread safe, and TLSF doesn't even attempt to handle that. I suppose you could get away with it on single-threaded platforms like microcontrollers or (vanilla) WebAssembly but it wouldn't generalize unless you wrapped it in a mutex, and then the performance would be terrible.
Correct me if I'm wrong but can't you put a Rc<RefCell<OffsetAllocator>> in your std::alloc::Allocator implementation for interior mutability?

This make a non-thread safe allocator, with the (desired) side effect of making anything allocating from it (e.g. Vec) a !Send.

Or is there a requirement that Allocators must be Sync and Send?

This is about GlobalAlloc, the trait that actually exists in stable Rust. Only static variables can be marked as the #[global_allocator], and statics are always required to be Send + Sync. Of course, if you're really sure that only a single thread will ever exist, you can pinky-promise it by putting a couple unsafe impls on the allocator type. But then the language won't protect you if you do try to allocate or deallocate from multiple threads.

(More practically, you can just write a GlobalAlloc that forwards to an instance of a thread-local allocator. Though that runs the risk of thread-local deinitialization running in an unexpected order.)

Any uc worth programming on has interrupts!
Even with interrupts there is almost always a single execution unit, and thus single-threaded. An interrupt is just a jump instruction to a predefined handler. The code that was running is entirely stopped until the system decides to resume from that point.
That doesn't mean that naive single-threaded code is necessarily interrupt-safe. If the allocator is interrupted and the interrupt service routine needs to make use of the allocator's data structures for any reason there could very much be a conflict.

This particular algorithm is focused on GPUs, so I'm not clear on the applicability. However I did want to clear up the idea that there are no concurrency concerns for single-threaded library code, an assumption that's been the source of many bugs.

The presence of interrupts is basically equivalent to not being single threaded, though. There are true single-threaded environments, it's just limited to userspace processes that don't handle signals.
> It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

Well, another solution is enlarging each allocation by 8 bytes and storing the buffer handle in front of the returned allocation. So `free(ptr)` would get the handle as `*((ptr as *const u64) - 1)`.

That works but...

This kind of allocators are usually used for suballocating GPU buffers, so hiding a few bytes of metadata "in-band" can mess up your alignment requirements and not all kinds of GPU memory are even accessible by CPU. Due to false sharing cache problems you would probably want a full cache line (64 bytes) to store the metadata.

For a CPU-only memory allocator your idea could work quite well. It can also be implemented on top of this code without any modifications.

OTOH, I think this would be a good fit for the "Store" proposal [1] which uses handles rather than addresses to refer to allocations.

[1] https://github.com/matthieu-m/storage/blob/main/etc/rfc.md

> It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

Not necessarily. If you are able to map a block of normal memory in a known location relative to the GPU memory that you are managing with an "offset allocator", it should be possible to directly calculate the metadata location for each "offset". This is how most allocators find arenas/buckets (whatever you want to call them) for an allocation by a pointer.

Something like this:

  +-------------------------+ 0x000000 (start of managed memory)
  | Metadata                | 
  |                         |
  |                         |
  | ...                     |
  +-------------------------+ 0x001000 
  | Padding ...             |
  +-------------------------+ 0x010000
  | GPU Memory Block        |
  |                         |
  |                         |
  |                         | ~2MB block
  |                         |
  |                         |
  |                         |
  +-------------------------+ 0x210000
With this layout, to get to a metadata for an allocation, all you need to do is to align down the allocation pointer and calculate the appropriate location in the metadata page.

This obviously won't work in all scenarios, but it's a simple and practical way around a map lookup.

A good idea but not O(1).

If you have a large allocation you have to mark every "page" in metadata table which makes allocation O(size in bytes / page size).

The overhead might still be practically acceptable, even if it is not constant time.

Any big O discussion usually breaks down (or at least, changes) if you start taking into account the data size of the operands.

You generally assume for big O purposes, when analyzing sorts for example, that comparisons of elements are constant time, and that swapping elements is constant time.

On an n-bit computer, when dealing with m-bit elements, those assumptions are broadly sound. Comparing two ints doesn’t depend on the ints. Reading or writing the int from memory doesn’t take more or less time either.

But the same algorithm, while it has the same big O relationship to the size of a collection, might have a different constant factor on different CPUs and for different values of m. And you might find that some algorithms that constant factor’s relationship to m has its own big O.

One common place this can bite you is when you try to apply big O analysis that has ‘constant time comparison’ assumption built in to, say, sorting arbitrarily long strings, or inserting into a hash table using arbitrary string keys. Comparing strings or calculating hashes of them is not constant time.

In a practical sense, how you define big O depends on what you consider to be the inputs to the function. If the runtime doesn't change depending on the size of the inputs that you care about, it's O(1).

Like, you might have a function with 2 inputs, N and M, and the run time is like O(m2^n), but n is fixed at a low number every time you'll run it in practice, so it's really just O(m) for your purposes.

Right. O(f(n)) is literally only defined for situations where n 1: varies between different runs of the algorithm, and 2: can grow arbitrarily large. Even though in practice ‘arbitrarily large’ is always limited by memory, storage, etc.

Talking about algorithms being O(n) in the number of bits in a value is only reasonable if the number of bits in the value actually varies between runs.

In all x86 this is an instruction that typically takes 3 clock cycles (latency) with 1 clock cycle throughput. That’s as constant time as you get.

Even if that were false, it’s O(k) where k is the number of bits, which is constant. However, in that case it might be slower than the alternatives.

Integer addition, and indeed pretty much all operations, are also a loops over all bits, LZCNT is in no way at all unique here.