Hacker News new | ask | show | jobs
by jws 1567 days ago
A very well thought out set of functionality. It covers a lot of ground. In particular:

• The reference counting of the bft_view can simplify reasoning in a C program.

• The cheap bft_view may keep programmers from copying as much data. It's easy to fall into a "copy everything" mode in C since that's what the runtime library provides and consumes. Working with slices reduces your cache pressure and is just generally faster.

• Allocating only blocks sized by a power of two is potentially memory wasting. The exponent for the size has a range of 0...255, but it only needs to get to 2^32 since that's the largest allowed length. I suspect there is some clever base between 1 and 2 which is trivial to compute the exponent and inverse in integer arithmetic. (Or similarly shaped function, it doesn't have to be an exponential, just something that grows about like one.) This would provide better granularity.

• Alternatively, some bits could be harvested from 'cap' to say how the memory should be freed, usually it is 'free', but one could imaging adding a few lines to support 'mmap', then the release becomes 'munmap'.

• Now that we have imagined up mmap capability, this looks like a very nice way to handle many kinds of data files which are not character oriented streams processed sequentially.

3 comments

(author)

> A very well thought out set of functionality

Thanks! It's only an experiment at this stage but I was pleasantly surprised it worked in such a small size.

> slices (...) generally faster

It works wonders. The upcoming bft_split function is gonna be dead fast.

I optimize things by adopting the laziest possible personna : do I really need to copy this ? No. What if the user wants to append stuff to a split part ? We'll see then.

> Allocating only blocks sized by a power of two is potentially memory wasting

It just follows the classical doubling of requested size. In high exponents though I admit it'll be wasteful.

> some clever base between 1 and 2

I wouldn't have thought of it, nor known how to. Would it be somehow costly ? Shifting is basically free.

> support 'mmap'

Interesting. But shouldn't we keep the lib surface minimal and just leave this to the user ? She maps then takes a view on it ?

Hey, an opportunity to shill my favorite ISO standard!

It's 3. ISO 3, preferred numbers: https://en.wikipedia.org/wiki/Preferred_number#Computer_engi...

For your application, I would apply a heuristic which includes the x3 values at some point (256KiB?), and maybe the x5 values near/at 256MiB.

Allocation is so expensive relative to calculating the allocated value that you could use something really complex and it would work. Certainly multiplying by a factor between one and two and rounding to the nearest word would work, there's no way you'd see a performance difference from the calculation.

The advantage of ISO 3 is that the user always gets a familiar number of allocated bytes back.

> > some clever base between 1 and 2

> I wouldn't have thought of it, nor known how to. Would it be somehow costly ? Shifting is basically free.

Multiplying by 1.5 is a shift and an add, so just as cheap.

Multiplying by 1.5 has the advantage that 1 + 1.5 > 1.5 * 1.5, so the second time you realloc it fits into the old memory (which can help reduce memory fragmentation if the allocator takes advantage of it).

> Multiplying by 1.5 has the advantage that 1 + 1.5 > 1.5 * 1.5, so the second time you realloc it fits into the old memory (which can help reduce memory fragmentation if the allocator takes advantage of it).

I’m really struggling to think of any mainstream allocator that can take advantage of this.

The clever base between 1 and 2 you are looking for is the golden ratio. I would use it only for bigger numbers though. under 5k 2 is good enough.
Power of two reallocs are usually there to prevent append in a loop from becoming accidentally quadratic so I guess it's either a bug or a feature depending on if you're going to do that.
As long as there is a non-zero multiplier, it does not matter the size of the constant factor, and it does not even need to be constant. I spent a while recently looking at the shape of this non-typical allocation growth curve, which ended up going with `n₁ = n₀ + 4n₀^(7/8) + n₀/8` using only integer math:

    // compute maxsize = maxsize + 4*maxsize^(7/8) + maxsize/8
    // for small n, we grow faster than O(n)
    // for large n, we grow at O(n/8)
    // and as we reach O(memory) for memory>>1MB,
    // this means we end by adding about 10% of memory each time
And was graphed, discussed, and approved here:

https://github.com/JuliaLang/julia/pull/40453/files#diff-fb6...

Python is similar[1]

>The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...

https://github.com/python/cpython/blob/3.7/Objects/listobjec...

Any multiplier > 1 is enough to keep a "grow by constant each iteration" loop from becoming quadratic.
> I suspect there is some clever base between 1 and 2 which is trivial to compute

Something like this perhaps (although this might be too much granularity):

  (8 + (cap & 7)) << ((cap >> 3) + 1)

  16 18 20 ... 30
  32 36 40 ... 60
  64 72 80 ... 120
  128 ... etc.
I remember seeing a paper that concluded the expansion should not be more than the golden ratio (ϕ = 1.618033988749). Given how widespread the golden ratio, I'd hypothesise that it's probably optimal.

The first three fractions approximating ϕ are: 3/2, 8/5 and 21/13. The first two are quick and easy to compute.

If the multiplier is less than the golden ratio, then the new allocation fits into the memory released from the previous two allocations.
But most allocators will not actually take advantage of that, so it’s a pretty dubious argument.
Good point, I haven't actually checked if any realloc implementations do that.
I thought, oh, you only need two bits for the multiplier, that gives you four multiples of each tier size, that should be enough.

But that means the largest buffer (with a slightly less sophisticated formula than yours) is ((2 << 2) * 2 << (2 <<(8 - 2) - 1)) = 32 exabytes.

Meanwhile, yours has a maximum of 60 GB. So i don't think you have too much granularity!