| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by CyberDildonics 1965 days ago

There are no situations where it makes sense to say that lots of allocations are cheap and only cost a few instructions.

If lots of allocations are happening without freeing, more memory will have to be mapped in anyway on top of the constant pointer bumping - which means that a single allocation of an array would make sense instead.

If lots of allocations are happening with freeing, 'objects'/allocations will have to be moved around constantly to cater to the 'cheap' bump allocation.

If lots of allocations are happening then all being freed uniformly at the same time, it makes no sense to use lots of allocations since one array allocation would be fine.

Basically, making lots of tiny allocations marginally cheaper is a terrible strategy since it will always be a drag on performance since it is a naive and wasteful way to write software.

1 comments

kragen 1965 days ago

> There are no situations where it makes sense to say that lots of allocations are cheap and only cost a few instructions.

This is so trivially shown to be false that I suspect you are trolling. There is, for example, the situation I showed in https://news.ycombinator.com/item?id=26438596. Moreover, as gsg says, this is very generally true of pointer-bumping allocators in modern GCs; SBCL's GC isn't even all that modern.

link

CyberDildonics 1965 days ago

What you are talking about is the equivalent of adding 1 to the size of a C++ vector with a large capacity on each iteration of a loop.

The memory has already been mapped in (through a blocking system call) and it is contiguous. The "allocation" here is just the same uniform increment in an unbroken span of memory already in the process without taking into anything that makes memory allocation problematic for performance.

There is no point to the situation you described, it is a nonsense way to do something trivial. If that's what you want, allocate a single array.

The fact remains that memory allocation is often huge low hanging fruit for optimization. If it was 'just a few instructions' this would not be true. Garbage collection does not change this. You can read endless threads about 'performance java' people taking about allocating all their memory when their program starts and never inside their loops.

The reasons why have mostly been explained (system calls, heaps, copying in GCs, cache effects, TLB effects, etc.) though people have left out other nuances like mapped pages triggering interrupts on their first write to save time in the memory mapping call.

I think it will probably help to look into this further, thinking memory allocation is bumping a pointer is a simplification that will confuse any optimization until you understand it and profile.

link

chrisseaton 1965 days ago

> There is no point to the situation you described, it is a nonsense way to do something trivial.

I mean... it's how real allocators work in real VMs, so it's not nonsense, is it?

link

CyberDildonics 1965 days ago

I'm not even sure I understand what you are asking, do you think that memory allocation is always bumping a pointer a uniform amount with no memory being freed, no variance in allocation size and no interlaced lifetimes?

You originally said that memory allocation was 'just a few instructions' when the reality is that it is complex and avoiding allocation is the first thing to do when optimizing software.

Saying that memory allocation is simple and fast based on a scenario that should never happen because it's useless is a little silly. No, that is not how allocators work.

link

chrisseaton 1964 days ago

> do you think that memory allocation is always bumping a pointer a uniform amount with no memory being freed, no variance in allocation size and no interlaced lifetimes?

In modern language implementations with moving collectors allocation is literally what I've described in the fast path. Not theoretical. That's literally the instructions used.

> You originally said that memory allocation was 'just a few instructions'

Here it is in a real industrial compiler and garbage collector, for example.

https://github.com/oracle/graal/blob/f9530b0948c58a66845e84f...

It's precisely as I've described.

link

CyberDildonics 1964 days ago

It is not at all. That's like someone looking at perfect weather, blowing up an air mattress on the beach and wondering why anyone would need a house.

You are only on the 'fast path' when a single allocation of an array would have been much faster anyway. It isn't difficult to be 'fast' when accomplishing something trivial that would be even faster if done directly.

Why do you keeping saying the same thing while ignoring freeing memory, different lifetimes and wildly different sizes of allocations (which are why memory allocators actually exist)?

This idea that memory allocation is cheap is a blight on performance. Are you really allocating memory inside your hot loops and having it not show up when you profile?

link

kragen 1964 days ago

I guess you aren't trolling; you're just confusing the part of programming that you know about with the whole field. But there are more things in heaven and earth, Horatio, than are dreamt of in your philosophy… and your epistemic hygiene is sorely lacking, so maybe "misosophy" is more appropriate.

You said, "If you allocate a a few bytes at a time, it will top out in the ballpark of 10 million per second per core." In my link above, I demonstrated a one-line program which, allocating a few bytes at a time, tops out at 150 million allocations per second per core; if the memory stays in use long enough to survive a minor GC, that drops to 100 million allocations per second per core. It's using the same allocator SBCL uses for everything (except large blocks), and these performance numbers include the time for deallocation. It takes into account all of the things that make memory allocation problematic for performance.† But it does an order of magnitude more allocations per second than you're saying is possible. If you were interested in whether the things you were saying were true or not, you would have tested these claims yourself instead of continuing to post bullshit; SBCL is free software, so they are easy to test.

Even LuaJIT 2.0.5 on this same laptop manages 43 ns per allocation, 23 million per second:

    function nlist(n)
        local rv = nil
        for i = 1, n do
            rv = {i, rv}
        end
        return rv
    end

    function mnlist(m, n)
        print('m='..m..' n='..n)
        for i = 1, m do nlist(n) end
    end

    mnlist(500000, 2000)

SBCL isn't the paragon here. OCaml (3.12.1, ocamlopt -g) manages 3.4 ns per allocation (290 million allocations per second). Compiling this code

    let rec nlist n = if n = 0 then [] else n::nlist (n-1)
    let rec mnlist m n = if m = 0 then [] else (ignore (nlist n); mnlist (m-1) n)
    let m = 2000*1000 and n = 500 ;;

    print_endline ("m=" ^ (string_of_int m) ^ " n=" ^ (string_of_int n)) ;
    mnlist m n

I get a program that executes in 3.4 seconds. The machine code for `nlist` looks like this:

       0x0000000000403530 <+0>:     sub    $0x8,%rsp
       0x0000000000403534 <+4>:     cmp    $0x1,%rax
       0x0000000000403538 <+8>:     jne    0x403548 <camlTimealloc__nlist_1030+24>
       0x000000000040353a <+10>:    mov    $0x1,%rax
       0x0000000000403541 <+17>:    add    $0x8,%rsp
       0x0000000000403545 <+21>:    retq   
       0x0000000000403546 <+22>:    xchg   %ax,%ax
       0x0000000000403548 <+24>:    mov    %rax,(%rsp)
       0x000000000040354c <+28>:    add    $0xfffffffffffffffe,%rax
       0x0000000000403550 <+32>:    callq  0x403530 <camlTimealloc__nlist_1030>
    => 0x0000000000403555 <+37>:    mov    %rax,%rdi
       0x0000000000403558 <+40>:    sub    $0x18,%r15
       0x000000000040355c <+44>:    mov    0x2177fd(%rip),%rax        # 0x61ad60
       0x0000000000403563 <+51>:    cmp    (%rax),%r15
       0x0000000000403566 <+54>:    jb     0x403584 <camlTimealloc__nlist_1030+84>
       0x0000000000403568 <+56>:    lea    0x8(%r15),%rax
       0x000000000040356c <+60>:    movq   $0x800,-0x8(%rax)
       0x0000000000403574 <+68>:    mov    (%rsp),%rbx
       0x0000000000403578 <+72>:    mov    %rbx,(%rax)
       0x000000000040357b <+75>:    mov    %rdi,0x8(%rax)
       0x000000000040357f <+79>:    add    $0x8,%rsp
       0x0000000000403583 <+83>:    retq   
       0x0000000000403584 <+84>:    callq  0x411898 <caml_call_gc>
       0x0000000000403589 <+89>:    jmp    0x403558 <camlTimealloc__nlist_1030+40>

As may be apparent, the open-coded allocation here consists of these five instructions; OCaml's allocation bump pointer moves downward rather than upward, and it tags its cons cells with an 8-byte in-memory 0x800 header word instead of a 7 in the low 4 bits of the pointer:

       0x0000000000403558 <+40>:    sub    $0x18,%r15
       0x000000000040355c <+44>:    mov    0x2177fd(%rip),%rax        # 0x61ad60
       0x0000000000403563 <+51>:    cmp    (%rax),%r15
       0x0000000000403566 <+54>:    jb     0x403584 <camlTimealloc__nlist_1030+84>
       0x0000000000403568 <+56>:    lea    0x8(%r15),%rax

It's true, as you say, that the way it works is similar to "adding [small numbers] to the size of a C++ vector with a large capacity on each iteration of a loop." (It's not "adding 1" because allocations of all sizes are served from the same nursery; interspersing different open-coded allocation sizes affects performance only a little.) But just doing that doesn't save you from writing a garbage collector and implementing write barriers, or alternatively doing MLKit-style static reasoning about lifetimes the way you do to allocate things on a per-frame heap in C++.

It's also true that, as you say, "memory allocation is often huge low hanging fruit for optimization." You aren't going to get this kind of performance out of HotSpot or a malloc implementation, not even mimalloc. So if you're using one of those systems, memory allocation is an order of magnitude more expensive. And, if you're concerned about worst-case performance—latency, rather than throughput, as HFT people and AAA game programmers are—probably no kind of garbage collection is a good idea, and you may even need to avoid allocation altogether, although recent versions of HotSpot make some remarkable claims about worst-case latency, claims which may be true for all I know.

Of course, even when it comes to throughput, there is no free lunch. An allocator design so heavily optimized for fast allocation necessarily makes mutation more expensive—SBCL notoriously uses segfaults for its write barrier so that writes into the nursery are as fast as possible, a cost that would be intolerable for more mutation-oriented languages like Java or C++; and heavy mutation is generally a necessary evil if latency is important. (Also, I think there are algorithms, especially in numerical computation, where the best known mutation-based algorithms have a logarithmic speedup over the best known pure functional algorithms.)

You can find a more detailed discussion of some of the issues in https://archive.fo/itW87 (Martin Cracauer's comparison of implementing memory allocation in LLVM and SBCL, including years of experience running SBCL in production and extensive discussion of the latency–throughput tradeoff I touch on above) and the mimalloc technical report, https://www.microsoft.com/en-us/research/publication/mimallo.... The mimalloc report, among other things explains how they found that, for mimalloc, BBN-LISP-style per-page free-lists (Bobrow & Murphy 1966, AD647601, AFCRL-66-774) were faster than pointer-bumping allocation!

______

† This build of SBCL does have multithreading enabled, and the allocation benchmark takes the same amount of time running in a separate thread, but on my machine it doesn't get a very good speedup if run in multiple threads, presumably due to some kind of allocator contention. (I know that SBCL's GC has to stop all mutator threads during a collection, but the original allocation benchmark here only spends 5% of its time in the GC, so this isn't enough to explain the observed multithreading slowdown, where one thread takes 6.5 ns per allocation, while two threads each take 9.2, three take 17, and four take 29, so although this is a 4-core CPU, at that point multithreading is imposing a slowdown rather than providing a speedup. I suspect that this version of SBCL isn't providing a per-thread nursery the way modern VMs do, so allocation-heavy code like this gets serialized.)

link