| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by kragen 1964 days ago

I guess you aren't trolling; you're just confusing the part of programming that you know about with the whole field. But there are more things in heaven and earth, Horatio, than are dreamt of in your philosophy… and your epistemic hygiene is sorely lacking, so maybe "misosophy" is more appropriate.

You said, "If you allocate a a few bytes at a time, it will top out in the ballpark of 10 million per second per core." In my link above, I demonstrated a one-line program which, allocating a few bytes at a time, tops out at 150 million allocations per second per core; if the memory stays in use long enough to survive a minor GC, that drops to 100 million allocations per second per core. It's using the same allocator SBCL uses for everything (except large blocks), and these performance numbers include the time for deallocation. It takes into account all of the things that make memory allocation problematic for performance.† But it does an order of magnitude more allocations per second than you're saying is possible. If you were interested in whether the things you were saying were true or not, you would have tested these claims yourself instead of continuing to post bullshit; SBCL is free software, so they are easy to test.

Even LuaJIT 2.0.5 on this same laptop manages 43 ns per allocation, 23 million per second:

    function nlist(n)
        local rv = nil
        for i = 1, n do
            rv = {i, rv}
        end
        return rv
    end

    function mnlist(m, n)
        print('m='..m..' n='..n)
        for i = 1, m do nlist(n) end
    end

    mnlist(500000, 2000)

SBCL isn't the paragon here. OCaml (3.12.1, ocamlopt -g) manages 3.4 ns per allocation (290 million allocations per second). Compiling this code

    let rec nlist n = if n = 0 then [] else n::nlist (n-1)
    let rec mnlist m n = if m = 0 then [] else (ignore (nlist n); mnlist (m-1) n)
    let m = 2000*1000 and n = 500 ;;

    print_endline ("m=" ^ (string_of_int m) ^ " n=" ^ (string_of_int n)) ;
    mnlist m n

I get a program that executes in 3.4 seconds. The machine code for `nlist` looks like this:

       0x0000000000403530 <+0>:     sub    $0x8,%rsp
       0x0000000000403534 <+4>:     cmp    $0x1,%rax
       0x0000000000403538 <+8>:     jne    0x403548 <camlTimealloc__nlist_1030+24>
       0x000000000040353a <+10>:    mov    $0x1,%rax
       0x0000000000403541 <+17>:    add    $0x8,%rsp
       0x0000000000403545 <+21>:    retq   
       0x0000000000403546 <+22>:    xchg   %ax,%ax
       0x0000000000403548 <+24>:    mov    %rax,(%rsp)
       0x000000000040354c <+28>:    add    $0xfffffffffffffffe,%rax
       0x0000000000403550 <+32>:    callq  0x403530 <camlTimealloc__nlist_1030>
    => 0x0000000000403555 <+37>:    mov    %rax,%rdi
       0x0000000000403558 <+40>:    sub    $0x18,%r15
       0x000000000040355c <+44>:    mov    0x2177fd(%rip),%rax        # 0x61ad60
       0x0000000000403563 <+51>:    cmp    (%rax),%r15
       0x0000000000403566 <+54>:    jb     0x403584 <camlTimealloc__nlist_1030+84>
       0x0000000000403568 <+56>:    lea    0x8(%r15),%rax
       0x000000000040356c <+60>:    movq   $0x800,-0x8(%rax)
       0x0000000000403574 <+68>:    mov    (%rsp),%rbx
       0x0000000000403578 <+72>:    mov    %rbx,(%rax)
       0x000000000040357b <+75>:    mov    %rdi,0x8(%rax)
       0x000000000040357f <+79>:    add    $0x8,%rsp
       0x0000000000403583 <+83>:    retq   
       0x0000000000403584 <+84>:    callq  0x411898 <caml_call_gc>
       0x0000000000403589 <+89>:    jmp    0x403558 <camlTimealloc__nlist_1030+40>

As may be apparent, the open-coded allocation here consists of these five instructions; OCaml's allocation bump pointer moves downward rather than upward, and it tags its cons cells with an 8-byte in-memory 0x800 header word instead of a 7 in the low 4 bits of the pointer:

       0x0000000000403558 <+40>:    sub    $0x18,%r15
       0x000000000040355c <+44>:    mov    0x2177fd(%rip),%rax        # 0x61ad60
       0x0000000000403563 <+51>:    cmp    (%rax),%r15
       0x0000000000403566 <+54>:    jb     0x403584 <camlTimealloc__nlist_1030+84>
       0x0000000000403568 <+56>:    lea    0x8(%r15),%rax

It's true, as you say, that the way it works is similar to "adding [small numbers] to the size of a C++ vector with a large capacity on each iteration of a loop." (It's not "adding 1" because allocations of all sizes are served from the same nursery; interspersing different open-coded allocation sizes affects performance only a little.) But just doing that doesn't save you from writing a garbage collector and implementing write barriers, or alternatively doing MLKit-style static reasoning about lifetimes the way you do to allocate things on a per-frame heap in C++.

It's also true that, as you say, "memory allocation is often huge low hanging fruit for optimization." You aren't going to get this kind of performance out of HotSpot or a malloc implementation, not even mimalloc. So if you're using one of those systems, memory allocation is an order of magnitude more expensive. And, if you're concerned about worst-case performance—latency, rather than throughput, as HFT people and AAA game programmers are—probably no kind of garbage collection is a good idea, and you may even need to avoid allocation altogether, although recent versions of HotSpot make some remarkable claims about worst-case latency, claims which may be true for all I know.

Of course, even when it comes to throughput, there is no free lunch. An allocator design so heavily optimized for fast allocation necessarily makes mutation more expensive—SBCL notoriously uses segfaults for its write barrier so that writes into the nursery are as fast as possible, a cost that would be intolerable for more mutation-oriented languages like Java or C++; and heavy mutation is generally a necessary evil if latency is important. (Also, I think there are algorithms, especially in numerical computation, where the best known mutation-based algorithms have a logarithmic speedup over the best known pure functional algorithms.)

You can find a more detailed discussion of some of the issues in https://archive.fo/itW87 (Martin Cracauer's comparison of implementing memory allocation in LLVM and SBCL, including years of experience running SBCL in production and extensive discussion of the latency–throughput tradeoff I touch on above) and the mimalloc technical report, https://www.microsoft.com/en-us/research/publication/mimallo.... The mimalloc report, among other things explains how they found that, for mimalloc, BBN-LISP-style per-page free-lists (Bobrow & Murphy 1966, AD647601, AFCRL-66-774) were faster than pointer-bumping allocation!

______

† This build of SBCL does have multithreading enabled, and the allocation benchmark takes the same amount of time running in a separate thread, but on my machine it doesn't get a very good speedup if run in multiple threads, presumably due to some kind of allocator contention. (I know that SBCL's GC has to stop all mutator threads during a collection, but the original allocation benchmark here only spends 5% of its time in the GC, so this isn't enough to explain the observed multithreading slowdown, where one thread takes 6.5 ns per allocation, while two threads each take 9.2, three take 17, and four take 29, so although this is a 4-core CPU, at that point multithreading is imposing a slowdown rather than providing a speedup. I suspect that this version of SBCL isn't providing a per-thread nursery the way modern VMs do, so allocation-heavy code like this gets serialized.)