Hacker News new | ask | show | jobs
by kragen 1921 days ago
> You realize this person that you are calling an expert

I didn't claim he's an expert, but he did write TruffleRuby, which is still about twice as fast as any other implementation of Ruby six years later: https://pragtob.wordpress.com/2020/08/24/the-great-rubykon-b.... So I think it's safe to say that he's at least minimally competent at performance engineering :)

> I'm not really sure how anything I'm saying is even controversial

You're applying rules of thumb you've learned in one context in a context where they are invalid, and then you're accusing people who disagree with you of being ignorant or dishonest, even when it would take you literally 30 seconds to verify that what we're saying is correct, and where one of us literally has a doctorate in the specific topic we're discussing.

> Allocating 8 bytes in 150 cycles might seem fast

150 cycles doesn't sound terribly fast to me, though it'd be pretty good for malloc/free under most circumstances. I presented benchmark results from my 9-year-old laptop where LuaJIT did an allocation every 120 clock cycles, SBCL did an allocation every 18 cycles, and OCaml did an allocation every 9.5 cycles. You can easily replicate those results on your own machine. And that isn't the time for the allocation alone—it includes the entire loop containing the allocation, which also initializes the memory, and also the garbage-collection time needed to deallocate the space allocated. (And, in the OCaml case, it includes a recursive loop.)

chrisseaton says that in modern VMs like GraalVM an allocation is about 5 instructions, which I'm guessing is about 3 cycles.

(Incidentally on my laptop with glibc 2.2.5 malloc(16) and free() in a loop only takes about 18 ns, about 50 cycles, 55 million allocations per second. It took a little effort to keep GCC from optimizing away the loop. I wasn't satisfied until I'd single-stepped GDB into __GI___libc_malloc!)

120 cycles is less than 150 cycles. 18 cycles is a lot less than 150 cycles. 9.5 cycles is extremely much smaller than 150 cycles. 3 cycles, which I haven't verified but which sounds plausible, is smaller still.

> A 12 year old CPU can ... add together well over 850 million floats to 850 million other floats in a single second on a single core ...By your own numbers, the allocation alone would take about a minute and a half.

You're completely wrong about "By your own numbers."

It's surely true that you aren't going to get SIMD-like performance by chasing pointers down a singly-linked list of floating-point numbers (is that what you're suggesting?) but not because you can only allocate 10 million list nodes per second; it's because you'll trash your cache with all those worthless cdr pointers, the CPU can't prefetch, and every cache miss costs you a 100-ns stall, which often ties up your entire socket's memory bus, as I specifically pointed out in https://news.ycombinator.com/item?id=26438596. It's true that if you use malloc you'd need a minute or two (and 30 gigabytes!) to allocate 850 million such nodes, or maybe three minutes and 60 gigabytes to allocate two such lists. But if you use the kind of allocator we're talking about in this thread, you can do that much allocation in a few seconds rather than a few minutes, although it's still not a good way to represent your matrices and vectors.

> Neither of you has confronted this.

Well, no, why would we? It's just a silly error you made. Before you made it there was no way to confront it.

> Saying that lots of small allocations has no impact on speed is counter to basically all optimization advice

Allocations aren't zero-cost, even when multiple allocations in a basic block get combined as chrisseaton says GraalVM does, unlike any of the three systems I presented measurements from; at a minimum, allocating more requires the GC to run more often, and you need to store a pointer somewhere that points at the separate allocation. (Though maybe not for very long.) As I mentioned in my other comment, I've written a popular essay exploring this topic at http://canonical.org/~kragen/memory-models.

But the time required to allocate a few bytes can vary over two or three orders of magnitude, from the 3.4 ns I got with OCaml (or maybe 1 ns if several allocations get combined), up to the 43 ns I got with LuaJIT, up to the 100 ns malloc/free typically take, up to the 200 ns we're suffering in Hammer, up to the maybe 1000 ns you might get out of a bad implementation of malloc/free, up to maybe several microseconds on a PIC16 or something.

If you're trading off a 5-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should probably use the allocation, although questions like locality of reference down the road might tip the balance the other way. If you're trading off a 100-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should take the cache miss and remove the allocation. If you're trading off a 7-nanosecond small allocation in SBCL against triggering a 10000-nanosecond write-barrier by way of SBCL's SIGSEGV handler, you should definitely take the small allocation.

So, whether lots of small allocations speed your code up or slow it down depends a lot on how much small allocations cost, which depends on the system you're running on top of and the tradeoffs it's designed for.

You're familiar with a particular set of architectural tradeoffs which make small allocations pretty expensive. That's fine, and those tradeoffs might even be in some sense the objectively best choice; certainly they're the best for some range of applications. But you're mistakenly assuming that those tradeoffs are universal, despite overwhelming evidence to the contrary, going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.

1 comments

You get so worked up but my point with every message is that excessive memory allocation is a performance killer. Part of the reason is cache misses which you went into here. For all your rabbit holes and straying off topic, it's pretty clear you know a lot more about this than chriseaton.

Ruby runs 50x to 100x slower than native software, let alone well optimized native software. It is not a context that makes sense when talking about absolute performance.

> despite overwhelming evidence to the contrary,

The overwhelming evidence is that tiny memory allocations will always carry more overhead than operating on the data they contain. Excessive memory allocations imply small memory allocations and small memory allocations will have overhead that outweighs their contents. This actually is about as universal on modern CPUs as you can get. This is why allocating a few bytes at a time in a loop dominates performance, which all I've really been saying.

> going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.

Relax, you might learn something

> my point with every message is that excessive memory allocation is a performance killer

Yes, it was clear that that was what you were saying, despite clear and convincing evidence that it's true in some contexts and false in others.

> For all your rabbit holes and straying off topic, it's pretty clear you know a lot more about this than chriseaton.

chrisseaton knows more about this topic than I ever will. You, by contrast, evidently don't know enough about it to understand why the things we were bringing up were relevant, miscategorizing them as "rabbit holes and straying off topic".

> Ruby runs 50x to 100x slower than native software, let alone well optimized native software. It is not a context that makes sense when talking about absolute performance.

As anyone who follows the links provided can see, some performance-critical benchmark libraries built with chrisseaton's TruffleRuby run about 1.1× slower than alternatives written in native C, although performance on larger programs is, so far, less impressive. Even the mainstream Ruby implementation MRI is typically only about 30× slower. It's true that ten years ago Ruby had performance overhead of 50× to 100×.

> Relax, you might learn something

Oh, I've learned a lot from this thread. But it wasn't by reading the GPT-2-generated Eliza-bot rants posted under the name "CyberDildonics" with no understanding of the issues; it was by reading chrisseaton's dissertation, writing and running microbenchmarks, and carefully disassembling compiler output.