| HN Mirror

> This is where you have a deep misconception. Doing millions of allocations that could be one allocation is not fast. If you want to add 1000000 to a variable do you loop through it one million times and increment it by 1 every time?

Great example... because guess what happens with TLAB allocation during optimisation? It will do exactly what you say and combine multiple allocations into a simple bump, like it would with any other arithmetic!

> Anyone with experience optimizing knows this is not true ... Do you profile your programs? Do you work on optimizing them? I have a hard time believing you are doing something performance sensitive

I'm not going to keep arguing on this forever. But I have a PhD in language implementation and optimisation and I've be working professionally and publishing in the field for almost a decade. I'd be surprised to find out I have deep misconceptions on the topic.

> Great example... because guess what happens with TLAB allocation during optimisation? It will do exactly what you say and combine multiple allocations into a simple bump, like it would with any other arithmetic!

Before you were saying it isn't a problem at all, now you are saying it isn't a problem because a single java allocator tries to work around it? That's a bandaid over something that you are trying to say isn't even a problem.

> I'm not going to keep arguing on this forever. But I have a PhD in language implementation and optimisation and I've be working professionally and publishing in the field for almost a decade. I'd be surprised to find out I have deep misconceptions on the topic.

But are you profiling and optimizing actual software? That's the real question, because as I keep saying, anyone optimizing software knows that lots of small allocations are the first thing to look for. You still haven't addressed this, even though the whole thread is about decreasing allocations for optimization and you seem to be saying it isn't a problem, which is misinformation.

> Before you were saying it isn't a problem at all

It isn't a problem - I was responding to what you thought was a problem - having to bump multiple times in loop - that's what you asked me about? The bumps are tiny to start with, but even from that tiny start point they still get collapsed. You said collapsing them was important, so I showed you how that also happens.

> as I keep saying, anyone optimizing software knows that lots of small allocations are the first thing to look for

In some cases I've made code faster by putting allocations back in that other people removed without enough context on how allocation works and based on cargo culting of allocation being simply bad like you're pushing. Sometimes allocation is much better than reusing existing allocated space, due to reasons of spatial and temporal locality, cache obliviousness, escape analysis, publication semantics, coherence and NUMA effects, and more.

To be clear, you specifically brought up an optimization that you are now saying was done for no reason since it gains nothing.

Now you are trying to say that you actually optimize programs by taking minimal memory allocation and instead using millions of tiny allocations of a few bytes each? Do I have this right? Feel free to link that example.

> Sometimes allocation is much better than reusing existing allocated space, due to reasons of spatial and temporal locality, cache obliviousness, escape analysis, publication semantics, coherence and NUMA effects, and more.

Are we to the part where you are just throwing out terminology?

Are you really digging in this hard? Virtually everything written about optimizing memory allocations is about doing it less.

Even this article is about decreasing memory allocations for performance. You made a claim that is very unsound in a pragmatic sense and anyone with experience would recognize that. You can just own up to that instead of going into a territory of wild claims and diversions.

> You can just own up to that instead of going into a territory of wild claims and diversions.

Mate I don’t know why you’ve gotten so worked up and aggressive about this.

But if you think I don’t know what I’m talking about despite all my work and publications on the subject then feel free to ignore me and go about your day happily.

No one is being worked up or aggressive, but I suppose this is the 'focus on how it was said' part.

To recap you started with a bold claim and instead of confronting or explaining the large holes you repeated the same thing more forcefully, gish galloped onto irrelevant issues, directly contradicted yourself with your own example, made an even greater claim about more allocations being faster, then moved on to calling me "aggressive".

Why not just confront the actual topic and the actual points I made? Why work so hard to avoid them?

kragen 1922 days ago

> Anyone with experience optimizing knows this is not true...This is where you have a deep misconception. …Do you profile your programs?

There's Dunning–Kruger, and then there's Dunning–Kruger of the level "telling a guy whose Ph.D. dissertation, Specialising Dynamic Techniques for Implementing The Ruby Programming Language, is about code optimization on GraalVM, that he has a deep misconception, and questioning whether he has any experience optimizing".

I repeat my complaint about "the climate of boastful intellectual vacuity this site fosters."

You realize this person that you are calling an expert claimed that they optimize software by putting tiny memory allocations into their tight loops right?

I'm not really sure how anything I'm saying is even controversial unless someone is desperate for it to be.

Anyone who has profiled and optimized software has experience weeding out excessive memory allocations since it is almost always the lowest hanging fruit.

No matter how fast allocating a few bytes is, doing what you are ultimately trying to do with those bytes is much faster.

Allocating 8 bytes in 150 cycles might seem fast, until you realize that modern CPUs can deal with multiple integers or floats on every single cycle.

A 12 year old CPU can allocate space for, and add together well over 850 million floats to 850 million other floats in a single second on a single core. You can download ISPC and verify this for yourself. By your own numbers, the allocation alone would take about a minute and a half.

Neither of you has confronted this. I'm actually fascinated by the lengths you both have gone to specifically to avoid confronting this. Saying that lots of small allocations has no impact on speed is counter to basically all optimization advice and here I have explained why that is.

> You realize this person that you are calling an expert claimed that they optimize software by putting tiny memory allocations into their tight loops right?

You're either misunderstanding, or pretending to misunderstand for some reason.

What I said was that allocating fresh objects and using those can be faster than re-using stale objects in some failed attempt to optimise by reducing allocations.

Why would that be? For the reasons I explained: The newly allocated objects are guaranteed to already be in cache. Each new object is guaranteed to be close to the last object you used, because they're allocated next to each other. The new objects are not going to need any memory barriers, because they're guaranteed to not be published. The new objects are less likely to escape, so they're eligible for scalar replacement.

You dismissed all that as 'throwing out terminology'.

Here's a practical example:

  require 'benchmark/ips'

  def clamp_fresh(min, max, value)
    fresh_array = Array.new
    fresh_array[0] = min
    fresh_array[1] = max
    fresh_array[2] = value
    fresh_array.sort!
    fresh_array[1]
  end

  def clamp_cached(cached_array, min, max, value)
    cached_array[0] = min
    cached_array[1] = max
    cached_array[2] = value
    cached_array.sort!
    cached_array[1]
  end

  cached_array = Array.new

  Benchmark.ips do |x|
    x.report("use-fresh-objects") { clamp_fresh(10, 90, rand(0..100)) }
    x.report("use-cached-objects") { clamp_cached(cached_array, 10, 90, rand(0..100)) }
    x.compare!
  end

Which would you think is faster? The one that allocates a new object each iteration of the inner loop? Or the one that re-uses an existing object each time and doesn't allocate anything?

It's actually the one that allocates a new object each time. The cached one is 1.6x slower in an optimising implementation of Ruby.

It's faster... but the only change I made was I added an object allocation instead of the custom object caching. I went from not allocating any objects to allocating an object and it became faster. This example is so clear because of the last factor I mentioned - scalar replacement.

If you came along and 'optimised' my code based on a cargo cult idea of 'object allocation disastrously slow' you wouldn't be helping would you?

kragen 1922 days ago

> You realize this person that you are calling an expert

I didn't claim he's an expert, but he did write TruffleRuby, which is still about twice as fast as any other implementation of Ruby six years later: https://pragtob.wordpress.com/2020/08/24/the-great-rubykon-b.... So I think it's safe to say that he's at least minimally competent at performance engineering :)

> I'm not really sure how anything I'm saying is even controversial

You're applying rules of thumb you've learned in one context in a context where they are invalid, and then you're accusing people who disagree with you of being ignorant or dishonest, even when it would take you literally 30 seconds to verify that what we're saying is correct, and where one of us literally has a doctorate in the specific topic we're discussing.

> Allocating 8 bytes in 150 cycles might seem fast

150 cycles doesn't sound terribly fast to me, though it'd be pretty good for malloc/free under most circumstances. I presented benchmark results from my 9-year-old laptop where LuaJIT did an allocation every 120 clock cycles, SBCL did an allocation every 18 cycles, and OCaml did an allocation every 9.5 cycles. You can easily replicate those results on your own machine. And that isn't the time for the allocation alone—it includes the entire loop containing the allocation, which also initializes the memory, and also the garbage-collection time needed to deallocate the space allocated. (And, in the OCaml case, it includes a recursive loop.)

chrisseaton says that in modern VMs like GraalVM an allocation is about 5 instructions, which I'm guessing is about 3 cycles.

(Incidentally on my laptop with glibc 2.2.5 malloc(16) and free() in a loop only takes about 18 ns, about 50 cycles, 55 million allocations per second. It took a little effort to keep GCC from optimizing away the loop. I wasn't satisfied until I'd single-stepped GDB into __GI___libc_malloc!)

120 cycles is less than 150 cycles. 18 cycles is a lot less than 150 cycles. 9.5 cycles is extremely much smaller than 150 cycles. 3 cycles, which I haven't verified but which sounds plausible, is smaller still.

> A 12 year old CPU can ... add together well over 850 million floats to 850 million other floats in a single second on a single core ...By your own numbers, the allocation alone would take about a minute and a half.

You're completely wrong about "By your own numbers."

It's surely true that you aren't going to get SIMD-like performance by chasing pointers down a singly-linked list of floating-point numbers (is that what you're suggesting?) but not because you can only allocate 10 million list nodes per second; it's because you'll trash your cache with all those worthless cdr pointers, the CPU can't prefetch, and every cache miss costs you a 100-ns stall, which often ties up your entire socket's memory bus, as I specifically pointed out in https://news.ycombinator.com/item?id=26438596. It's true that if you use malloc you'd need a minute or two (and 30 gigabytes!) to allocate 850 million such nodes, or maybe three minutes and 60 gigabytes to allocate two such lists. But if you use the kind of allocator we're talking about in this thread, you can do that much allocation in a few seconds rather than a few minutes, although it's still not a good way to represent your matrices and vectors.

> Neither of you has confronted this.

Well, no, why would we? It's just a silly error you made. Before you made it there was no way to confront it.

> Saying that lots of small allocations has no impact on speed is counter to basically all optimization advice

Allocations aren't zero-cost, even when multiple allocations in a basic block get combined as chrisseaton says GraalVM does, unlike any of the three systems I presented measurements from; at a minimum, allocating more requires the GC to run more often, and you need to store a pointer somewhere that points at the separate allocation. (Though maybe not for very long.) As I mentioned in my other comment, I've written a popular essay exploring this topic at http://canonical.org/~kragen/memory-models.

But the time required to allocate a few bytes can vary over two or three orders of magnitude, from the 3.4 ns I got with OCaml (or maybe 1 ns if several allocations get combined), up to the 43 ns I got with LuaJIT, up to the 100 ns malloc/free typically take, up to the 200 ns we're suffering in Hammer, up to the maybe 1000 ns you might get out of a bad implementation of malloc/free, up to maybe several microseconds on a PIC16 or something.

If you're trading off a 5-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should probably use the allocation, although questions like locality of reference down the road might tip the balance the other way. If you're trading off a 100-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should take the cache miss and remove the allocation. If you're trading off a 7-nanosecond small allocation in SBCL against triggering a 10000-nanosecond write-barrier by way of SBCL's SIGSEGV handler, you should definitely take the small allocation.

So, whether lots of small allocations speed your code up or slow it down depends a lot on how much small allocations cost, which depends on the system you're running on top of and the tradeoffs it's designed for.

You're familiar with a particular set of architectural tradeoffs which make small allocations pretty expensive. That's fine, and those tradeoffs might even be in some sense the objectively best choice; certainly they're the best for some range of applications. But you're mistakenly assuming that those tradeoffs are universal, despite overwhelming evidence to the contrary, going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.

You get so worked up but my point with every message is that excessive memory allocation is a performance killer. Part of the reason is cache misses which you went into here. For all your rabbit holes and straying off topic, it's pretty clear you know a lot more about this than chriseaton.

Ruby runs 50x to 100x slower than native software, let alone well optimized native software. It is not a context that makes sense when talking about absolute performance.

> despite overwhelming evidence to the contrary,

The overwhelming evidence is that tiny memory allocations will always carry more overhead than operating on the data they contain. Excessive memory allocations imply small memory allocations and small memory allocations will have overhead that outweighs their contents. This actually is about as universal on modern CPUs as you can get. This is why allocating a few bytes at a time in a loop dominates performance, which all I've really been saying.

> going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.

Relax, you might learn something

kragen 1908 days ago

> my point with every message is that excessive memory allocation is a performance killer

Yes, it was clear that that was what you were saying, despite clear and convincing evidence that it's true in some contexts and false in others.

> For all your rabbit holes and straying off topic, it's pretty clear you know a lot more about this than chriseaton.

chrisseaton knows more about this topic than I ever will. You, by contrast, evidently don't know enough about it to understand why the things we were bringing up were relevant, miscategorizing them as "rabbit holes and straying off topic".

> Ruby runs 50x to 100x slower than native software, let alone well optimized native software. It is not a context that makes sense when talking about absolute performance.

As anyone who follows the links provided can see, some performance-critical benchmark libraries built with chrisseaton's TruffleRuby run about 1.1× slower than alternatives written in native C, although performance on larger programs is, so far, less impressive. Even the mainstream Ruby implementation MRI is typically only about 30× slower. It's true that ten years ago Ruby had performance overhead of 50× to 100×.

> Relax, you might learn something

Oh, I've learned a lot from this thread. But it wasn't by reading the GPT-2-generated Eliza-bot rants posted under the name "CyberDildonics" with no understanding of the issues; it was by reading chrisseaton's dissertation, writing and running microbenchmarks, and carefully disassembling compiler output.