It is not at all. That's like someone looking at perfect weather, blowing up an air mattress on the beach and wondering why anyone would need a house.

You are only on the 'fast path' when a single allocation of an array would have been much faster anyway. It isn't difficult to be 'fast' when accomplishing something trivial that would be even faster if done directly.

Why do you keep saying the same thing while ignoring freeing memory, different lifetimes and wildly different sizes of allocations (which are why memory allocators actually exist)?

This idea that memory allocation is cheap is a blight on performance. Are you really allocating memory inside your hot loops and having it not show up when you profile?

> You are only on the 'fast path'

Isn't that what I've said all along? I said 'amortised' in my very first comment. Most allocations are from the fast path... that's why they bother to make it fast.

> when a single allocation of an array would have been much faster anyway

TLAB allocation is heterogeneous (which solves your size question.) An array isn't. TLAB can be evacuated and fragmentation-free (which solves your lifetime problem.) An array can't.
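For readers unfamiliar with the term, the fast path being discussed can be sketched as a bump-pointer allocator. This is a toy model only: the class name `ToyTlab`, its byte-offset bookkeeping, and the `nil` return for a full buffer are assumptions made for illustration, not how any particular VM implements its thread-local allocation buffers.

```ruby
# Toy model of a thread-local allocation buffer (TLAB): a contiguous region
# with a "top" offset that is bumped forward on every allocation.
# Hypothetical illustration, not a real VM data structure.
class ToyTlab
  attr_reader :top

  def initialize(size_bytes)
    @limit = size_bytes
    @top = 0
  end

  # Fast path: one addition and one bounds check. Returns the offset of the
  # new object, or nil when the buffer is full (the slow path is not modelled).
  def allocate(size_bytes)
    new_top = @top + size_bytes
    return nil if new_top > @limit

    offset = @top
    @top = new_top
    offset
  end
end

tlab = ToyTlab.new(64)
# Differently sized objects are packed back to back in the same buffer.
offsets = [16, 24, 8].map { |size| tlab.allocate(size) }
puts offsets.inspect
```

Objects of different sizes sit next to each other in one region, which is the "heterogeneous" property mentioned above. Evacuation and the slow path taken when the buffer fills up are outside this sketch.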

> Why do you keep saying the same thing while ignoring freeing memory

I said 'Deallocation though - yes that's slow!' in my very first comment.

Your original comment was trying to argue that allocations aren't a big performance problem. Anyone with experience optimizing knows this is not true and that minimizing allocations is the first thing to look at because it will most likely have the largest payoff with the least effort. Incorrect preconceptions and misinformation don't help people.

Saying 'allocations are fast' because you can make unnecessary allocations cheaper while ignoring deallocation is basically a lie, it's playing some sort of language game while telling people the opposite of what is actually true.

> Most allocations are from the fast path... that's why they bother to make it fast.

This is where you have a deep misconception. Doing millions of allocations that could be one allocation is not fast. If you want to add 1000000 to a variable do you loop through it one million times and increment it by 1 every time?

> TLAB allocation is heterogeneous (which solves your size question.) An array isn't. TLAB can be evacuated and fragmentation-free (which solves your lifetime problem.) An array can't.

That's like building a skyscraper out of legos because they fit together so well. It's nonsensical, especially due to pages and memory mapping.

> I said 'Deallocation though - yes that's slow!' in my very first comment.

Do you have a house with a kitchen and no bathroom? The discussion is that excessive memory allocation is a huge factor in performance. Trying to play language games to ignore the entire cycle doesn't change that and misleads people. Allocated memory needs to be deallocated. Lots of tiny allocations of a few bytes each are a mistake. They cause huge performance problems and should be obviously unnecessary. You wouldn't pick one piece of cereal out of the box, you would pour it in.

Do you profile your programs? Do you work on optimizing them? I have a hard time believing you are doing something performance sensitive while trying to rationalize massive memory allocation waste.

> This is where you have a deep misconception. Doing millions of allocations that could be one allocation is not fast. If you want to add 1000000 to a variable do you loop through it one million times and increment it by 1 every time?

Great example... because guess what happens with TLAB allocation during optimisation? It will do exactly what you say and combine multiple allocations into a simple bump, like it would with any other arithmetic!
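As a rough illustration of what "combining multiple allocations into a simple bump" means, the two functions below model the allocation pointer as a plain integer. The function names are hypothetical and the buffer-limit check is omitted; this shows the arithmetic only, not the output of any real compiler.

```ruby
# Two separate bumps, as the allocations appear in the source program.
def bump_twice(top, size_a, size_b)
  obj_a = top
  top += size_a
  obj_b = top
  top += size_b
  [obj_a, obj_b, top]
end

# The same two allocations after the sizes are folded into a single bump:
# the pointer is advanced once, and the second object sits at a fixed offset.
def bump_once(top, size_a, size_b)
  obj_a = top
  obj_b = top + size_a
  [obj_a, obj_b, top + size_a + size_b]
end

puts bump_twice(0, 16, 24).inspect
puts bump_once(0, 16, 24).inspect
```

Both versions place the objects at the same addresses and leave the pointer in the same place. When the sizes are compile-time constants, their sum is a constant too, so the folded form needs only one addition.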

> Anyone with experience optimizing knows this is not true ... Do you profile your programs? Do you work on optimizing them? I have a hard time believing you are doing something performance sensitive

I'm not going to keep arguing on this forever. But I have a PhD in language implementation and optimisation and I've been working professionally and publishing in the field for almost a decade. I'd be surprised to find out I have deep misconceptions on the topic.

> Great example... because guess what happens with TLAB allocation during optimisation? It will do exactly what you say and combine multiple allocations into a simple bump, like it would with any other arithmetic!

Before you were saying it isn't a problem at all, now you are saying it isn't a problem because a single Java allocator tries to work around it? That's a bandaid over something that you are trying to say isn't even a problem.

> I'm not going to keep arguing on this forever. But I have a PhD in language implementation and optimisation and I've been working professionally and publishing in the field for almost a decade. I'd be surprised to find out I have deep misconceptions on the topic.

But are you profiling and optimizing actual software? That's the real question, because as I keep saying, anyone optimizing software knows that lots of small allocations are the first thing to look for. You still haven't addressed this, even though the whole thread is about decreasing allocations for optimization and you seem to be saying it isn't a problem, which is misinformation.

> Before you were saying it isn't a problem at all

It isn't a problem - I was responding to what you thought was a problem - having to bump multiple times in a loop - that's what you asked me about? The bumps are tiny to start with, but even from that tiny start point they still get collapsed. You said collapsing them was important, so I showed you how that also happens.

> as I keep saying, anyone optimizing software knows that lots of small allocations are the first thing to look for

In some cases I've made code faster by putting allocations back in that other people removed without enough context on how allocation works and based on cargo culting of allocation being simply bad like you're pushing. Sometimes allocation is much better than reusing existing allocated space, due to reasons of spatial and temporal locality, cache obliviousness, escape analysis, publication semantics, coherence and NUMA effects, and more.

kragen 1964 days ago

> Anyone with experience optimizing knows this is not true...This is where you have a deep misconception. …Do you profile your programs?

There's Dunning–Kruger, and then there's Dunning–Kruger of the level "telling a guy whose Ph.D. dissertation, Specialising Dynamic Techniques for Implementing The Ruby Programming Language, is about code optimization on GraalVM, that he has a deep misconception, and questioning whether he has any experience optimizing".

I repeat my complaint about "the climate of boastful intellectual vacuity this site fosters."

You realize this person that you are calling an expert claimed that they optimize software by putting tiny memory allocations into their tight loops right?

I'm not really sure how anything I'm saying is even controversial unless someone is desperate for it to be.

Anyone who has profiled and optimized software has experience weeding out excessive memory allocations since it is almost always the lowest hanging fruit.

No matter how fast allocating a few bytes is, doing what you are ultimately trying to do with those bytes is much faster.

Allocating 8 bytes in 150 cycles might seem fast, until you realize that modern CPUs can deal with multiple integers or floats on every single cycle.

A 12 year old CPU can allocate space for, and add together well over 850 million floats to 850 million other floats in a single second on a single core. You can download ISPC and verify this for yourself. By your own numbers, the allocation alone would take about a minute and a half.

Neither of you has confronted this. I'm actually fascinated by the lengths you both have gone to specifically to avoid confronting this. Saying that lots of small allocations have no impact on speed is counter to basically all optimization advice and here I have explained why that is.

> You realize this person that you are calling an expert claimed that they optimize software by putting tiny memory allocations into their tight loops right?

You're either misunderstanding, or pretending to misunderstand for some reason.

What I said was that allocating fresh objects and using those can be faster than re-using stale objects in some failed attempt to optimise by reducing allocations.

Why would that be? For the reasons I explained: The newly allocated objects are guaranteed to already be in cache. Each new object is guaranteed to be close to the last object you used, because they're allocated next to each other. The new objects are not going to need any memory barriers, because they're guaranteed to not be published. The new objects are less likely to escape, so they're eligible for scalar replacement.

You dismissed all that as 'throwing out terminology'.

Here's a practical example:

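  # Requires the benchmark-ips gem.
  require 'benchmark/ips'

  # Clamp by sorting [min, max, value] in a freshly allocated array
  # and taking the middle element.
  def clamp_fresh(min, max, value)
    fresh_array = Array.new
    fresh_array[0] = min
    fresh_array[1] = max
    fresh_array[2] = value
    fresh_array.sort!
    fresh_array[1]
  end

  # Same computation, but re-using a caller-supplied array instead of
  # allocating a new one on each call.
  def clamp_cached(cached_array, min, max, value)
    cached_array[0] = min
    cached_array[1] = max
    cached_array[2] = value
    cached_array.sort!
    cached_array[1]
  end

  # Allocated once, outside the benchmark loop.
  cached_array = Array.new

  Benchmark.ips do |x|
    x.report("use-fresh-objects") { clamp_fresh(10, 90, rand(0..100)) }
    x.report("use-cached-objects") { clamp_cached(cached_array, 10, 90, rand(0..100)) }
    x.compare!
  end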

Which would you think is faster? The one that allocates a new object each iteration of the inner loop? Or the one that re-uses an existing object each time and doesn't allocate anything?

It's actually the one that allocates a new object each time. The cached one is 1.6x slower in an optimising implementation of Ruby.

It's faster... but the only change I made was I added an object allocation instead of the custom object caching. I went from not allocating any objects to allocating an object and it became faster. This example is so clear because of the last factor I mentioned - scalar replacement.
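To make the scalar replacement point concrete, here is a hand-written sketch of roughly what an optimising compiler is free to turn the fresh-array version into once it can prove the array never escapes: the three elements become local variables and the sort becomes a few compare-and-swap steps. The method name `clamp_scalar_replaced` is hypothetical and this is an illustration of the effect, not the actual output of any Ruby implementation.

```ruby
# Hand-written equivalent of clamping with a three-element sort, with the
# array "replaced" by three locals. No array object is allocated.
def clamp_scalar_replaced(min, max, value)
  a, b, c = min, max, value
  # Three-element sorting network: afterwards a <= b <= c.
  a, b = b, a if a > b
  b, c = c, b if b > c
  a, b = b, a if a > b
  b # the middle element, same as the array version's index 1 after sorting
end

puts clamp_scalar_replaced(10, 90, 50)
puts clamp_scalar_replaced(10, 90, 5)
puts clamp_scalar_replaced(10, 90, 95)
```

The cached-array version cannot be rewritten this way as easily, because the array is visible outside the method and its contents have to be kept up to date in memory.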

If you came along and 'optimised' my code based on a cargo cult idea of 'object allocation disastrously slow' you wouldn't be helping would you?

kragen 1964 days ago

> You realize this person that you are calling an expert

I didn't claim he's an expert, but he did write TruffleRuby, which is still about twice as fast as any other implementation of Ruby six years later: https://pragtob.wordpress.com/2020/08/24/the-great-rubykon-b.... So I think it's safe to say that he's at least minimally competent at performance engineering :)

> I'm not really sure how anything I'm saying is even controversial

You're applying rules of thumb you've learned in one context in a context where they are invalid, and then you're accusing people who disagree with you of being ignorant or dishonest, even when it would take you literally 30 seconds to verify that what we're saying is correct, and where one of us literally has a doctorate in the specific topic we're discussing.

> Allocating 8 bytes in 150 cycles might seem fast

150 cycles doesn't sound terribly fast to me, though it'd be pretty good for malloc/free under most circumstances. I presented benchmark results from my 9-year-old laptop where LuaJIT did an allocation every 120 clock cycles, SBCL did an allocation every 18 cycles, and OCaml did an allocation every 9.5 cycles. You can easily replicate those results on your own machine. And that isn't the time for the allocation alone—it includes the entire loop containing the allocation, which also initializes the memory, and also the garbage-collection time needed to deallocate the space allocated. (And, in the OCaml case, it includes a recursive loop.)

chrisseaton says that in modern VMs like GraalVM an allocation is about 5 instructions, which I'm guessing is about 3 cycles.

(Incidentally on my laptop with glibc 2.2.5 malloc(16) and free() in a loop only takes about 18 ns, about 50 cycles, 55 million allocations per second. It took a little effort to keep GCC from optimizing away the loop. I wasn't satisfied until I'd single-stepped GDB into __GI___libc_malloc!)

120 cycles is less than 150 cycles. 18 cycles is a lot less than 150 cycles. 9.5 cycles is extremely much smaller than 150 cycles. 3 cycles, which I haven't verified but which sounds plausible, is smaller still.
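The cycle counts can be converted to nanoseconds for comparison. A small sketch, assuming the clock rate implied by the malloc measurement above (18 ns for about 50 cycles, so roughly 2.8 GHz); that rate is inferred from the comment rather than measured, and the constant and helper names are illustrative only.

```ruby
# Cycles per nanosecond (i.e. GHz), inferred from "18 ns, about 50 cycles".
# This is an assumption derived from the comment, not a measured value.
CYCLES_PER_NS = 50 / 18.0

def cycles_to_ns(cycles)
  cycles / CYCLES_PER_NS
end

{
  "LuaJIT" => 120,
  "SBCL" => 18,
  "OCaml" => 9.5,
  "quoted 150-cycle figure" => 150
}.each do |name, cycles|
  puts format("%-24s %6.1f cycles = %5.1f ns", name, cycles, cycles_to_ns(cycles))
end
```

At that clock rate, 120 cycles is about 43 ns and 9.5 cycles is about 3.4 ns, while 150 cycles is about 54 ns.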

> A 12 year old CPU can ... add together well over 850 million floats to 850 million other floats in a single second on a single core ...By your own numbers, the allocation alone would take about a minute and a half.

You're completely wrong about "By your own numbers."

It's surely true that you aren't going to get SIMD-like performance by chasing pointers down a singly-linked list of floating-point numbers (is that what you're suggesting?) but not because you can only allocate 10 million list nodes per second; it's because you'll trash your cache with all those worthless cdr pointers, the CPU can't prefetch, and every cache miss costs you a 100-ns stall, which often ties up your entire socket's memory bus, as I specifically pointed out in https://news.ycombinator.com/item?id=26438596. It's true that if you use malloc you'd need a minute or two (and 30 gigabytes!) to allocate 850 million such nodes, or maybe three minutes and 60 gigabytes to allocate two such lists. But if you use the kind of allocator we're talking about in this thread, you can do that much allocation in a few seconds rather than a few minutes, although it's still not a good way to represent your matrices and vectors.

> Neither of you has confronted this.

Well, no, why would we? It's just a silly error you made. Before you made it there was no way to confront it.

> Saying that lots of small allocations have no impact on speed is counter to basically all optimization advice

Allocations aren't zero-cost, even when multiple allocations in a basic block get combined as chrisseaton says GraalVM does, unlike any of the three systems I presented measurements from; at a minimum, allocating more requires the GC to run more often, and you need to store a pointer somewhere that points at the separate allocation. (Though maybe not for very long.) As I mentioned in my other comment, I've written a popular essay exploring this topic at http://canonical.org/~kragen/memory-models.

But the time required to allocate a few bytes can vary over two or three orders of magnitude, from the 3.4 ns I got with OCaml (or maybe 1 ns if several allocations get combined), up to the 43 ns I got with LuaJIT, up to the 100 ns malloc/free typically take, up to the 200 ns we're suffering in Hammer, up to the maybe 1000 ns you might get out of a bad implementation of malloc/free, up to maybe several microseconds on a PIC16 or something.
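To relate these per-allocation costs to the 850 million allocations discussed earlier, a back-of-the-envelope sketch follows. It simply multiplies the count by the per-allocation cost and ignores cache effects and memory capacity; the constant and helper names are illustrative only.

```ruby
ALLOCATIONS = 850_000_000

# Per-allocation costs in nanoseconds, taken from the figures quoted above.
COST_NS = {
  "OCaml" => 3.4,
  "LuaJIT" => 43.0,
  "malloc/free (typical)" => 100.0
}

def total_seconds(count, cost_ns)
  count * cost_ns / 1e9
end

COST_NS.each do |name, cost_ns|
  puts format("%-22s %6.1f s", name, total_seconds(ALLOCATIONS, cost_ns))
end
```

That works out to roughly 3 seconds at the OCaml figure, roughly 37 seconds at the LuaJIT figure, and 85 seconds at a typical malloc/free cost, which is where "a few seconds rather than a few minutes" comes from.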

If you're trading off a 5-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should probably use the allocation, although questions like locality of reference down the road might tip the balance the other way. If you're trading off a 100-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should take the cache miss and remove the allocation. If you're trading off a 7-nanosecond small allocation in SBCL against triggering a 10000-nanosecond write-barrier by way of SBCL's SIGSEGV handler, you should definitely take the small allocation.

So, whether lots of small allocations speed your code up or slow it down depends a lot on how much small allocations cost, which depends on the system you're running on top of and the tradeoffs it's designed for.

You're familiar with a particular set of architectural tradeoffs which make small allocations pretty expensive. That's fine, and those tradeoffs might even be in some sense the objectively best choice; certainly they're the best for some range of applications. But you're mistakenly assuming that those tradeoffs are universal, despite overwhelming evidence to the contrary, going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.