| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by CyberDildonics 1924 days ago

You realize this person that you are calling an expert claimed that they optimize software by putting tiny memory allocations into their tight loops right?

I'm not really sure how anything I'm saying is even controversial unless someone is desperate for it to be.

Anyone who has profiled and optimized software has experience weeding out excessive memory allocations since it is almost always the lowest hanging fruit.

No matter how fast allocating a few bytes is, doing what you are ultimately trying to do with those bytes is much faster.

Allocating 8 bytes in 150 cycles might seem fast, until you realize that modern CPUs can deal with multiple integers or floats on every single cycle.

A 12 year old CPU can allocate space for, and add together well over 850 million floats to 850 million other floats in a single second on a single core. You can download ISPC and verify this for yourself. By your own numbers, the allocation alone would take about a minute and a half.

Neither of you has confronted this. I'm actually fascinated by the lengths you both have gone to specifically to avoid confronting this. Saying that lots of small allocations has no impact on speed is counter to basically all optimization advice and here I have explained why that is.

2 comments

chrisseaton 1924 days ago

> You realize this person that you are calling an expert claimed that they optimize software by putting tiny memory allocations into their tight loops right?

You're either misunderstanding, or pretending to misunderstand for some reason.

What I said was that allocating fresh objects and using those can be faster than re-using stale objects in some failed attempt to optimise by reducing allocations.

Why would that be? For the reasons I explained: The newly allocated objects are guaranteed to already be in cache. Each new object is guaranteed to be close to the last object you used, because they're allocated next to each other. The new objects are not going to need any memory barriers, because they're guaranteed to not be published. The new objects are less likely to escape, so they're eligible for scalar replacement.

You dismissed all that as 'throwing out terminology'.

Here's a practical example:

  require 'benchmark/ips'

  def clamp_fresh(min, max, value)
    fresh_array = Array.new
    fresh_array[0] = min
    fresh_array[1] = max
    fresh_array[2] = value
    fresh_array.sort!
    fresh_array[1]
  end

  def clamp_cached(cached_array, min, max, value)
    cached_array[0] = min
    cached_array[1] = max
    cached_array[2] = value
    cached_array.sort!
    cached_array[1]
  end

  cached_array = Array.new

  Benchmark.ips do |x|
    x.report("use-fresh-objects") { clamp_fresh(10, 90, rand(0..100)) }
    x.report("use-cached-objects") { clamp_cached(cached_array, 10, 90, rand(0..100)) }
    x.compare!
  end

Which would you think is faster? The one that allocates a new object each iteration of the inner loop? Or the one that re-uses an existing object each time and doesn't allocate anything?

It's actually the one that allocates a new object each time. The cached one is 1.6x slower in an optimising implementation of Ruby.

It's faster... but the only change I made was I added an object allocation instead of the custom object caching. I went from not allocating any objects to allocating an object and it became faster. This example is so clear because of the last factor I mentioned - scalar replacement.

If you came along and 'optimised' my code based on a cargo cult idea of 'object allocation disastrously slow' you wouldn't be helping would you?

link

kragen 1924 days ago

> You realize this person that you are calling an expert

I didn't claim he's an expert, but he did write TruffleRuby, which is still about twice as fast as any other implementation of Ruby six years later: https://pragtob.wordpress.com/2020/08/24/the-great-rubykon-b.... So I think it's safe to say that he's at least minimally competent at performance engineering :)

> I'm not really sure how anything I'm saying is even controversial

You're applying rules of thumb you've learned in one context in a context where they are invalid, and then you're accusing people who disagree with you of being ignorant or dishonest, even when it would take you literally 30 seconds to verify that what we're saying is correct, and where one of us literally has a doctorate in the specific topic we're discussing.

> Allocating 8 bytes in 150 cycles might seem fast

150 cycles doesn't sound terribly fast to me, though it'd be pretty good for malloc/free under most circumstances. I presented benchmark results from my 9-year-old laptop where LuaJIT did an allocation every 120 clock cycles, SBCL did an allocation every 18 cycles, and OCaml did an allocation every 9.5 cycles. You can easily replicate those results on your own machine. And that isn't the time for the allocation alone—it includes the entire loop containing the allocation, which also initializes the memory, and also the garbage-collection time needed to deallocate the space allocated. (And, in the OCaml case, it includes a recursive loop.)

chrisseaton says that in modern VMs like GraalVM an allocation is about 5 instructions, which I'm guessing is about 3 cycles.

(Incidentally on my laptop with glibc 2.2.5 malloc(16) and free() in a loop only takes about 18 ns, about 50 cycles, 55 million allocations per second. It took a little effort to keep GCC from optimizing away the loop. I wasn't satisfied until I'd single-stepped GDB into __GI___libc_malloc!)

120 cycles is less than 150 cycles. 18 cycles is a lot less than 150 cycles. 9.5 cycles is extremely much smaller than 150 cycles. 3 cycles, which I haven't verified but which sounds plausible, is smaller still.

> A 12 year old CPU can ... add together well over 850 million floats to 850 million other floats in a single second on a single core ...By your own numbers, the allocation alone would take about a minute and a half.

You're completely wrong about "By your own numbers."

It's surely true that you aren't going to get SIMD-like performance by chasing pointers down a singly-linked list of floating-point numbers (is that what you're suggesting?) but not because you can only allocate 10 million list nodes per second; it's because you'll trash your cache with all those worthless cdr pointers, the CPU can't prefetch, and every cache miss costs you a 100-ns stall, which often ties up your entire socket's memory bus, as I specifically pointed out in https://news.ycombinator.com/item?id=26438596. It's true that if you use malloc you'd need a minute or two (and 30 gigabytes!) to allocate 850 million such nodes, or maybe three minutes and 60 gigabytes to allocate two such lists. But if you use the kind of allocator we're talking about in this thread, you can do that much allocation in a few seconds rather than a few minutes, although it's still not a good way to represent your matrices and vectors.

> Neither of you has confronted this.

Well, no, why would we? It's just a silly error you made. Before you made it there was no way to confront it.

> Saying that lots of small allocations has no impact on speed is counter to basically all optimization advice

Allocations aren't zero-cost, even when multiple allocations in a basic block get combined as chrisseaton says GraalVM does, unlike any of the three systems I presented measurements from; at a minimum, allocating more requires the GC to run more often, and you need to store a pointer somewhere that points at the separate allocation. (Though maybe not for very long.) As I mentioned in my other comment, I've written a popular essay exploring this topic at http://canonical.org/~kragen/memory-models.

But the time required to allocate a few bytes can vary over two or three orders of magnitude, from the 3.4 ns I got with OCaml (or maybe 1 ns if several allocations get combined), up to the 43 ns I got with LuaJIT, up to the 100 ns malloc/free typically take, up to the 200 ns we're suffering in Hammer, up to the maybe 1000 ns you might get out of a bad implementation of malloc/free, up to maybe several microseconds on a PIC16 or something.

If you're trading off a 5-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should probably use the allocation, although questions like locality of reference down the road might tip the balance the other way. If you're trading off a 100-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should take the cache miss and remove the allocation. If you're trading off a 7-nanosecond small allocation in SBCL against triggering a 10000-nanosecond write-barrier by way of SBCL's SIGSEGV handler, you should definitely take the small allocation.

So, whether lots of small allocations speed your code up or slow it down depends a lot on how much small allocations cost, which depends on the system you're running on top of and the tradeoffs it's designed for.

You're familiar with a particular set of architectural tradeoffs which make small allocations pretty expensive. That's fine, and those tradeoffs might even be in some sense the objectively best choice; certainly they're the best for some range of applications. But you're mistakenly assuming that those tradeoffs are universal, despite overwhelming evidence to the contrary, going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.

link

CyberDildonics 1924 days ago

You get so worked up but my point with every message is that excessive memory allocation is a performance killer. Part of the reason is cache misses which you went into here. For all your rabbit holes and straying off topic, it's pretty clear you know a lot more about this than chriseaton.

Ruby runs 50x to 100x slower than native software, let alone well optimized native software. It is not a context that makes sense when talking about absolute performance.

> despite overwhelming evidence to the contrary,

The overwhelming evidence is that tiny memory allocations will always carry more overhead than operating on the data they contain. Excessive memory allocations imply small memory allocations and small memory allocations will have overhead that outweighs their contents. This actually is about as universal on modern CPUs as you can get. This is why allocating a few bytes at a time in a loop dominates performance, which all I've really been saying.

> going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.

Relax, you might learn something

link

kragen 1911 days ago

> my point with every message is that excessive memory allocation is a performance killer

Yes, it was clear that that was what you were saying, despite clear and convincing evidence that it's true in some contexts and false in others.

> For all your rabbit holes and straying off topic, it's pretty clear you know a lot more about this than chriseaton.

chrisseaton knows more about this topic than I ever will. You, by contrast, evidently don't know enough about it to understand why the things we were bringing up were relevant, miscategorizing them as "rabbit holes and straying off topic".

> Ruby runs 50x to 100x slower than native software, let alone well optimized native software. It is not a context that makes sense when talking about absolute performance.

As anyone who follows the links provided can see, some performance-critical benchmark libraries built with chrisseaton's TruffleRuby run about 1.1× slower than alternatives written in native C, although performance on larger programs is, so far, less impressive. Even the mainstream Ruby implementation MRI is typically only about 30× slower. It's true that ten years ago Ruby had performance overhead of 50× to 100×.

> Relax, you might learn something

Oh, I've learned a lot from this thread. But it wasn't by reading the GPT-2-generated Eliza-bot rants posted under the name "CyberDildonics" with no understanding of the issues; it was by reading chrisseaton's dissertation, writing and running microbenchmarks, and carefully disassembling compiler output.

link