Hacker News new | ask | show | jobs
by marginalia_nu 1035 days ago
> For instance, most of them have a default size (mostly 10), and growing them is a costly operation. So knowing beforehand how large your collection can or may be, can benefit code.

It's really a tricky balance. Over-allocating collections "just in case" can quite often be very expensive as well, since large array allocations tend to be fairly slow (since e.g. they typically won't fit in the TLAB).

3 comments

It's one of those things where you usually have to let profiling and other observations guide your approach. 99.9% of the time it doesn't really matter and the default behavior is fine. But I can think of a few times where this has been a big deal.

One in particular - I was profiling an application with low-latency needs and GC was taking up a ton of time. Mission control showed tons of allocations of arrays - at one point it was creating a bunch of lists in a loop and adding stuff to them, which triggered creating a new underlying array. We found that a) Many of the arrays were just over the first resizing size, and b) There was a good heuristic that we could use to give them an initial size that would never have to be expanded and wouldn't result in huge amounts of waste.

This had a pretty dramatic effect on our GC times and the overall latency. I think this is where the JVM really shines - tons of tooling to help you profile and observe these kinds of details to help you figure out when you actually need to care about stuff like the initial array capacity.

Depends a lot on what you're doing too. I do a fair bit of heavy data processing work with my search engine (tokenizing something like a billion documents into arrays of words etc), and allocator contention has a pretty huge performance impact for that type of work.

My intuition is that the best thing is to aim for the expected median size, rather than the maximum as one might assume would be the most performant. The maximum strategy minimizes re-allocations, but at the expense of always making costlier allocations.

I think it depends a lot on the other details, especially how expensive the extra GC will be vs the wasted space. Hard to give a rule that will work in all contexts.

In our case, it wasn't a single hard-coded number - the input data gave us the upper bound, and the difference between the upper bound and the median case was so small that going with the upper bound worked out best.

> It's really a tricky balance. Over-allocating collections "just in case" can quite often be very expensive as well

It is sometimes really tricky. When I worked with streaming XML documents that were gigabytes in size, there is a really fine margin you have to work with.

However some general knowledge can be pretty useful. I saw colleagues just do "= new ArrayList<?>(1000);" without considering the collection type or possible size. And besides being a bit ignorant, it can also be really confusing for other developers that take first look at such code.

> TLAB

TLB?

The TLAB is the Thread Local Allocation Buffer.

In short and a bit simplified, normally when you allocate memory, the allocator needs to synchronize between threads because RAM is a shared resource. This means that a thread that allocates a lot can disrupt the performance of other threads, among other weird effects. But there's a small buffer called the TLAB owned by each thread where this isn't true: Allocation in the TLAB doesn't require synchronization. The TLAB makes allocating small ephemeral objects much faster.

This is a good explanation. See also Shipilev's JVM Anatomy {Park|Quark} episode: https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/
Thread Local Allocation Buffer