Hacker News
by alex3305 1024 days ago
In my experience, just a little bit of insider knowledge goes a long way toward writing better code. Arrays are fun things, especially when you do a deep dive into the System.arraycopy() function. But the same goes for all Collections in Java. For instance, most of them have a default size (mostly 10), and growing them is a costly operation. So knowing beforehand how large your collection can or may be, can benefit code. I put this to good use when parsing large XML documents.

I recommend that everyone who uses a managed language (Java, C# or others) at least get a basic understanding of these fundamentals, and also know which collection type to use when.
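To make the presizing point concrete, here is a minimal sketch, assuming the element count is known before the loop starts. The class and method names are made up for illustration; only the `ArrayList(int initialCapacity)` constructor comes from the JDK.

```java
import java.util.ArrayList;
import java.util.List;

final class PresizeSketch {
    // Collects the squares of 0..count-1 into a list whose capacity is set up front.
    static List<Integer> squares(int count) {
        // Passing the known element count means the backing array never has to grow.
        List<Integer> result = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            result.add(i * i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(squares(5)); // prints [0, 1, 4, 9, 16]
    }
}
```

Whether this is worth doing depends on how often the code runs and how large the lists get, so treat it as an option to measure rather than a rule.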

5 comments

> For instance, most of them have a default size (mostly 10), and growing them is a costly operation. So knowing beforehand how large your collection can or may be, can benefit code.

It's really a tricky balance. Over-allocating collections "just in case" can quite often be very expensive as well, because large array allocations tend to be fairly slow (e.g. they typically won't fit in the TLAB).

It's one of those things where you usually have to let profiling and other observations guide your approach. 99.9% of the time it doesn't really matter and the default behavior is fine. But I can think of a few times where this has been a big deal.

One in particular - I was profiling an application with low-latency needs and GC was taking up a ton of time. Mission Control showed tons of array allocations - at one point the code was creating a bunch of lists in a loop and adding stuff to them, which triggered the creation of new underlying arrays. We found that a) many of the arrays were just over the first resizing size, and b) there was a good heuristic that we could use to give them an initial size that would never have to be expanded and wouldn't result in huge amounts of waste.

This had a pretty dramatic effect on our GC times and the overall latency. I think this is where the JVM really shines - tons of tooling to help you profile and observe these kinds of details to help you figure out when you actually need to care about stuff like the initial array capacity.

Depends a lot on what you're doing too. I do a fair bit of heavy data processing work with my search engine (tokenizing something like a billion documents into arrays of words etc), and allocator contention has a pretty huge performance impact for that type of work.

My intuition is that the best thing is to aim for the expected median size, rather than the maximum as one might assume would be the most performant. The maximum strategy minimizes re-allocations, but at the expense of always making costlier allocations.

I think it depends a lot on the other details, especially how expensive the extra GC will be vs the wasted space. Hard to give a rule that will work in all contexts.

In our case, it wasn't a single hard-coded number - the input data gave us the upper bound, and the difference between the upper bound and the median case was so small that going with the upper bound worked out best.
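As a rough illustration of "the input data gave us the upper bound", here is a hypothetical sketch (not the application described above): when splitting a string on commas, the number of fields can never exceed the number of commas plus one, so that count can serve as the initial capacity. All names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundedSplitSketch {
    // Splits on ',' and skips empty fields, sizing the list from an input-derived bound.
    static List<String> splitOnComma(String input) {
        // Upper bound: one field per comma, plus one for the trailing field.
        int upperBound = 1;
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == ',') {
                upperBound++;
            }
        }
        // The list can never outgrow this capacity, so no resize happens.
        List<String> fields = new ArrayList<>(upperBound);
        int start = 0;
        for (int i = 0; i <= input.length(); i++) {
            if (i == input.length() || input.charAt(i) == ',') {
                String field = input.substring(start, i);
                if (!field.isEmpty()) {
                    fields.add(field); // empty fields are dropped, so size <= upperBound
                }
                start = i + 1;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitOnComma("a,b,,c")); // prints [a, b, c]
    }
}
```

The extra counting pass is only a win if the bound is close to the typical size, which matches the median-versus-maximum trade-off discussed above.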

> It's really a tricky balance. Over-allocating collections "just in case" can quite often be very expensive as well

It is sometimes really tricky. When I worked with streaming XML documents that were gigabytes in size, there was a really fine margin to work with.

However, some general knowledge can be pretty useful. I saw colleagues just do "= new ArrayList<>(1000);" without considering the collection type or possible size. And besides being a bit ignorant, it can also be really confusing for other developers who take a first look at such code.

> TLAB

TLB?

The TLAB is the Thread Local Allocation Buffer.

In short and a bit simplified, normally when you allocate memory, the allocator needs to synchronize between threads because RAM is a shared resource. This means that a thread that allocates a lot can disrupt the performance of other threads, among other weird effects. But there's a small buffer called the TLAB owned by each thread where this isn't true: Allocation in the TLAB doesn't require synchronization. The TLAB makes allocating small ephemeral objects much faster.

This is a good explanation. See also Shipilev's JVM Anatomy {Park|Quark} episode: https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/
Thread Local Allocation Buffer
I cringe every single time I see a for loop for what System.arraycopy() has been providing since early days.

For better or worse, it shows me that the author isn't that into Java.

I cannot for the life of me remember the argument order, so I write the manual code and let IntelliJ convert it.
Doesn't autocomplete show the arguments? I usually use NetBeans when I write Java, so no idea if IntelliJ is just that bad.
IntelliJ definitely shows the arguments lol
It shows the argument names and even highlights the current one where the cursor is. But sometimes my thought process is just different.
In Java, it's src array, src offset, dest array, dest offset, length. It's a natural order of from, to.

It's C memcpy() that's the odd one out by putting the destination before the source.
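For reference, a small sketch of that argument order in use. The helper name and its parameters are made up for the example; `System.arraycopy(src, srcPos, dest, destPos, length)` is the actual JDK signature.

```java
import java.util.Arrays;

final class ArrayCopyOrderSketch {
    // Copies `length` elements starting at src[srcPos] into a new zero-filled
    // array of size destLength, starting at index destPos.
    static int[] copyInto(int[] src, int srcPos, int destLength, int destPos, int length) {
        int[] dest = new int[destLength];
        // Order: source array, source offset, destination array, destination offset, count.
        System.arraycopy(src, srcPos, dest, destPos, length);
        return dest;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5};
        // Take 3 elements starting at src[1] and place them at dest[2].
        int[] dest = copyInto(src, 1, 6, 2, 3);
        System.out.println(Arrays.toString(dest)); // prints [0, 0, 2, 3, 4, 0]
    }
}
```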

memcpy argument order matches the left-to-right arrangement of assignment. lhs = rhs means rhs is copied to lhs; memcpy(lhs, rhs) means the contents at rhs are copied to lhs.
British people generally don't, but Americans very often use "to...from".
They do, and it's jarring!

Even though I find memcpy and friends to be perfectly logical using the assignment analogy suggested by thwarted, I often need to re-read English sentences written that way.

> I cringe every single time I see a for loop for what System.arraycopy () has been providing since early days.

The worst thing is that System.arraycopy() is an optimized native call, which can be much faster than copying by hand [1].

> For better or worse, it shows me that the author isn't that into Java.

The thing is though, most of the time arrays in Java are used because of performance. Or maybe ignorance. Because why would anyone voluntarily give up all the comforts of a List<T>? It's not that Collections are very hard to find in the documentation. And most of the time IntelliJ suggests switching to a Collection anyway.

1. https://www.javaspecialists.eu/archive/Issue124-Copying-Arra...
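For anyone who hasn't seen the two side by side, here is a minimal sketch of the hand-written loop and the equivalent System.arraycopy() call. The class and method names are invented for the example, and it only checks that both produce the same result; it makes no timing claims, since those depend on the JVM version and the JIT.

```java
import java.util.Arrays;

final class ManualVsArrayCopySketch {
    // Hand-written element-by-element copy.
    static int[] copyWithLoop(int[] src) {
        int[] dest = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }
        return dest;
    }

    // The same copy expressed as a single library call.
    static int[] copyWithArrayCopy(int[] src) {
        int[] dest = new int[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);
        return dest;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5};
        // Both approaches yield arrays with identical contents.
        System.out.println(Arrays.equals(copyWithLoop(data), copyWithArrayCopy(data))); // prints true
    }
}
```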

Or it might be that the person has used multiple programming languages, across which the order/meaning of copy arguments varies a lot, and thus prefers not to remember the decision of each language (if not for writing (at which point the IDE could help), then for reading). Whereas a loop is always easy to read and write equally in all languages, and it's really not unreasonable to expect it to perform well enough (if not as well as System.arraycopy, then at least well enough to be insignificant compared to the actual important logic in the code).
Given that I have programmed in dozens of languages since 1986, and have to jump between C#, Java, C++, TypeScript, Transact-SQL and PL/SQL for work, plus whatever is needed to keep the customer happy, that isn't an argument I would sympathise with in code reviews.
Yet, somehow I doubt you write perfect code in all those languages. Do you cringe at yourself and conclude you just don’t care also?
No I don't, and when someone cringes looking at my code, I shut up, apply the fix and get to improve my skills on the language, instead of excusing myself.
So you agree you should be judged as someone who doesn’t care in those cases also, right? You didn’t mention anything about people excusing themselves initially, just that you judged them. I just hope you hold yourself to the same standard.
There are more options than "write perfect code" and "not even attempt to learn how to write idiomatic code".
Yeah, of course. But maybe you should just be happy to share knowledge with those lacking it, rather than cringing and making some kind of personal judgement of them.
I agree. Understanding the inner workings of languages and their runtimes is IMHO what gets you one step closer to being a senior. Luckily, early in my career I had a few seniors on the team who knew a lot about Java and shared their knowledge about its behavior.
Does anyone have any good book recommendations or links for insider knowledge of the JVM/Java? If Clojure focused all the better :)
There is JVM Anatomy Quarks.

I can also recommend reading the JVM specification itself. It is surprisingly less dry than one might think, and while it's not a novel, it's a good read. Oh, and of course anything written by Brian Goetz, usually about some new feature.

Maybe it's not really up your alley. But I learned Java with Java in Action with BlueJ [1]. Although it's pretty basic, the textbook really explains all the Java (and OOP) basics in a pretty clear way. The book is called Objects First [2].

In addition, I really enjoyed exploring the JDK documentation. Especially Java <1.7 is extremely manageable. Java 8 introduced lambdas and streams, which make Java way more fun, but also a tad harder to learn.

It's not exactly JVM, but just wanted to share anyway :).

1. https://www.java.com/en/java_in_action/bluej.jsp

2. https://www.bluej.org/objects-first/

The default size of an ArrayList has been 0 for a while. On the first insertion, it is initialized to 10.
That's a bit semantic, isn't it? Because in practice it's still 10, but lazily initialized [1]. And an empty ArrayList is useless anyway.

1. https://stackoverflow.com/a/34250231