by vitalyd 3633 days ago
The fantastic standard library mostly goes away because it allocates. It's possible to write Java code that doesn't allocate in steady state, but the coding style becomes terrible (e.g. overuse of primitives, mutable wrappers, manually flattened data structures, etc.).
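To make "manually flattened data structures" concrete, here is a minimal sketch of the style. The class name `FlatPriceLevels` and its methods are made up for illustration, not taken from any real codebase. Instead of a `Map<Long, Long>` or a `List` of small objects, the data sits in parallel primitive arrays allocated once up front, so the update and lookup paths create no garbage.

```java
// Hypothetical example: a fixed-capacity price-level table stored as
// parallel primitive arrays rather than a collection of small objects.
final class FlatPriceLevels {
    private final long[] prices;      // price in ticks
    private final long[] quantities;  // quantity resting at that price
    private int size;

    FlatPriceLevels(int capacity) {
        // The only allocations happen here, before steady state.
        prices = new long[capacity];
        quantities = new long[capacity];
    }

    /** Adds or updates a level. Returns false if the table is full. */
    boolean put(long price, long quantity) {
        for (int i = 0; i < size; i++) {
            if (prices[i] == price) {
                quantities[i] = quantity;   // update in place
                return true;
            }
        }
        if (size == prices.length) {
            return false;                   // caller decides what to do when full
        }
        prices[size] = price;
        quantities[size] = quantity;
        size++;
        return true;
    }

    /** Returns the quantity at the given price, or 0 if the price is absent. */
    long quantityAt(long price) {
        for (int i = 0; i < size; i++) {
            if (prices[i] == price) {
                return quantities[i];
            }
        }
        return 0L;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        FlatPriceLevels levels = new FlatPriceLevels(4);
        levels.put(10_000L, 5L);
        levels.put(10_001L, 7L);
        levels.put(10_000L, 9L);            // overwrites the first level
        System.out.println(levels.quantityAt(10_000L));
    }
}
```

The linear scan and the fixed capacity are deliberate simplifications. The point is only that every piece of convenience you'd normally get from `java.util` (boxing, resizing, iterators) has to be replaced by hand.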

There's also the issue that even without GC running you pay the cost of card marking (GC store barriers) on every reference write. There's also unpredictability from deoptimizations triggered by type/branch profile changes, safepoints triggered by housekeeping, etc.

It's unclear whether that style of Java coding is actually a net win over using languages with a better performance model.

2 comments

Sometimes you just need to allocate, whether due to necessity or expediency.

If you make sure that "almost all" allocations are short-lived, GC is very fast. Allocation is bumping a pointer and cleanup is O(number of new, live objects). It's considerably faster than malloc/free for general-case allocation.
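A toy model of what "bumping a pointer" means, as a sketch only: offsets into a `byte[]` stand in for heap addresses, and the class name `BumpRegion` is made up. A real young generation would copy the surviving objects out before reusing the region, which is where the O(live objects) cost comes from; this model only covers the case where nothing survives.

```java
// Illustrative model of bump-pointer allocation. Not how any particular
// JVM is implemented; offsets into a byte[] play the role of addresses.
final class BumpRegion {
    private final byte[] memory;
    private int top;   // offset of the next free byte

    BumpRegion(int capacityBytes) {
        memory = new byte[capacityBytes];
    }

    /** Returns the start offset of the new block, or -1 if it does not fit. */
    int allocate(int sizeBytes) {
        if (sizeBytes < 0) {
            throw new IllegalArgumentException("negative size");
        }
        long aligned = (sizeBytes + 7L) & ~7L;     // round up to 8 bytes
        if (aligned > (long) memory.length - top) {
            return -1;   // a real VM would trigger a young collection here
        }
        int start = top;
        top += (int) aligned;                      // the "bump"
        return start;
    }

    /** Frees everything at once; the cost is independent of the amount of garbage. */
    void reset() {
        top = 0;
    }

    int used() {
        return top;
    }
}
```

The fast path is an add and a bounds check, which is the basis of the comparison with general-purpose malloc/free.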

It all depends on what time scale we're talking about since "very fast" is relative. High performance native systems don't use (in any meaningful manner) naive malloc/free, so that comparison is somewhat moot. I hear this argument quite often when Java vs C++/C is discussed, but it's not comparing idioms/techniques in actual use.

Also don't forget that when GC runs it trashes your d/i-caches; temporaries/garbage allocs reduce your d-cache efficacy; GC must suspend and resume the Java threads, which means trips to the kernel scheduler; there are some pathologies with Java threads reaching/detecting safepoints.

GC store barriers (aka card marking) don't have anything to do with thread contention (apart from one thing, which I'll note later). This is a commonly used technique to record old->young gen references, and serves as a way to reduce the set of roots when doing young GC only (i.e. you don't need to scan the entire heap). So this isn't about thread contention, per se -- with the exception that you can get false sharing due to an implementation detail, such as in Oracle's HotSpot.

The card table is an array of bytes. Each heap object's address can be mapped into a byte in this array. Whenever a reference is assigned, HotSpot needs to mark the byte covering the referrer as dirty. The false sharing comes about when different threads end up executing stores where the objects requiring a mark end up mapping to bytes that are on the same cacheline - fairly nasty if you hit this problem as it's completely opaque. So HotSpot has a -XX:+UseCondCardMark flag that tries to neutralize this by first checking if the card is already dirty, and if so, skips the mark; as you can imagine, this inserts an extra compare and branch into the existing card marking code - no free lunch.
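Roughly, the mechanics look like the sketch below. This is a model in plain Java, not HotSpot source: the class `CardTableModel`, the 512-byte card size, the 64-byte cache line, and the byte values used for clean/dirty are all assumptions chosen for illustration.

```java
// Illustrative model of a card table. Sizes and byte values are assumptions.
final class CardTableModel {
    static final int CARD_SHIFT = 9;        // assumed: one card covers 512 bytes of heap
    static final int CACHE_LINE_SHIFT = 6;  // assumed: 64-byte cache lines
    static final byte CLEAN = 0;            // arbitrary values for this model
    static final byte DIRTY = 1;

    private final long heapBase;
    private final byte[] cards;

    CardTableModel(long heapBase, long heapSizeBytes) {
        this.heapBase = heapBase;
        long cardSize = 1L << CARD_SHIFT;
        this.cards = new byte[(int) ((heapSizeBytes + cardSize - 1) >>> CARD_SHIFT)];
    }

    /** Maps a heap address to the index of the byte that covers it. */
    int cardIndex(long address) {
        return (int) ((address - heapBase) >>> CARD_SHIFT);
    }

    /** Unconditional barrier: always writes the card byte. */
    void mark(long referrerAddress) {
        cards[cardIndex(referrerAddress)] = DIRTY;
    }

    /**
     * Conditional barrier: reads first and writes only if the card is not
     * already dirty. Returns true if a write actually happened.
     */
    boolean markIfClean(long referrerAddress) {
        int i = cardIndex(referrerAddress);
        if (cards[i] == DIRTY) {
            return false;   // extra compare and branch, but no store
        }
        cards[i] = DIRTY;
        return true;
    }

    boolean isDirty(long address) {
        return cards[cardIndex(address)] == DIRTY;
    }

    /**
     * True if the card bytes for two addresses fall on the same cache line,
     * assuming the table itself starts on a cache-line boundary.
     */
    boolean cardsShareCacheLine(long a, long b) {
        return (cardIndex(a) >>> CACHE_LINE_SHIFT) == (cardIndex(b) >>> CACHE_LINE_SHIFT);
    }
}
```

Under these assumed sizes, one cache line of card bytes covers 64 x 512 bytes = 32 KiB of heap, so two threads writing references into unrelated objects anywhere within the same 32 KiB window store to the same line. The conditional variant only helps once the card is already dirty: the repeated stores turn into loads, which don't invalidate the line in other cores' caches.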

The idea is, there's a space between "performance doesn't matter" and "so fast it can't use malloc" in the trade-offs of software development. It turns out that space is very large.

"Performance-critical code" can even go in that space in an environment where developer cycles and program safety are things that matter, which is definitely the case in HFT.

Sure, but that space isn't just Java anymore anyway.

Also, what's a (non-toy) environment where developer productivity and safety/correctness don't matter? I always find that statement bizarre when talking about production systems.

No, the GC is a net win from the perspective of code development. The JIT is just one of the things that makes Java not as slow as you'd expect.

As I said, the JVM is an acceptable platform for the slower HFT. That's the kind where a clever predictive strategy matters (maybe with lead time of seconds) and you'll get more money from accurately predicting the future than from shaving off 250us.

Make no mistake - you'll still make money shaving off 250us, but not so much that you want to be bogged down structuring your code the C++ "if we structure it right we won't leak things" way.

You should've made it explicit then that you're referring to slow HFT -- the post I was replying to drew no such distinction apart from saying the "extreme end" uses FPGAs. Obviously, if young gen GC pauses aren't an issue, then there's nothing to talk about here. But then I'd argue that's not really HFT (although I know the term is quite vague), and such a system is no different from other types of systems. There are other issues with GC and garbage allocations, such as d-cache pollution, but I suppose there's no need to really discuss them given the type of system you're discussing.

I know you were throwing 250us out there as a pseudo example, but that's actually a very long time even outside of UHFT/MM.

Also don't forget that your trading daemons will be under a fire hose consuming marketdata, so beyond being able to tick-to-trade quickly, you need to be able to consume that stream without building up a substantial backlog (or worse, OOM or enter permanent gapping).