by colin_mccabe 3949 days ago
Go already gives better tools than Java for managing native memory through cgo, which is a far less painful interface than JNI (believe me, I've written a lot of JNI for Hadoop). Go also has value types which is a huge win for managing memory. (And even if Java gets value types later, the whole Java standard library will still take years to change to use them, if it ever does.)

Plus, sun.misc.Unsafe is probably going away, according to Oracle. http://www.infoq.com/news/2015/07/oracle-plan-remove-unsafe

From what I've heard, Azul has a great GC, but the throughput is extremely low. It's really only a practical solution for high frequency finance and places like that where latency is everything, and throughput is nothing (can buy another 100 servers or high end hardware.) Note: I'm talking about their software product which runs on vanilla hardware, not their hardware product, which I understand is far superior.

A lot of people on HN also seem to be taking the statement that maximum GC latency will be 10ms as a statement that there will often be 10ms pauses. Hopefully, the average latency will be far less, in the 1 or 2ms range, and 10ms will be something that only happens on huge heaps in certain conditions. This should be similar to what has happened on Android, where GC pauses are pretty rare and typically only 1 or 2ms.
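To make the difference between a maximum pause and a typical pause concrete, here is a minimal sketch in Java. The pause samples are made up for illustration (they are not measurements from Go, Android, or any JVM), and `PauseStats` is a hypothetical helper name.

```java
import java.util.Arrays;

final class PauseStats {
    /** Arithmetic mean of the pause samples, in milliseconds. */
    static double mean(double[] pausesMs) {
        double sum = 0;
        for (double p : pausesMs) {
            sum += p;
        }
        return sum / pausesMs.length;
    }

    /** Nearest-rank percentile for q in (0, 100]; q = 100 returns the maximum. */
    static double percentile(double[] pausesMs, double q) {
        double[] sorted = pausesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical samples: mostly 1-2 ms pauses plus one rare 10 ms pause.
        double[] pauses = {1, 2, 1, 1, 2, 1, 2, 1, 1, 10};
        System.out.println("mean = " + mean(pauses) + " ms");
        System.out.println("p50  = " + percentile(pauses, 50) + " ms");
        System.out.println("max  = " + percentile(pauses, 100) + " ms");
    }
}
```

With samples like these, a 10ms bound is met even though the median pause is far smaller, which is the distinction being drawn above.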


> the whole Java standard library will still take years to change to use them, if it ever does.

All Java collections (which are what really matter) are being retrofitted alongside the introduction of value types. As soon as value types are introduced (and the HotSpot team is working hard on that right now), all collections will be fully value-ready.

> Go already gives better tools than Java for managing native memory through cgo, which is a far less painful interface than JNI

JNI is being replaced by Project Panama: http://openjdk.java.net/projects/panama/ (you can already use a similar FFI with JNR, which serves as a blueprint for Panama's FFI: https://github.com/jnr I've used JNR to write a FUSE filesystem in Java without a line of C, and unlike JNA, it's fast!)

Besides, HotSpot runs C now, too, and quite well:

http://www.chrisseaton.com/rubytruffle/cext/

https://dl.dropboxusercontent.com/u/292832/useR_multilang_de...

> Plus, sun.misc.Unsafe is probably going away, according to Oracle.

... only to be replaced by something much better: https://www.youtube.com/watch?v=ycKn18LtNtk (Unsafe isn't going away until replacements are available).

> From what I've heard, Azul has a great GC, but the throughput is extremely low.

Not at all. Just a little lower than with HotSpot's throughput collector, and possibly higher than with G1 (although G1 changes a lot, so that might not be true).

It feels weird to say this, but Oracle's stewardship of the jvm is making me really hopeful as an occasional ml developer.

Consider the wish list: I want a garbage collected language where, for a handful of large/important data structures, I can sidestep gc and carefully control memory layouts for cache friendliness. I'd also like direct interop with blas and my aforementioned data structures.

It looks like I may get all of this!

And yes, I've done a bunch of work with misc.unsafe but it's nowhere near as nice as it could be. What the jvm really buys you is not having to build once for each platform; I distributed code that relied on c++11 features on 3 platforms while there was mixed compiler support and it was a bloody nightmare.

> I can sidestep gc and carefully control memory layouts for cache friendliness

Memory layout and GC are two completely orthogonal issues. You will be able to control memory layout quite well with Valhalla (value types) and even on a finer-grained level with Panama if you need C interoperability. VarHandles (hopefully in Java 9) will give you safe access to off-heap memory. Currently you can do that with Unsafe, which is more work but still less than C++.
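As a rough illustration of keeping a large structure off the heap with a hand-chosen layout, here is a sketch that uses a direct `ByteBuffer` from the standard library. Note the swap: it uses neither Unsafe nor VarHandles, just the plain NIO API that exists today. The record layout, the offsets, and the class name `OffHeapPoints` are assumptions made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class OffHeapPoints {
    // Hypothetical record layout, chosen by hand: x at +0, y at +8, id at +16.
    static final int X = 0;
    static final int Y = 8;
    static final int ID = 16;
    static final int STRIDE = 24;

    private final ByteBuffer buf;

    OffHeapPoints(int capacity) {
        // The backing memory of a direct buffer is allocated outside the
        // garbage-collected heap; records sit back to back in one block.
        buf = ByteBuffer.allocateDirect(capacity * STRIDE)
                        .order(ByteOrder.nativeOrder());
    }

    void set(int i, double x, double y, long id) {
        int base = i * STRIDE;
        buf.putDouble(base + X, x);
        buf.putDouble(base + Y, y);
        buf.putLong(base + ID, id);
    }

    double x(int i) { return buf.getDouble(i * STRIDE + X); }
    double y(int i) { return buf.getDouble(i * STRIDE + Y); }
    long id(int i)  { return buf.getLong(i * STRIDE + ID); }

    /** Sequential scan over the first count records. */
    double sumX(int count) {
        double sum = 0;
        for (int i = 0; i < count; i++) {
            sum += x(i);
        }
        return sum;
    }
}
```

The absolute `getDouble(int)` / `putDouble(int, double)` accessors are bounds-checked, so this is safer than Unsafe, at the cost of doing the offset arithmetic by hand.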

> What the jvm really buys you is not having to build once for each platform

Oh, I'd say it buys you a lot more: seamless polyglotism, exceptional performance even for dynamic stuff (dynamic languages, esp. w/ Graal, but even cool bytecode manipulation in Java or even simple code loading/swapping), and you get all that performance with unprecedented observability into the running platform.

Value types will provide the ability to allocate storage embedded in a heap object or on the stack, but they don't provide layout control (i.e. order of fields in the layout). It's a good change, but let's not exaggerate.
As the requirement was "layout control for cache friendliness" value types are all you need (or 99.99% of what you can possibly need). For interop, there's Panama. Let's not nitpick.
99.99% is perhaps your estimate, but not necessarily others. This is also not likely what people would consider "layout control" if they're coming from a language that allows field-level layout control. Being able to place frequently used together fields manually is quite useful in quite a few circumstances.
I think I disagree about the observability -- vtune is a lot easier to use when just tuning straight C++ rather than java
Are you familiar with JMH's perfasm? http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html

And for profiling apps on production, I've yet to encounter a more thorough, low-overhead profiler than Java Flight Recorder.

no but i'll check it out this afternoon. Thanks!
VTune is not too dissimilar to JProfiler.

And as pron mentioned plenty of tools exist for lower level access.

VTune gives access to PMU counters and attributes them to assembly. JProfiler is a purely Java-level profiler (it won't even tell you hot spots in the JVM itself, never mind assembly). They're not really comparable.
jprofiler, at least for my use cases, isn't really similar to vtune at all. I know what my hot spots are: it's the inner bits of algorithms that run a few billion to a few trillion times. What I need to do is understand, as granularly as possible, the exact instructions and how the various caches and memory are operating. Convex and tree optimizers are generally memory speed limited and my goal is to have this code run at eg 0.9+ of memory b/w speed.
Which of the JNR projects is the one you used? There are three that sound like they would do the exact same thing:

https://github.com/jnr/jnr-invoke https://github.com/jnr/jnr-ffi https://github.com/jnr/jffi

jnr-ffi is the high-level API (and the one I used). jffi is a low-level FFI used by jnr-ffi. Unfortunately, jnr-ffi is still missing one piece (which can be done with the low-level FFI), namely creating a function pointer to a Java method dynamically and passing it to C code.
I don't know if Azul Zing throughput is "extremely low" but it does sacrifice it for latency. I had a discussion with Gil Tene on this recently where he dove into some details: https://groups.google.com/forum/m/#!topic/mechanical-sympath...

In particular, the LVB is akin to an array range check on each reference read.

> "...From what I've heard, Azul has a great GC, but the throughput is extremely low. It's really only a practical solution for high frequency finance and places like that where latency is everything, and throughput is nothing..."

Well, you heard wrong.

Zing is used in plenty of throughput-intensive and throughput-centric applications, and sustainable throughput on Zing tends to be higher (not lower) than with other JVMs. E.g. Cassandra clusters tend to carry higher production loads per machine when powered by Zing (compared to OpenJDK or HotSpot on the same hardware). All while dramatically improving their latency behavior and consistency.

Specifically, on similarly sized heaps and workloads, the C4 collector's throughput is better than CMS's and close to ParallelGC's. And since its throughput scales linearly with the amount of empty heap configured and since (unlike OpenJDK/HotSpot) Zing places no practical pause-related caps on how much memory can be applied, it tends to beat both on efficiency in actual configurations.
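One way to see why collector throughput can scale with empty heap is a toy model, sketched below in Java. It assumes each collection cycle does work proportional to the live set and frees up the empty part of the heap for new allocation. That is a textbook-style simplification, not a description of C4's internals, and `HeapHeadroomModel` and its methods are hypothetical names.

```java
final class HeapHeadroomModel {
    /**
     * GC work per allocated byte, in arbitrary units.
     * Assumption: one cycle costs work proportional to liveBytes and
     * makes (heapBytes - liveBytes) available for new allocation.
     */
    static double gcWorkPerAllocatedByte(long heapBytes, long liveBytes) {
        long emptyBytes = heapBytes - liveBytes;
        if (emptyBytes <= 0) {
            throw new IllegalArgumentException("heap must be larger than the live set");
        }
        return (double) liveBytes / emptyBytes;
    }

    /** Bytes allocated per unit of GC work; linear in empty heap for a fixed live set. */
    static double allocationPerUnitOfGcWork(long heapBytes, long liveBytes) {
        return 1.0 / gcWorkPerAllocatedByte(heapBytes, liveBytes);
    }
}
```

Under these assumptions, a 4 GB live set in an 8 GB heap leaves 4 GB empty, and growing the heap to 12 GB doubles the empty space and halves the GC work per allocated byte.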

The notion that good latency behavior has to come at the expense of throughput is just a silly myth. There are plenty of examples that disprove it. Zing/C4 is just one of many.

I thought Zing was basically HotSpot licensed and modified to use C4 (plus a few other minor things). How come it's faster than HotSpot overall? Are there a lot of Azul-specific compiler optimisations there too?
- Zing is based on HotSpot, and its biggest visible change is C4, but it changes a lot more than just the collector. E.g. it addresses pretty much all the causes of JVM glitches. (You can see a discussion of the many other reasons JVMs pause here: https://www.youtube.com/watch?v=Y39kllzX1P8).

- The reason Zing tends to carry higher throughput in production is that in most Java-based systems, production throughput levels are limited not by system capacity, but by how far you can drive the JVMs before the glitches start being unbearable, and what looks like occasional small hiccups at lower throughputs starts looking more like epileptic seizures at load. Capacity planning and sizing usually aim to keep peak production loads below the levels that lead to these "I don't want to go there" behaviors. By taking out the various glitching/pausing/stalling behaviors typically associated with JVMs under load, Zing extends the smooth operating range such that it comes much closer to the traditional "how much can this hardware handle?" capacity and sizing behavior people are used to in non-Java and non-GC'ed environments.

- When you compare raw throughput or speed (with no SLAs, e.g. "how long does this multi-hour batch job take to complete?"), with similar configurations Zing is usually comparable to OpenJDK/HotSpot [Where comparable typically means within +/-15-20% range. Sometimes faster, sometimes slower.] But once people apply simple knob twists in the applications (like turning up heap sizes, caches, and other using-memory-no-longer-hurts related settings) they often get more raw throughput per instance or machine through simple efficiency benefits (like the elimination of raw work that comes from higher in-process, on-heap cache hit rates).
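The capacity-planning argument in the points above can be sketched as a small calculation: the usable load is the highest level at which tail latency still meets the SLA, not the highest level the hardware can physically push. The Java below is a minimal sketch; the class name `SlaCapacity` is hypothetical and any numbers fed to it are illustrative, not measurements of Zing or HotSpot.

```java
final class SlaCapacity {
    /**
     * Returns the highest load level whose measured p99 latency is within
     * the SLA, or -1 if even the lowest level violates it.
     * Assumes loads is sorted ascending and p99Ms[i] was measured at loads[i].
     */
    static int maxLoadWithinSla(int[] loads, double[] p99Ms, double slaMs) {
        int best = -1;
        for (int i = 0; i < loads.length; i++) {
            if (p99Ms[i] > slaMs) {
                break; // first violation: higher loads are outside the smooth range
            }
            best = loads[i];
        }
        return best;
    }
}
```

For example, with hypothetical load levels of 1000 to 5000 requests/s and a 20ms SLA, a runtime whose p99 goes 5, 8, 40, 200, 900 ms is usable up to 2000, while one whose p99 goes 5, 6, 8, 12, 30 ms is usable up to 4000 on the same hardware.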

Thanks, fascinating answer.