by x0x0 3949 days ago
It feels weird to say this, but Oracle's stewardship of the jvm is making me really hopeful as an occasional ml developer.

Consider the wish list: I want a garbage collected language where, for a handful of large/important data structures, I can sidestep gc and carefully control memory layouts for cache friendliness. I'd also like direct interop with blas and my aforementioned data structures.

It looks like I may get all of this!

And yes, I've done a bunch of work with sun.misc.Unsafe, but it's nowhere near as nice as it could be. What the JVM really buys you is not having to build once for each platform; I distributed code that relied on C++11 features on 3 platforms while compiler support was mixed, and it was a bloody nightmare.


> I can sidestep gc and carefully control memory layouts for cache friendliness

Memory layout and GC are two completely orthogonal issues. You will be able to control memory layout quite well with Valhalla (value types) and even on a finer-grained level with Panama if you need C interoperability. VarHandles (hopefully in Java 9) will give you safe access to off-heap memory. Currently you can do that with Unsafe, which is more work but still less than C++.
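
For what it's worth, here is a minimal sketch of the safe off-heap route, assuming a JDK where VarHandles are available (Java 9 or later). `OffHeapDoubles` is a made-up name for illustration, not an existing API: it keeps a vector of doubles in a direct `ByteBuffer`, so the data sits outside the Java heap, and reads and writes go through a bounds-checked VarHandle instead of Unsafe.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical example class: a fixed-size vector of doubles stored off-heap.
final class OffHeapDoubles {
    // VarHandle that views a ByteBuffer as doubles; the index it takes is a BYTE offset.
    private static final VarHandle DOUBLE_AT =
        MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());

    private final ByteBuffer buf;
    private final int length;

    OffHeapDoubles(int length) {
        this.length = length;
        // A direct buffer's storage lives outside the Java heap, in one contiguous block.
        this.buf = ByteBuffer.allocateDirect(length * Double.BYTES);
    }

    void set(int i, double v) {
        DOUBLE_AT.set(buf, i * Double.BYTES, v);
    }

    double get(int i) {
        return (double) DOUBLE_AT.get(buf, i * Double.BYTES);
    }

    // Sequential pass over the contiguous storage.
    double sum() {
        double s = 0.0;
        for (int i = 0; i < length; i++) {
            s += get(i);
        }
        return s;
    }
}
```

An out-of-range index throws `IndexOutOfBoundsException` rather than corrupting memory, which is the main practical difference from the Unsafe version.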

> What the jvm really buys you is not having to build once for each platform

Oh, I'd say it buys you a lot more: seamless polyglotism, exceptional performance even for dynamic stuff (dynamic languages, esp. w/ Graal, but even cool bytecode manipulation in Java or even simple code loading/swapping), and you get all that performance with unprecedented observability into the running platform.

Value types will provide the ability to allocate storage embedded in a heap object or on the stack, but they don't provide layout control (i.e. the order of fields in the layout). It's a good change, but let's not exaggerate.

As the requirement was "layout control for cache friendliness", value types are all you need (or 99.99% of what you can possibly need). For interop, there's Panama. Let's not nitpick.

99.99% is perhaps your estimate, but not necessarily others'. This is also not likely what people would consider "layout control" if they're coming from a language that allows field-level layout control. Being able to place frequently used together fields manually is quite useful in quite a few circumstances.

> Being able to place frequently used together fields manually is quite useful in quite a few circumstances.

I have almost never seen this make a difference outside of, say, GPU programming. The fact that Java's optimizer is much better than that of Go will make a much larger difference in execution speed.

This is mostly an issue for large objects (i.e. ones that span multiple cache lines) whose fields have different access patterns (i.e. clusters of fields that are accessed together).
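
To make the "clusters of fields" point concrete, here is a rough sketch of the workaround available on the JVM today: since you can't order fields inside an object, you split the record by access pattern into parallel arrays. The `Particles` class and its field names are invented for illustration.

```java
// Hypothetical records, split by access pattern instead of stored as one large object each.
final class Particles {
    // Hot cluster: read and written together on every step, each stored contiguously.
    final double[] x;
    final double[] v;
    // Cold cluster: touched only occasionally, so kept out of the hot cache lines.
    final long[] id;
    final String[] label;

    Particles(int n) {
        x = new double[n];
        v = new double[n];
        id = new long[n];
        label = new String[n];
    }

    // The inner loop streams through the hot arrays only.
    void step(double dt) {
        for (int i = 0; i < x.length; i++) {
            x[i] += v[i] * dt;
        }
    }
}
```

This gets the hot data into dense cache lines, but it is a restructuring of the program, not field-level layout control within a single object.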

The other aspect of layout control is cacheline padding, which is also not present in the JVM. There's @Contended, but it's a blunt tool and not currently a public API (it's in sun.misc).
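
For reference, the usual manual workaround looks roughly like the sketch below. It leans on an assumption, not a guarantee: that the JVM lays out superclass fields before subclass fields, which is how HotSpot behaves in practice but is not part of any specification. The class names are made up, and the 7 longs assume 64-byte cache lines.

```java
// Padding placed before the hot field (assumes superclass fields are laid out first).
class PadBefore {
    long p0, p1, p2, p3, p4, p5, p6;
}

// The field we want alone on its cache line.
class HotValue extends PadBefore {
    volatile long value;
}

// Padding placed after the hot field.
final class PaddedCounter extends HotValue {
    long q0, q1, q2, q3, q4, q5, q6;

    // Single-writer only: a read-modify-write on a volatile field is not atomic.
    void add(long delta) {
        value = value + delta;
    }

    long get() {
        return value;
    }
}
```

It works, but it is exactly the kind of thing you'd rather express directly than encode in a class hierarchy.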

>The fact that Java's optimizer is much better than that of Go will make a much larger difference in execution speed

Yes, but that's orthogonal.

But value types do let you group fields (with sub-component values).

Also, I've noticed that whenever I say Java does X, you say, "Oh, no! It does X - ε!" Now, to me, that's nitpicking, especially considering that a perfect general-purpose language/runtime designed to be simple (for some definition of simple) should give you 90+% performance in 99% of general-purpose use cases (or 95% in 95% etc.). If it does any better then one of two possibilities is true: 1/ it's magic, or 2/ it's not a perfect simple language/runtime because it could have been made simpler (by whatever definition of simple it's chosen).

Anyone who can't settle for anything less than 100% performance or does something that's outside 95% of the use cases knows not to use such a general-purpose language/runtime, and, instead, uses a more domain-specific language/runtime or one that's not designed to be simple.

>But value types do let you group fields (with sub-component values).

They let you treat the fields of a value type as a "blob"; you have no control over how they're laid out within that blob.

>Also, I've noticed that whenever I say Java does X, you say, "Oh, no! It does X - ε!" Now, to me, that's nitpicking, especially considering that a perfect general-purpose language/runtime designed to be simple (for some definition of simple) should give you 90+% performance in 99% of general-purpose use cases (or 95% in 95% etc.). If it does any better then one of two possibilities is true: 1/ it's magic, or 2/ it's not a perfect simple language/runtime because it could have been made simpler (by whatever definition of simple it's chosen).

Nothing personal, but I find your JVM-related posts borderline fanboyism (and I say this as someone who greatly respects the engineering in HotSpot, despite certain things bugging me). It's not about 100% or 95% performance; it's about not making exaggerated claims since, as you say, there's no magic.

X - ε, for very large values of ε.

I think I disagree about the observability: VTune is a lot easier to use when just tuning straight C++ rather than Java.

Are you familiar with JMH's perfasm? http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html

And for profiling apps on production, I've yet to encounter a more thorough, low-overhead profiler than Java Flight Recorder.

No, but I'll check it out this afternoon. Thanks!

VTune is not too dissimilar to JProfiler.

And as pron mentioned plenty of tools exist for lower level access.

VTune gives access to PMU counters as well as attributing them to assembly. JProfiler is a purely Java-level profiler (it won't even tell you about hot spots in the JVM itself, never mind assembly). They're not really comparable.

JProfiler, at least for my use cases, isn't really similar to VTune at all. I know what my hot spots are: it's the inner bits of algorithms that run a few billion to a few trillion times. What I need to do is understand, as granularly as possible, the exact instructions and how the various caches and memory are operating. Convex and tree optimizers are generally memory-speed limited, and my goal is to have this code run at e.g. 0.9+ of memory bandwidth.
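
As a very rough illustration of what "fraction of memory bandwidth" means here, the sketch below times sequential passes over a large array and reports bytes read per nanosecond, which is numerically the same as GB/s. `BandwidthSketch` is a made-up name, and this is not a rigorous benchmark (no JMH, no control over JIT or prefetching); it only gives a ballpark to compare against whatever peak figure you trust for your machine.

```java
// Hypothetical, deliberately naive read-bandwidth estimate.
final class BandwidthSketch {
    // Written to after timing so the JIT cannot discard the summation loops.
    static volatile double sink;

    // One sequential pass over the array.
    static double sum(double[] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Approximate read bandwidth in GB/s (10^9 bytes per second) over `passes` passes.
    static double gbPerSecond(double[] a, int passes) {
        double acc = 0.0;
        // Untimed warm-up passes so the loop is likely compiled before timing starts.
        for (int p = 0; p < 3; p++) {
            acc += sum(a);
        }
        long t0 = System.nanoTime();
        for (int p = 0; p < passes; p++) {
            acc += sum(a);
        }
        long nanos = Math.max(1L, System.nanoTime() - t0);
        sink = acc;
        double bytes = (double) a.length * Double.BYTES * passes;
        return bytes / nanos; // bytes per nanosecond == GB/s
    }
}
```

The array needs to be much larger than the last-level cache for the number to say anything about main memory rather than cache.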