Hacker News new | ask | show | jobs
by synthetigram 884 days ago
You might be correct, but it's not obvious to me unsafe is really faster. I recall reading that using unsafe invalidates most of the other optimizations because it's opaque to the JIT what you are doing. Also, if there are BCE opportunities, I feel pretty confident OpenJDK will add them, rather than being unsympathetic.
2 comments

In terms of performance: I realize that this is a somewhat "toy" issue, and it's a sample size of 1, but for the currently ongoing "One Billion Row Challenge"[1] (an ongoing Java performance competition related to parsing and aggregating a 13 GB file), all of the current top-performers are using Unsafe. More specifically, the use of Unsafe appears to have been the change for a few entries that allowed getting below the 3-second barrier in the test.

1. https://github.com/gunnarmorling/1brc

Submissions to this challenge are also hampered by the lack of prefetch intrinsic, vpshufb intrinsic, aesenc intrinsic, and graal's complete lack of vector intrinsics. As a total outsider to the Java ecosystem, it makes it seem like nobody in a place to make changes really knows or cares about enabling high-throughput code to run in the JVM.
What do you mean by that?

Also, graal can do some vectorization, and definitely has some libraries with vector intrinsics, e.g. truffle’s regex implementation uses similar algorithms to simdjson.

I think I was fairly specific? There's no way for a user to do vpshufb, aesenc, or a prefetch instruction. One would expect the former two things to be in jdk.incubator.vector, where they are not present. The last thing was explicitly removed, here's a link to one part of the process of it being removed: https://bugs.openjdk.org/browse/JDK-8068977

Code that uses jdk.incubator.vector does not actually get compiled to vector instructions under graal. This is why the top submission that uses jdk.incubator.vector is the only top solution that does not use graal.

I don't doubt that "similar algorithms to simdjson" may be used, but simdjson uses instruction sequences that are impossible to generate using the tools provided.

I mention these instructions because vpshufb comes up a lot in string parsing and formatting, because prefetch is useful for hiding latency when doing bulk accesses to a hash table, and because aesenc is useful for building cheap hash functions such as ahash.

Ah okay, it wasn’t clear that you were talking about explicit control over this through Vector API and similar.
For my solution (https://github.com/dzaima/1brc, uses jdk.incubator.vector significantly), switching all array reads/writes to identical Unsafe ones results in the solution running ~100x slower, so there's certainly truth to Unsafe messing with optimization - I suppose it means that the VM must consider all live objects to potentially have mutated. (as for the sibling comment, indeed, lack of vpshufb among other things can be felt significantly while attempting to use jdk.incubator.vector)
As the parent comment said: bounds checks. Unsafe is an order of magnitude faster for performance sensitive tasks due to only that.

Quick publication showing the difference (up to 125%!): https://www2.cs.arizona.edu/~dkl/Publications/Papers/ics.pdf

(and no, the story hasn't changed an awful lot since 2004. The gap still exists.)

As a non-java developer, this is surprising to me. I always assumed that bounds checks were "almost free" because the branch predictor will get it right 99% of the time. I can see that being not-true for interpreted runtimes (because the branch predictor gets bypassed, essentially), but I thought Java was JITed by default?

Perhaps the paper answers my question, but I'll admit I'm being lazy here and would much appreciate a tl;dr.

Here's an example of the same issue in Rust: https://dropbox.tech/infrastructure/lossless-compression-wit...

(with bounds checks) "Telling the Rust allocator to avoid zeroing the memory when allocating for Brotli improves the speed to 224 MB/s."

(without bounds checks) "Activating unsafe mode results in another gain, bringing the total speed up to 249MB/s, bringing Brotli to within 82% of the C code."

224MB/s -> 249MB/s (11% Brotli compression perf difference just by eliminating bounds checks)

Couple issues with the linked article, the rust compiler has matured a lot since it was written I didn’t look at the benchmark or the code, but many bounce checks are elided if you use iterators instead of array access and lastly, I would want to see ZSTD benchmark written in normative rust.

I don’t doubt that all of those things in the article were true back then, but that was eight years ago. Wow.

Sure, you're not wrong, but my focus was that the benchmark is indicative of the raw performance impact of bounds checks (when they are unable to elided) in a real world algorithm (as opposed to a micro-benchmark).

With that said, convincing a compiler to elide bounds checks (especially Java's JIT compiler) is a hugely frustrating (and for some algorithms futile) task.

It could be an argument that bounds checks make up a small percentage of total application performance. However, I've profiled production Java servers where >50% of the CPU was encryption/compression. JDK implementations of those algorithms are heavily impacted by (and commonly fail to elide) bounds checks.

Performance matters!

>With that said, convincing a compiler to elide bounds checks (especially Java's JIT compiler) is a hugely frustrating (and for some algorithms futile) task.

Adding explicit checks does work to a certain degree but it can change with the compiler, and it requires to keep checking the generated assembly - not fun (no unsafe, either but still)

I understand where you are coming from, but without comparing the generated assembly, we are comparing an implementation of bounds checks. I think we should have a bounds check instruction and operates concurrently.

I should play around this with this using a couple RISC-V cores.

> Performance matters!

https://www.youtube.com/watch?v=r-TLSBdHe1A by Emery Berger

Another, yesbut, encryption and compression are and will handled by on die accelerators.

That's a good example, but much more inline with my expectations (not on the order of 125%)
Keep in mind that the "11%" applies to the fully implemented Brotli compression algorithm. Bounds-checks are a small part of that.

If you write a micro-benchmark for only bounds-checks, you'd see the larger difference more inline with the "125%"

Bounds checks are not 'free' but predicted ones are cheap. There are way to convince the JIT to remove them - in cases it can prove the index doesn't exceed the byte array, so explicit checks with the length of the array may improve the performance.

The stuff gets worse with DirectByteBuffers as the JIT has to work harder. Unsafe allows to 'remove' all bound checks, but it may prevent some other optimizations.

> but it may prevent some other optimizations.

I see this mentioned a few times in this thread, but I haven't experienced this in practice (and I've written a lot of unsafe in Java). Are there any examples of this?

There are tricks you can do, but it introduces load on the TLB. You basically restrict accesses to 32 bits of memory and isolate that space in virtual memory with its own prefix.
Do we know why is that? I would assume the branch predictor is mostly happy with it, is it just the code cache increase?
Perf impact due to bounds checks can be experienced in statically compiled languages as well (like Go/Rust). Although branch predictor improvements likely narrow the gap, they do not eliminate it.

Code cache is not completely relevant afaik - you can easily replicate in micro-benchmarks.

> statically compiled languages

I don’t think it’s relevant here, the JIT compiler can do the same optimizations here.

If the branch predictor can basically 100% guess correctly (which will be the case in any correct program), it should not have any additional cost, besides taking up place in i$, so I would assume that that is responsible for the difference.

> I don’t think it’s relevant here, the JIT compiler can do the same optimizations here.

Correct. I was clarifying that the issue can be replicated in a "simpler" statically compiled benchmark.

> If the branch predictor can basically 100% guess correctly (which will be the case in any correct program), it should not have any additional cost

That isn't true. The CPU branch predictor has a cost no-matter what. Of course, it's a complicated story: https://blog.cloudflare.com/branch-predictor

Thanks! I definitely have to do some readup on that!
> the story hasn't changed an awful lot since 2004

Um... yeah it has. For starters, hotspot wasn't even a part of the JVM at that point. But further newer JVM additions like the enhanced for loop eliminate a ton of conditions where someone would run into bounds checking. Doing a naked `a[i]` is simply not common java code.

The JVM is far more likely today to remove the bounds check all together than it ever was in 2004.

> Doing a naked `a[i]` is simply not common java code.

It is extremely common in performance sensitive code, 1) graphics & rendering 2) networking 3) buffers

> But further newer JVM additions like the enhanced for loop eliminate a ton of conditions > The JVM is far more likely today to remove the bounds check all together than it ever was in 2004.

There are more comments in this thread that clarify further, but Java is very commonly unable to eliminate bounds checks. You can test all of these things yourself with a quick benchmark - don't take my word for it! The JIT is not as great at this as common rhetoric claims it is.

> For starters, hotspot wasn't even a part of the JVM at that point.

HotSpot had been part of the JVM for five years at that point.

https://en.m.wikipedia.org/wiki/HotSpot_(virtual_machine)

I misread the merge date in the wiki. Hotspot merged into mainline in Java 1.3 in 2000. It was first released as a separate distributable.

That said, in the paper they didn't use sun's JVM they used gcj and their own modified version of gcj.

> We examined the performance of both our new Java implementation as well as standard gcj on a variety of Java applications.

So my point still stands, a lot as changed. The researches chose to use static compilation over a JIT or interpreter.

(and even 20 years ago 125% was still almost order of magnitude less than an order of magnitude)
(and that is quite pedantic)
Order of magnitude in casual conversation typically means 10x. I don't think 2x vs 10x qualifies as pedantic.
Oh I don't disagree at all!

It's a writing error on my end (clear since I qualify the percentage afterwards), so I think focusing on it distracts from the point (which is why I say it's pedantic).

I understand the pet peeve though (-:

(it was the best kind of correct: technically correct)
2004 had java 1.4 + hotspot. 1.5 (w/ generics and stuff) was about to be released. Hotspot was there, so were the Direct Buffers.

Also accessing byte arrays and direct buffers is extremely common - if you do just "business logic" jazz - it does not happen, though. However, every hashmap needs that, pretty much each hash lookup is a direct a[hash]