Hacker News new | ask | show | jobs
by iscoelho 888 days ago
As the parent comment said: bounds checks. Unsafe is an order of magnitude faster for performance sensitive tasks due to only that.

Quick publication showing the difference (up to 125%!): https://www2.cs.arizona.edu/~dkl/Publications/Papers/ics.pdf

(and no, the story hasn't changed an awful lot since 2004. The gap still exists.)

4 comments

As a non-java developer, this is surprising to me. I always assumed that bounds checks were "almost free" because the branch predictor will get it right 99% of the time. I can see that being not-true for interpreted runtimes (because the branch predictor gets bypassed, essentially), but I thought Java was JITed by default?

Perhaps the paper answers my question, but I'll admit I'm being lazy here and would much appreciate a tl;dr.

Here's an example of the same issue in Rust: https://dropbox.tech/infrastructure/lossless-compression-wit...

(with bounds checks) "Telling the Rust allocator to avoid zeroing the memory when allocating for Brotli improves the speed to 224 MB/s."

(without bounds checks) "Activating unsafe mode results in another gain, bringing the total speed up to 249MB/s, bringing Brotli to within 82% of the C code."

224MB/s -> 249MB/s (11% Brotli compression perf difference just by eliminating bounds checks)

Couple issues with the linked article, the rust compiler has matured a lot since it was written I didn’t look at the benchmark or the code, but many bounce checks are elided if you use iterators instead of array access and lastly, I would want to see ZSTD benchmark written in normative rust.

I don’t doubt that all of those things in the article were true back then, but that was eight years ago. Wow.

Sure, you're not wrong, but my focus was that the benchmark is indicative of the raw performance impact of bounds checks (when they are unable to elided) in a real world algorithm (as opposed to a micro-benchmark).

With that said, convincing a compiler to elide bounds checks (especially Java's JIT compiler) is a hugely frustrating (and for some algorithms futile) task.

It could be an argument that bounds checks make up a small percentage of total application performance. However, I've profiled production Java servers where >50% of the CPU was encryption/compression. JDK implementations of those algorithms are heavily impacted by (and commonly fail to elide) bounds checks.

Performance matters!

>With that said, convincing a compiler to elide bounds checks (especially Java's JIT compiler) is a hugely frustrating (and for some algorithms futile) task.

Adding explicit checks does work to a certain degree but it can change with the compiler, and it requires to keep checking the generated assembly - not fun (no unsafe, either but still)

These explicit checks are basically what I mean by "convincing the compiler" (and sometimes, it isn't quite convinced!) - Yeah. Not fun at all.

Especially in Java, because "the assembly" can change as the JIT evolves. What is optimized today may not be tomorrow.

I understand where you are coming from, but without comparing the generated assembly, we are comparing an implementation of bounds checks. I think we should have a bounds check instruction and operates concurrently.

I should play around this with this using a couple RISC-V cores.

> Performance matters!

https://www.youtube.com/watch?v=r-TLSBdHe1A by Emery Berger

Another, yesbut, encryption and compression are and will handled by on die accelerators.

That's a good example, but much more inline with my expectations (not on the order of 125%)
Keep in mind that the "11%" applies to the fully implemented Brotli compression algorithm. Bounds-checks are a small part of that.

If you write a micro-benchmark for only bounds-checks, you'd see the larger difference more inline with the "125%"

Bounds checks are not 'free' but predicted ones are cheap. There are way to convince the JIT to remove them - in cases it can prove the index doesn't exceed the byte array, so explicit checks with the length of the array may improve the performance.

The stuff gets worse with DirectByteBuffers as the JIT has to work harder. Unsafe allows to 'remove' all bound checks, but it may prevent some other optimizations.

> but it may prevent some other optimizations.

I see this mentioned a few times in this thread, but I haven't experienced this in practice (and I've written a lot of unsafe in Java). Are there any examples of this?

There are tricks you can do, but it introduces load on the TLB. You basically restrict accesses to 32 bits of memory and isolate that space in virtual memory with its own prefix.
Do we know why is that? I would assume the branch predictor is mostly happy with it, is it just the code cache increase?
Perf impact due to bounds checks can be experienced in statically compiled languages as well (like Go/Rust). Although branch predictor improvements likely narrow the gap, they do not eliminate it.

Code cache is not completely relevant afaik - you can easily replicate in micro-benchmarks.

> statically compiled languages

I don’t think it’s relevant here, the JIT compiler can do the same optimizations here.

If the branch predictor can basically 100% guess correctly (which will be the case in any correct program), it should not have any additional cost, besides taking up place in i$, so I would assume that that is responsible for the difference.

> I don’t think it’s relevant here, the JIT compiler can do the same optimizations here.

Correct. I was clarifying that the issue can be replicated in a "simpler" statically compiled benchmark.

> If the branch predictor can basically 100% guess correctly (which will be the case in any correct program), it should not have any additional cost

That isn't true. The CPU branch predictor has a cost no-matter what. Of course, it's a complicated story: https://blog.cloudflare.com/branch-predictor

Thanks! I definitely have to do some readup on that!
> the story hasn't changed an awful lot since 2004

Um... yeah it has. For starters, hotspot wasn't even a part of the JVM at that point. But further newer JVM additions like the enhanced for loop eliminate a ton of conditions where someone would run into bounds checking. Doing a naked `a[i]` is simply not common java code.

The JVM is far more likely today to remove the bounds check all together than it ever was in 2004.

> Doing a naked `a[i]` is simply not common java code.

It is extremely common in performance sensitive code, 1) graphics & rendering 2) networking 3) buffers

> But further newer JVM additions like the enhanced for loop eliminate a ton of conditions > The JVM is far more likely today to remove the bounds check all together than it ever was in 2004.

There are more comments in this thread that clarify further, but Java is very commonly unable to eliminate bounds checks. You can test all of these things yourself with a quick benchmark - don't take my word for it! The JIT is not as great at this as common rhetoric claims it is.

> For starters, hotspot wasn't even a part of the JVM at that point.

HotSpot had been part of the JVM for five years at that point.

https://en.m.wikipedia.org/wiki/HotSpot_(virtual_machine)

I misread the merge date in the wiki. Hotspot merged into mainline in Java 1.3 in 2000. It was first released as a separate distributable.

That said, in the paper they didn't use sun's JVM they used gcj and their own modified version of gcj.

> We examined the performance of both our new Java implementation as well as standard gcj on a variety of Java applications.

So my point still stands, a lot as changed. The researches chose to use static compilation over a JIT or interpreter.

(and even 20 years ago 125% was still almost order of magnitude less than an order of magnitude)
(and that is quite pedantic)
Order of magnitude in casual conversation typically means 10x. I don't think 2x vs 10x qualifies as pedantic.
Oh I don't disagree at all!

It's a writing error on my end (clear since I qualify the percentage afterwards), so I think focusing on it distracts from the point (which is why I say it's pedantic).

I understand the pet peeve though (-:

(it was the best kind of correct: technically correct)
2004 had java 1.4 + hotspot. 1.5 (w/ generics and stuff) was about to be released. Hotspot was there, so were the Direct Buffers.

Also accessing byte arrays and direct buffers is extremely common - if you do just "business logic" jazz - it does not happen, though. However, every hashmap needs that, pretty much each hash lookup is a direct a[hash]