by kitd 2744 days ago
They have a large pool of developers working in HFT etc, which has a history of using Java.

That always amazed me. Why do people keep torturing themselves by applying Java to a use case where it is clearly not the right technology? I watched a very interesting talk by the LMAX people, and half of it was about how to overcome garbage collection pauses, latency, etc.
I researched this a few months back. I don't blame you for saying that Java is wrong for HFT. My first reaction was the same, but there's much more to it.

Java has a few things going for it in HFT. The obvious pluses are that it's mature and memory safe. What's less obvious is that you can make it low-latency. It takes a lot of work, but it's doable, at which point you have all the nice things: mature ecosystem, speed, latency, safety. It takes a lot of work because Java was always oriented towards server use cases, as in high-throughput, not low-latency. That's changing, by the way: there are two new GC engines coming out that are low-latency oriented. Also, there have been third-party JVMs with low-latency guarantees for quite a while.

Of course, what's between the lines is that there aren't any easy answers for HFT people. You either choose mature safety and do GC gymnastics (because everything is throughput oriented), or you choose manual memory management, which is its own gymnastics.

Anyway, that's my take. I welcome input and contradictions.

Mostly I agree, but there is one factor that makes C++ (or any natively compiled language) better than Java for HFT: the ability to lie to the optimizer about what the hot path is. In HFT you have thousands of no-trades for every trade, so the Java optimizer will treat the no-trade code path as the more likely one. Then, when the trade happens, Java pays a CPU branch misprediction penalty at the only time low latency matters.

You have to have good algorithms optimized to the max for this to matter though.

That’s indeed a problem, but it can be mitigated.

If you have a low-latency trading component written in Java, a common trick is to continuously bombard it with ‘fake’ inputs to keep the desired code paths nice and hot.

The fake inputs should be virtually indistinguishable from real ones that you would normally act on. The more subtle the difference, the better, e.g., just flip the sign on the timestamp field.

You can use that subtle difference to pick whether the order goes out to the real exchange or a fake exchange. The decision needs to avoid actual branching instructions, though, or the JVM will likely optimize out the ‘real’ hot path, and you’ll fall back to interpreted mode when an actionable ‘real’ event comes in. I usually use a branchless selector to index into an array ([0] goes to a real socket, [1] goes to a black hole socket).
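For anyone who wants to see the shape of that, here is a minimal sketch, assuming the "flip the sign on the timestamp" convention mentioned above, so the sign bit of the timestamp becomes the array index. `OrderSink`, `CountingSink`, and `BranchlessRouter` are hypothetical names for illustration, and the sinks only count messages instead of writing to sockets. Whether the JIT ends up emitting a truly branch-free sequence depends on the JVM; this only shows the source-level idea.

```java
/** Hypothetical destination for an order; a real system would wrap a socket. */
interface OrderSink {
    void send(long timestamp, long orderId);
}

/** Counts what it receives so the sketch can be checked without a network. */
final class CountingSink implements OrderSink {
    long received;

    @Override
    public void send(long timestamp, long orderId) {
        received++;
    }
}

final class BranchlessRouter {
    // sinks[0] is the real exchange, sinks[1] is the black hole.
    private final OrderSink[] sinks;

    BranchlessRouter(OrderSink real, OrderSink blackHole) {
        this.sinks = new OrderSink[] { real, blackHole };
    }

    /** Extracts the sign bit: 0 for a real (non-negative) timestamp, 1 for a fake one. */
    static int selector(long timestamp) {
        return (int) (timestamp >>> 63);
    }

    /** Real and fake events take the same source-level path; only the index differs. */
    void route(long timestamp, long orderId) {
        sinks[selector(timestamp)].send(timestamp, orderId);
    }

    public static void main(String[] args) {
        CountingSink real = new CountingSink();
        CountingSink blackHole = new CountingSink();
        BranchlessRouter router = new BranchlessRouter(real, blackHole);

        // Warmup traffic: negative timestamps mark the events as fake.
        for (long i = 1; i <= 10_000; i++) {
            router.route(-i, i);
        }
        // One real event with an ordinary positive timestamp.
        router.route(1_700_000_000_000L, 10_001L);

        System.out.println("real=" + real.received + " fake=" + blackHole.received);
    }
}
```

Both sinks are the same concrete class here on purpose: as far as I know that keeps the `send` call site monomorphic, so the JIT has no reason to specialize on the receiver type.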

You can also use this technique to make sure you can respond to very rare events quickly. For example, you may want to respond to news signals from Bloomberg. Actionable news is rare, so if you want to keep your news parsing/analysis code warmed up and in the cache, it needs to constantly be reacting to warmup data.

Interesting. Have you ever introduced an error in an attempt to lower latencies with this method?
Thankfully, no (knocks on wood). But you have to base your design around the idea that real and fake trades are indistinguishable until the last possible moment. That critical requirement needs to always be in your mind.

I would never try to bolt those kinds of optimizations onto an existing system. It’d be too easy to miss something.

I was with you until you mentioned branch prediction... isn't branch prediction a hardware feature? How do you trick the HW branch predictor into predicting the unlikely case?
The CPU still needs to load code in via instruction cacheline fetches. While it waits on an instruction fetch, that core isn't doing much.

The compiler alleviates this somewhat by putting the hot path right under the branch instruction so that the fetch that grabs the branch also grabs the start of the hot path as part of the same cacheline.

It sounds minimal, but if that fetch is swapped out of L2 cache due to long periods of inactivity, it can take upwards of 100ns, which starts to add up in HFT.

Yes it is a hardware feature. However the hardware can be given hints as to which branch is more likely. This is generally documented by the manufacturer, in one of those technical documents aimed at compiler writers.

With profile-guided optimization it is possible for the compiler to have much better information about branches than the CPU can guess. Java applies profile-guided optimization in real time; with C++ it is much more complex to apply.

> However the hardware can be given hints as to which branch is more likely.

I don't think that is the case for modern (last 8 years or so) Intel processors. For example, I'm under the impression that gcc's __builtin_expect only affects the layout of the generated code. However I'd love to learn something new here; do you have a source or any additional info you could share?

I think you're discussing two different phases of optimization. PGO and Java's JIT use branch information to emit different machine code. Hardware branch prediction takes machine code and determines which branches in the machine code are taken, and speculates based on that information. There's an underlying pattern that both follow, but they're very different.
That's interesting, never thought about that.
Unless you write HFT code, or follow talks by those writing HFT code, you probably wouldn't. When you write HFT code you have to look at profiles and think about cache misses until branch misses become something to consider. If you don't come up with that idea, someone else will, and they will beat you to every trade and put you out of business.
So, if I want to manipulate the market I need to work out (or poison) the hot path and have a system that's tailored to being faster on profitable - if colder - paths?
I have been out of the HFT space for many years now, but your conclusions make sense. Here is my own take on the subject from my short tenure at an HFT firm: https://news.ycombinator.com/item?id=12053159
One important thing to know is that allocating memory in C or C++ has high and/or unpredictable latency relative to the target latency of HFT code. So for critical paths, you end up needing to do the same kind of pre-allocation tricks in C/C++ that you would need to do in Java. There is some benefit to being able to write more natural code in the non-critical paths, but that has to be weighed against the overall development advantages that lead people to choose Java over C/C++ in other industries.
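A minimal sketch of the kind of pre-allocation trick meant here, on the Java side: every object is created up front, and the critical path only borrows and returns them. `Order` and `OrderPool` are hypothetical names for illustration, the pool is single-threaded, and it does not guard against releasing the same object twice.

```java
/** Hypothetical mutable order message that is reused across events. */
final class Order {
    long id;
    long price;
    int quantity;

    /** Resets the fields so stale data never leaks into the next use. */
    void clear() {
        id = 0;
        price = 0;
        quantity = 0;
    }
}

/** Fixed-size pool: all allocation happens in the constructor. */
final class OrderPool {
    private final Order[] free;
    private int top; // number of orders currently available

    OrderPool(int capacity) {
        free = new Order[capacity];
        for (int i = 0; i < capacity; i++) {
            free[i] = new Order();
        }
        top = capacity;
    }

    /** Returns a pre-allocated order, or null when the pool is exhausted. */
    Order acquire() {
        return top == 0 ? null : free[--top];
    }

    /** Clears the order and puts it back for reuse. */
    void release(Order order) {
        order.clear();
        free[top++] = order;
    }

    int available() {
        return top;
    }
}
```

Returning null on exhaustion instead of allocating a fresh object is a deliberate choice in this sketch: it keeps the hot path from ever touching the allocator, at the cost of having to size the pool correctly ahead of time.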
C and C++ (and Rust!) can put objects on the stack, so you have a lot more leeway to write normal idiomatic code without hitting the allocator.

When Java gets value objects, this sort of work will begin to get a lot easier in Java as well, but there will be a lot of catching-up to do.

A very good point. Most of my GC whispering is done in .NET, which also has value types, but I was guessing that escape analysis in Java would solve most of the problems in this regard. Is that not actually the case?
The answer: development time and runtime safety. You don't want your HFT system to blow up with a seg fault when the stock exchange is crashing.
How would you weigh the two against each other in terms of importance? Could something like Rust be used to avoid the latter?
The most important factor is how well your language can be optimized. HFT is winner takes all. If the Rust optimizer is even slightly worse than your competitor's language, you will make nothing. Development speed might get you a faster algorithm for a few weeks, but your competition will notice you making that money and will catch up despite their slower pace of development. Then their faster language will make the difference and you lose all future trades.
We use LLVM, so we have the same optimizer as clang.
LLVM is not known as the best optimizer though (but benchmarks tend to lie, and LLVM is always pretty close). There are also subtle areas where the front end can generate code that the back end cannot optimize as well (though given equal effort I'd expect this advantage to go to newer languages that are designed for modern optimizers - but effort is not equal, with C++ getting far more love).
Integration. At some level the code has to integrate with other systems and in banks they tend to be Java systems.
In trading, correctness is still a lot more important than latency. Using Java instead is about risk avoidance, which is why you also see people using OCaml.