by hft_throwaway 4430 days ago
The code was QAed, but they didn't test the old and new versions against each other. A Version A receiver could accept a flag that triggered obsolete logic, which would lose control of its orders, but a Version A sender never sent that flag, so the problem never happened. A Version B sender did send the flag, and a Version B receiver would send RPI orders in response. Put a Version B sender and a Version A receiver together and you end up with a disaster.

From a systems perspective, my takeaways on this are:

-Don't re-use a message for a semantically different purpose in a distributed system where you're running different software versions (even in cases where you don't plan to, really, since you may roll back or end up running the wrong code by mistake)

-Version your messages so anything that changes their meaning can only be accepted by a receiver that follows that protocol

-QA old and new builds against one another
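
To make the last two points concrete, here is a minimal sketch, not anything from the actual system. Every name in it (`Message`, `Build`, `UnsupportedVersion`, `compatibility_matrix`, the version numbers 1 and 2) is made up for illustration. The idea is that each message carries the protocol version its sender was built against, a receiver refuses any version it doesn't implement, and the test pairs every sender build with every receiver build instead of only testing each build against itself.

```python
from dataclasses import dataclass
from itertools import product


class UnsupportedVersion(Exception):
    """Raised when a receiver is handed a message version it does not implement."""


@dataclass(frozen=True)
class Message:
    version: int  # protocol version the sender was built against (hypothetical)
    flag: bool    # stands in for the field whose meaning changed between versions


@dataclass(frozen=True)
class Build:
    name: str
    sends: int          # version this build stamps on outgoing messages
    accepts: frozenset  # versions this build knows how to interpret

    def send(self, flag: bool) -> Message:
        return Message(version=self.sends, flag=flag)

    def receive(self, msg: Message) -> str:
        # Reject instead of guessing: an unknown version may reuse a field
        # with a different meaning.
        if msg.version not in self.accepts:
            raise UnsupportedVersion(f"{self.name} cannot handle v{msg.version}")
        return f"{self.name} handled v{msg.version} flag={msg.flag}"


# Two illustrative builds: the old one only understands v1,
# the new one sends v2 and still understands v1.
OLD = Build("old", sends=1, accepts=frozenset({1}))
NEW = Build("new", sends=2, accepts=frozenset({1, 2}))


def compatibility_matrix(builds):
    """Pair every sender build with every receiver build and record the outcome."""
    results = {}
    for sender, receiver in product(builds, repeat=2):
        try:
            receiver.receive(sender.send(flag=True))
            results[(sender.name, receiver.name)] = "accepted"
        except UnsupportedVersion:
            results[(sender.name, receiver.name)] = "rejected"
    return results


if __name__ == "__main__":
    for pair, outcome in compatibility_matrix([OLD, NEW]).items():
        print(pair, outcome)
```

In this sketch the new-sender/old-receiver pairing gets rejected outright rather than silently running old logic on a field that now means something else. Whether a rejection should halt trading, alert someone, or fall back is a separate decision the sketch doesn't try to answer.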

If you really want to look at the root cause of this, it's cultural. Trading desks don't want to spend development time on things that don't generate PnL. Traders want to try lots of ideas, so many features get built that never get used. Code cleanup gets put on the back burner. Developers do sketchy stuff like re-purposing a message field because it's annoying or time-consuming to deploy a new format. If traders aren't developers themselves, they may underestimate the risk of pressuring operations & devs to work more quickly.

Things like this are probably the biggest risk faced by automated traders, and the good shops take it very seriously. I've never been scared of any loss due to poor trading, but losses due to software errors can be astonishing and happen faster than you can stop them.