by mjpt777 5349 days ago
Single nodes can die in the system without issue. They often do! Since we use IP multicast, the network failure is transparent, as a replica takes up the primary role.

The one issue to manage with this type of system is exceptions in the business logic thread. This can be handled with a number of prevention techniques. First, apply very strict validation to all input parameters. Second, take a test-driven approach to development; at LMAX we have tens of thousands of automated tests. Third, write methods so that they are either idempotent or compute their results in local variables and apply the changes only at the end, once the business logic is complete. With this combination of approaches we have not seen a production outage due to business logic thread failure in over a year of live operation.
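
To make the first and third techniques concrete, here is a minimal sketch in Java. It is not LMAX code: the `AccountLedger` class, its methods, and the account model are all hypothetical and exist only for illustration. Each method validates its inputs first, does its work in local variables, and writes to shared state only in the final lines, so an exception thrown partway through leaves the state untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; names and model are assumptions, not LMAX code.
final class AccountLedger {
    private final Map<String, Long> balances = new HashMap<>();

    long balanceOf(String account) {
        return balances.getOrDefault(account, 0L);
    }

    void deposit(String account, long amount) {
        // Strict validation before anything else.
        if (account == null || account.isEmpty()) {
            throw new IllegalArgumentException("account is required");
        }
        if (amount <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        // Compute in a local; addExact throws on overflow before any write.
        long updated = Math.addExact(balanceOf(account), amount);
        // Apply the change last.
        balances.put(account, updated);
    }

    void transfer(String from, String to, long amount) {
        // Strict validation of every input parameter.
        if (from == null || to == null || from.equals(to)) {
            throw new IllegalArgumentException("two distinct accounts are required");
        }
        if (amount <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        long fromBalance = balanceOf(from);
        if (fromBalance < amount) {
            throw new IllegalStateException("insufficient funds");
        }
        // All arithmetic happens on locals; either line may throw,
        // and if it does, no balance has been modified yet.
        long newFrom = Math.subtractExact(fromBalance, amount);
        long newTo = Math.addExact(balanceOf(to), amount);
        // Commit step: plain writes that cannot fail half-way through the logic.
        balances.put(from, newFrom);
        balances.put(to, newTo);
    }
}
```

If `transfer` fails for any reason (bad input, insufficient funds, arithmetic overflow), both balances are exactly as they were before the call, so the business logic thread can reject the message and carry on with consistent state.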