by Yoric
2327 days ago
For Session Restore, there were two problems. One of them was that it had gained lots of features and the architecture wasn't adapted to them anymore. That's the kind of thing that happens with all long-lived projects. The other one is that getting file safety right is really hard. The OS lies to the developer, the filesystem lies to the OS and the hardware lies to everyone. For most applications, that's not a big deal, but for something that used to run every 15 seconds on hundreds of millions of computers, this can cause data loss. That's why you really want to use a DBMS rather than roll your own format if data safety is critical. For shutdown, it was also a case of the initial architecture not matching the current situation anymore. Firefox was initially a synchronous, single-threaded, single-process architecture, but this had stopped being the case for a few years already. At some point, shutdown needed to be re-architected, and I happened to be the one who managed to convince people that the time was now :)
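To make the "file safety is hard" point concrete, here is a minimal sketch of the usual write-to-temp, fsync, rename pattern. It is not Firefox's Session Restore code; the `atomic_write` name and the `.tmp-` prefix are made up for illustration. Even this only narrows the window for data loss, since it still relies on the OS, filesystem, and hardware honoring `fsync`.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new content.

    Hypothetical helper for illustration, not an API from any project.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename below
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Ask the OS to push the file contents to stable storage.
            os.fsync(tmp.fileno())
        # Swap the new file into place, replacing any existing file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # On POSIX, also fsync the directory so the rename itself is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Calling `atomic_write("session.json", b"{}")` every few seconds is where the cost shows up: each call forces a flush, which is part of why a DBMS that batches and journals writes is usually the better choice.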
That reminds me of the case where AWS famously revealed that a single NIC in their ginormous S3 fleet flipped a single bit once in a while, which caused an outage because the gossip daemon responsible for fleet health checks failed spectacularly [2].
Bryan Cantrill's talk on the related topic of hardware/firmware bugs is pretty good [3].
[0] https://danluu.com/fsyncgate/
[1] https://perspectives.mvdirona.com/2017/04/at-scale-rare-even...
[2] https://youtube.com/watch?v=swQbA4zub20&t=46m02s
[3] https://youtube.com/watch?v=fE2KDzZaxvE