| We did something very similar at EA Playfish, at least one alumni of which is part of the Uber engineering team. We used a 2 column InnoDB-backed table for all of our data storage, massively sharded, and run in a 3-host master-slave-slave configuration. At that time EC2 would routinely kill hosts without the courtesy of a poke via ACPI and as such we became very good at quickly recovering shards. In a nutshell this mechanism was to have the new host contact a backup slave, perform an lvm snap, pipe the compressed snap over a TCP connection, unroll it and carry on, letting replication take up the delta. That enabled us to not only manage the 10 million or so daily active users of that title, but was also the platform under the 12 or so additional titles that studio had. We had lots and lots of very simple things and failures were contained. I think at the time we were the 3rd-largest consumer of EC2 after Netflix and "another" outfit I never learned the name of. EA being what it was, however, we were never permitted to open source a lot of the cool stuff Netflix and ourselves seemed to develop in parallel. |