Hacker News new | ask | show | jobs
by sam_pointer 3622 days ago
We did something very similar at EA Playfish, at least one alumni of which is part of the Uber engineering team.

We used a 2 column InnoDB-backed table for all of our data storage, massively sharded, and run in a 3-host master-slave-slave configuration.

At that time EC2 would routinely kill hosts without the courtesy of a poke via ACPI and as such we became very good at quickly recovering shards. In a nutshell this mechanism was to have the new host contact a backup slave, perform an lvm snap, pipe the compressed snap over a TCP connection, unroll it and carry on, letting replication take up the delta.

That enabled us to not only manage the 10 million or so daily active users of that title, but was also the platform under the 12 or so additional titles that studio had.

We had lots and lots of very simple things and failures were contained.

I think at the time we were the 3rd-largest consumer of EC2 after Netflix and "another" outfit I never learned the name of. EA being what it was, however, we were never permitted to open source a lot of the cool stuff Netflix and ourselves seemed to develop in parallel.

1 comments

this is frikking awesome! do you have any of the lvm and pipe scripts publicly available ? i'm kinda struggling with building and setting up lvm on ec2 automatically and was wondering if there is any tidbits you can pass along.
Unfortunately not. That company and codebase is very much dead, and I don't own any of it. Even if I did it would be a bunch of context-less shell (in the first iteration) and then a bunch of context-less Chef (in the second platform iteration).

Things I do remember:

- We used ephemeral volumes for all data stores. This was pre-provisioned IOPs and EBS was flaky as heck back then. I can't remember the disk layout, although we did experiment a great deal.

- We took great pains to ensure there was enough space to make the snapshot (IIRC is was a telemetry/monitoring item)

- The pipe scripts were essentially "netcat".

The best I can offer is this talk: http://vimeo.com/57861199