by ksec
1337 days ago
> We had a whole load of racks

A whole load of racks for running Basecamp? Are we talking about 42U per rack, and a total of 420U of servers? The scale seems quite massive, at least compared to the perception I had of Basecamp. It would be nice to see those specifications, see how much of an improvement there is 10 years later, and whether we could fit it all into 2-3 racks.
Basecamp started on one single Rackspace server, before I was there. I started at 20 people, left at ~55.
When I left, I don't remember exactly how many racks there were, but it was more than 10 and fewer than 30 in the primary location, with 42U of servers in each. There was a mix of a whole load of (Dell, never pay list price!) blade servers, DB appliances (~12 in total across ~6 apps ISTR), Isilon storage[0][1], F5 kit, Juniper routers etc. etc. We had some epically fast storage in some of the servers for the time, way faster than SSDs.
Later we added two more sites. One in I think Virginia, one in NY. The one in Virginia was a replica of what we needed to run Basecamp, the one in NY was a half-rack data replication location (I think I got that the right way round). We had 10G fibre (we rented wavelengths not the whole fibre) between each location. We could lose one DC and remain RW for our block data, 2 DCs and we'd have to drop down to RO. Block data was things like uploads, so DBs, search etc. wouldn't have been affected. We could lose one of the /main/ DCs and still be RW for everything.
With all this kit we were able to run both main DCs hot. With our Geo DNS you could hit either of our DCs and you'd get served pages, you could even write to both locations. One DC was always the "RO" DC, it always replicated the databases. If you tried to write to that DC we proxied your request to the RW DC over our 10G links and proxied any more requests you made for n seconds to the RW DC too, at which point we reverted you back to the RO DC.
Now, NY to Virginia isn't that far, so why bother with the hot/hot config? Because it played into the rest of our plan, which was DC failover. With some pretty epic voodoo (Juniper/F5/OpenResty etc.) we could fail over the datacentres, swapping the RO and RW locations. We could also do this if one of the locations was unavailable. We could do this in 4 seconds /without losing a single in-flight request/ (we tested it).
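For anyone who wants the routing rule above in concrete terms, here is a rough Python sketch. It is not the real setup (that lived in Juniper/F5/OpenResty config); the `Router` class, the `dc_a`/`dc_b` names, the `pin_seconds` value and the per-client pin table are all made up for illustration. It only shows the idea: a write landing on the RO DC goes to the RW DC, that client stays pinned there for n seconds, and a failover is just swapping the two roles.

```python
import time


class Router:
    """Toy model of hot/hot routing with one RW and one RO datacentre.

    Hypothetical sketch: names and data structures are assumptions,
    not the actual Basecamp implementation.
    """

    def __init__(self, rw_dc, ro_dc, pin_seconds=5.0, clock=time.monotonic):
        self.rw_dc = rw_dc
        self.ro_dc = ro_dc
        self.pin_seconds = pin_seconds  # the "n seconds" from the text
        self.clock = clock              # injectable so it can be tested
        self._pinned_until = {}         # client id -> time the pin expires

    def route(self, client_id, entry_dc, is_write):
        """Return the DC that should serve a request arriving at entry_dc."""
        # Requests that already landed on the RW DC are served locally.
        if entry_dc == self.rw_dc:
            return self.rw_dc
        now = self.clock()
        if is_write:
            # Write hit the RO DC: proxy it and pin the client to the RW DC.
            self._pinned_until[client_id] = now + self.pin_seconds
            return self.rw_dc
        if now < self._pinned_until.get(client_id, float("-inf")):
            # Still inside the pin window, keep proxying to the RW DC.
            return self.rw_dc
        # Pin expired (or never set): revert the client to the RO DC.
        self._pinned_until.pop(client_id, None)
        return entry_dc

    def fail_over(self):
        """Swap the RW and RO roles between the two datacentres."""
        self.rw_dc, self.ro_dc = self.ro_dc, self.rw_dc
```

Pinning after a write is what lets a client read back what it just wrote without waiting for replication to catch up on the RO side. The sketch says nothing about how in-flight requests survive the swap, which was the hard part.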
This ended up a bit longer than I was intending, but it illustrates a few things:
- I think people underestimate Basecamp. It's /huge/ (money and users, not employees). Not so much in the tech world (anymore), but even with this kit, even with the (at the time 6) sysadmins that maintained it, it still made a shitload of cash. I guesstimated the net worth of the two owners as in the hundreds of millions of dollars each, entirely because of Basecamp. Because it's a privately owned company, that cash all went to the owners (who gave some of it to us, they treated us fairly well). I think it was patio11 who said that people underestimate the market for software. It's easy to do, these aren't human-scale numbers.
- The owners are right about keeping it private. You might not make "larger yacht than Larry Ellison" money, but you sure as hell have a lot higher chance of making "pretty damn big yacht and not having to work again" sort of money. If I was rolling the dice I know which I'd gamble on.
- I don't think I ever actually calculated it, but the amount of money our servers were worth while running was /immense/. When you went to the DC and looked at the row of racks, you could kinda see just how dense the $ value there was. The efficiencies that the cloud brings give money to AWS, not to the clients.
- All of this was still /way/ cheaper than running cloud infra. I'm not against the cloud (I use it), but we made more because we had this infra than if we were on the cloud. We were dense when it came to # customers per $ spent on infra.
- The flexibility we had because we controlled everything was also extraordinary. It came at a cost, we had to do it ourselves, but we had a /lot/ of power to make things work the way we wanted to.
- We had access to everything, from the network (hell, even the light :) ) to the JS. This gave us optimisation options not available to a lot of people. Network topologies, buying the right CPUs (we benchmarked them), configuring the CPUs, etc. etc.
[0] An Isilon storage engineer once dropped one of these on the floor, while it was powered on and the drives were running. It took out a metal floor tile.
[1] They were a PITA, everyone hated them. There was a running joke among the sysadmins that when we decommissioned them we would take one on an Ops meet and use it as target practice.