by ksec 1337 days ago
>We had a whole load of racks

A whole load of Racks for running Basecamp? We are talking about 42U per Rack, and a total of 420U of Servers?

The scale seems quite massive, at least compared to the perception I had of Basecamp. It would be nice to see those specifications, see how much of an improvement is possible 10 years later, and whether we could fit all of it into 2-3 Racks.


*whoops, this got long*

Basecamp started on one single Rackspace server, before I was there. I started at 20 people, left at ~55.

When I left I don't remember exactly how many racks there were, but more than 10 and fewer than 30 in the primary location, with 42U of servers in each. There was a mix of a whole load of (Dell, never pay list price!) blade servers, DB appliances (~12 in total across ~6 apps ISTR), Isilon storage[0][1], F5 kit, Juniper routers etc. etc. We had some epically fast storage in some of the servers for the time, way faster than SSDs.

Later we added two more sites. One in I think Virginia, one in NY. The one in Virginia was a replica of what we needed to run Basecamp, the one in NY was a half-rack data replication location (I think I got that the right way round). We had 10G fibre (we rented wavelengths not the whole fibre) between each location. We could lose one DC and remain RW for our block data, 2 DCs and we'd have to drop down to RO. Block data was things like uploads, so DBs, search etc. wouldn't have been affected. We could lose one of the /main/ DCs and still be RW for everything.
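The block-data rule above (lose one DC and stay RW, lose two and drop to RO) can be written down as a tiny function. This is only a sketch of the rule as described, assuming three sites in total; `Mode` and `block_data_mode` are made-up names for illustration, and none of this is the actual mechanism used.

```python
from enum import Enum

TOTAL_SITES = 3  # primary + Virginia + NY, as described above


class Mode(Enum):
    READ_WRITE = "RW"
    READ_ONLY = "RO"
    UNAVAILABLE = "down"


def block_data_mode(sites_up: int) -> Mode:
    """Mode for block data (uploads etc.) given how many sites are reachable."""
    if not 0 <= sites_up <= TOTAL_SITES:
        raise ValueError("sites_up must be between 0 and TOTAL_SITES")
    lost = TOTAL_SITES - sites_up
    if lost <= 1:
        return Mode.READ_WRITE   # all sites up, or one DC lost: still RW
    if sites_up >= 1:
        return Mode.READ_ONLY    # two DCs lost: drop down to RO
    return Mode.UNAVAILABLE      # nothing left to serve from
```

With three sites this happens to look like a majority rule (two of three reachable means RW), but that reading is an assumption on top of what was said.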

With all this kit we were able to run both main DCs hot. With our Geo DNS you could hit either of our DCs and you'd get served pages, you could even write to both locations. One DC was always the "RO" DC, it always replicated the databases. If you tried to write to that DC we proxied your request to the RW DC over our 10G links and proxied any more requests you made for n seconds to the RW DC too, at which point we reverted you back to the RO DC.
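The "proxy your writes, then keep you on the RW DC for n seconds" behaviour can be sketched roughly like this. It is a minimal illustration in Python, not the real implementation (which lived in the Juniper/F5/OpenResty layer); `StickyRouter`, `pin_seconds` and the set of write methods are all assumptions. Presumably the pin existed so you could read your own writes before replication caught up, but that is a guess.

```python
import time
from typing import Callable, Dict

# Assumption: any of these HTTP methods counts as a write.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


class StickyRouter:
    """Decides, for a request arriving at the RO DC, which DC should serve it."""

    def __init__(self, pin_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._pinned_until: Dict[str, float] = {}

    def route(self, client_id: str, method: str) -> str:
        now = self.clock()
        if method.upper() in WRITE_METHODS:
            # Writes always go to the RW DC and (re)start the pin window.
            self._pinned_until[client_id] = now + self.pin_seconds
            return "RW"
        expiry = self._pinned_until.get(client_id)
        if expiry is not None:
            if now < expiry:
                return "RW"  # still inside the window: keep proxying to RW
            del self._pinned_until[client_id]  # window over: revert to RO
        return "RO"
```

A client that has only ever read gets "RO". After a write it gets "RW" for every request until `pin_seconds` has elapsed, then falls back to "RO". Passing in the clock makes the window easy to test without sleeping.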

Now, NY to Virginia isn't that far, so why bother with the hot/hot config? Because it played into the rest of our plan, which was DC failover. With some pretty epic voodoo (Juniper/F5/OpenResty etc.) we could fail over the datacentres, swapping the RO and RW locations. We could also do this if one of the locations was unavailable. We could do this in 4 seconds /without losing a single in-flight request/ (we tested it).

This ended up a bit longer than I was intending, but it illustrates a few things:

- I think people underestimate Basecamp. It's /huge/ (money and users, not employees). Not so much in the tech world (anymore), but even with this kit, even with the (at the time 6) sysadmins that maintained it, it still made a shit load of cash. I guesstimated the net worth of the two owners as in the hundreds of millions of dollars each, entirely because of Basecamp. Cash that, as a privately owned company, all went to the owners (who gave some of it to us, they treated us fairly well). I think it was Patio11 who said that people underestimate the market for software. It's easy to do, these aren't human-scale numbers.

- The owners are right about keeping it private. You might not make "larger yacht than Larry Ellison" money, but you sure as hell have a lot higher chance of making "pretty damn big yacht and not having to work again" sort of money. If I was rolling the dice I know which I'd gamble on.

- I don't think I ever actually calculated it, but the amount of money our servers were worth while running was /immense/. When you went to the DC and looked at the row of racks, you could kinda see just how dense the $ value there was. The efficiencies that the cloud brings give money to AWS, not to the clients.

- All of this was still /way/ cheaper than running cloud infra. I'm not against the cloud (I use it), but we made more because we had this infra than if we were on the cloud. We were dense when it came to # customers per $ spent on infra.

- The flexibility we had because we controlled everything was also extraordinary. It came at a cost, we had to do it ourselves, but we had a /lot/ of power to make things work the way we wanted to.

- We had access to everything, from the network (hell, even the light :) ) to the JS. This gave us optimisation options not available to a lot of people. Network topologies, buying the right CPUs (we benchmarked them), configuring the CPUs, etc. etc.

[0] An Isilon storage engineer once dropped one of these on the floor, while it was powered on and the drives were running. It took out a metal floor tile.

[1] They were a PITA, everyone hated them. There was a running joke among the sysadmins that when we decommissioned them we would take one on an Ops meet and use it as target practice.

That is about the amazon.com architecture as late as 2006 (only at a slightly higher scale: three entire datacenters scattered around Virginia and, I think, dual 40G links between datacenters, but other than that exactly the same principles).

A lot of people think that DC failover needs to be east coast / west coast, but you can achieve most of your DC redundancy goals with sites separated by 100 miles or less and get a lot lower latency, higher bandwidth and lower costs. You might want to think about different geographic flood plains and different power companies / grids.

Could still nerd fight about EMPs from nuclear war and a sufficiently massive hurricane or an earthquake out here on the west coast, but at some point you need to accept some risks.

The reason I asked was that I remember DHH stating they were doing 2K RPS with 30 App Servers in 2015 ~ 2016, and they were on one Primary DB (at least that was what I jotted down in my notes). I was assuming they could fit multiple "App Server" nodes inside a single 1U blade. But even if it was 1U per App Server, that would be 30U plus likely a powerful 4U DB monster, along with probably some cache instances.

Even if the number above does not include redundancy, that still only makes it 2 Racks of Servers with room to spare.

30 Racks is a lot. What am I missing here? Apart from Storage.
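For what it's worth, the back-of-envelope estimate above works out like this. It is only a sketch using the numbers quoted in the comment (30 app servers at 1U each, one 4U DB, 42U racks); the straight 2x for redundancy is an assumption, and cache, storage and network kit are deliberately left out.

```python
import math

RACK_UNITS = 42  # usable U per rack, as stated in the thread


def racks_needed(total_u: int, rack_units: int = RACK_UNITS) -> int:
    """Smallest number of racks that can hold total_u rack units."""
    return math.ceil(total_u / rack_units)


app_u = 30 * 1                  # 30 app servers, worst case 1U each
db_u = 4                        # one 4U primary DB
one_copy = app_u + db_u         # 34U for a single copy
with_redundancy = one_copy * 2  # 68U, assuming everything is simply doubled

print(racks_needed(one_copy))          # 1
print(racks_needed(with_redundancy))   # 2
print(10 * RACK_UNITS)                 # 420 (the "420U" figure is 10 full racks)
print(round(2000 / 30, 1))             # 66.7 RPS per app server at 2K RPS
```

So the app tier plus one DB fits in 2 racks even doubled, which is why the gap up to "more than 10" has to come from everything that is not app servers.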

The place I work (CTO for now, looking for opportunities pretty soon) can sustain nearly 400 RPS through a Rails app on a single Performance-L Heroku dyno without breaking a sweat (though we run two minimum), but that's the tip of the iceberg. The infra to support those web servers is way more than that: Aurora Postgres x 3 for now, a large ES cluster, 2x Redis clusters, memcached etc.

Bear in mind that we had I think 6 apps. Basecamp 1, 2 and 3 (all separate infra), Highrise, Campfire, some other internal stuff. Our ES cluster was pretty damn big, Redis and memcached too. Juniper, F5, network switches (rack infra) etc. Storage was pretty big, quite a few 4U servers with spinning rust. The blade servers were I think 6 blades in 2U.

Everything was redundant, everything had 2x or more. I honestly can't remember how many racks we had. "More than 10" is my hand-wavey guess.

I don't think DHH was being disingenuous with his "2k on 30 servers" message, it was likely more about the scalability of Rails rather than the infra required to run the app.

Do you guys define racks the same way?
I think so, 42U of full-depth server capacity.