by DaGardner 1740 days ago
Yes it does: https://stackexchange.com/performance

Pretty impressive I think.


No it doesn’t. From your link:

• 9 web servers

• 4 SQL servers

• 2 Redis servers

• 3 tag engine servers

• 3 Elasticsearch servers

• 2 HAProxy servers

That comes to 23. I know “a couple” is sometimes used to mean more than two, but… not that much more than two.

“A couple” is just flat-out wrong. I’d guess that he’s misreading ancient figures, from no later than about 2013, for how many web servers they needed to cope with the load. That ignores the other server types (presently more than half the total) and the many extra servers they keep for headroom, redundancy and future-readiness.
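For anyone who wants to check the arithmetic, here is a quick sketch that tallies the counts quoted above. The dictionary keys are just labels I picked for illustration; the numbers are the ones from the list in this comment, not anything fetched from the linked page.

```python
# Server counts as quoted in the list above (labels are illustrative).
servers = {
    "web": 9,
    "SQL": 4,
    "Redis": 2,
    "tag engine": 3,
    "Elasticsearch": 3,
    "HAProxy": 2,
}

total = sum(servers.values())
non_web = total - servers["web"]

print(f"total servers: {total}")
print(f"non-web servers: {non_web} ({non_web / total:.0%} of the total)")
```

That prints a total of 23, with 14 non-web servers (61% of the total), which is where "more than half" comes from.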

One interesting aspect is that the number of servers is much higher than what would actually be needed to run the site; most servers run at something like 10% CPU or lower. Most of the duplication is for redundancy. As far as I remember, they could run SO and the entire network on two web servers and one DB server (and, I assume, one each of the other kinds as well).

If someone says SO runs on a couple of servers, this might be about the number actually necessary to run it with full traffic, not the number of servers they use in production. This is a more useful comparison if the question is only about performance, but less useful if you're comparing what it takes to operate the entire thing.

IIRC, without an emergency redeploy, they might have issues running on fewer than 4: I'm not sure the tag server can coexist with the web server anymore, for example; Redis is still a dependency, and so is HAProxy; SQL and IIS are separated, etc.

Then there are support services (IIRC, all of Elasticsearch was non-functional-requirements stuff that they could technically run without?) and HA.

23 is not a lot of servers.

That is still doable with mid-90s era hand management of servers (all named after characters in Lord of the Rings).

Not that you should, but you could.

And the growth rate must be very low, so it's pretty easy to plan out your OS upgrade and hardware upgrade tempo.

And it was actually possible to manage tens of thousands of servers before containers. The only thing you really need is what they now call a "cattle not pets" mentality.

What you lose is the flexibility of shoving software around programmatically to other bits of hardware to scale or fail over, and you'll need to overprovision some, but even if half of SO's infrastructure is "wasted", that isn't a lot of money.

And if they're running that hardware lean in racks in a datacenter that they lease and they're not writing large checks to VMware/EMC/NetApp for anything, then they'd probably spend 10x the money microservicing everything and shoving it all into someone's kubernetes cloud.

In most places though this will fail due to resume-driven design and you'll wind up with a lot of sprawl because managers don't say no to overengineering. So at SO there must be at least one person in management with a cheap vision of how to engineer software and hardware. Once they leave or that culture changes the footprint will eventually start to explode.

Most of that is extra unused capacity. They've shared their load graphs and past anecdotes where it's clear the entire site runs very lean.

Also 23 is very much a couple for a company and application of that size. It's not uncommon to see several hundred or thousands of nodes deployed by similar sites.

Two of their servers have 1.5 TB of RAM each. Just one of those nodes is probably as powerful and expensive as 100 nodes in a thousand node setup.

They aren't magically more efficient than other sites. They just chose to scale vertically instead of horizontally.

> "They aren't magically more efficient than other sites"

It's certainly not magic but good architecture decisions and solid engineering. This includes choosing SQL Server over other databases (especially when they started), using ASP.NET server-side as a monolithic app with a focus on fast rendering, and yes, scaling vertically on their own colo hardware. The overall footprint for the scale they serve is very small.

It's the sum of all these factors together, and it absolutely makes them more efficient than many other sites.

Exactly. That twitter thread is just pure rage based on no data. Sum up resources from that page - we are talking around 6500GB* of RAM worth of servers. That is no homelab.

* Maybe a bit more or less, because it's not clear to me whether the DB RAM figure is per server or per cluster. Likely per server, as for the other servers. There is also no data on how big their HAProxy servers are.

And yet the main point stands: they don't need K8s to manage this application running on 23 servers.

No one needs k8s. Bringing up their infrastructure in a k8s troubleshooting how-to was a weird thing to do in the first place. It's comparing apples and chandeliers; it makes no sense.

They have a typical vertically scaled infrastructure, most services have just two nodes, one active. The biggest ones are databases which in many companies are handled in "the classic way" anyway. Clearly it's not designed as microservices and doesn't need dynamic automation at all. Why on earth would they even bring k8s up in their plans?

And yet it wouldn't be out of place either.

Nevertheless, it is true that Stack Overflow has focused on backend performance and scaled vertically a long way, further than is fashionable. Just not so far as only using two servers for everything.

Because they're constrained by Windows and Microsoft licensing, scaling out was never an easy option for them.

I'm curious. I saw a similar comment earlier; surely Windows licensing is just a drop in the bucket compared to the rest of the infrastructure costs?

I've not really looked at hosting anything on Windows before. Do they have unusual licensing terms that would make it a significant cost?

What constraints? Windows Server licenses are bought per-core and the company can easily afford plenty more. This is a non-issue.

But adding a new server includes having to buy new licenses, which is a consideration you don't have with OSes that aren't under licensing. It costs extra money, and licensing used to be per socket when their infrastructure was conceived.

>Because they're constrained by Windows

How? Didn't they migrate to .NET Core?

Did they? I must have missed it, but it seems so:

https://www.infoq.com/news/2020/04/Stack-Overflow-New-Archit...

That doesn't mean they've moved away from Windows servers hosting it though.

It is impressive, but it's not a raspberry pi kind of setup. Just two of those "couple" are hot and standby DB servers with 1.5TB RAM. That infrastructure is scaled A LOT vertically.