by samirageb 4182 days ago
It's probably worth noting a few things about StackExchange's uniqueness for those who may not know.

StackExchange runs on physical hardware, and they have spent a considerable amount of time optimizing for bare-metal performance. Their team is unique in that they embraced hardware from the start, vs. many teams today that want hardware abstracted away. They have more maintenance overhead around hardware management, but they don't lose IOPS to IaaS (AWS/Azure) overhead.

Their current environment required overcoming tremendous technical hurdles on earlier versions of SQL Server (these might be general RDBMS limitations as well). Luckily they were able to get the Microsoft SQL team engaged to get them through this.

Finally, their team was world class. BrentOzar, SamSaffron, MarcGravel (and others) are highly respected members of the SQL & .NET community.

It's easy to look at their setup and say "Wow, that's not a lot." and overlook the circumstances & talent required to achieve such an efficient system. I'm not sure many teams would pursue this architecture if they knew the effort (and luck) involved.

> Their current environment required overcoming tremendous technical hurdles on earlier versions of SQL Server

Jeff Atwood's goal was always "performance is a feature," and that means rendering your pages faster than anybody else's. With that in mind, you're always going to have technical hurdles to overcome on any platform - because you want to get faster results than anybody else is getting.

I remember when I first got involved, and Jeff told me something along the lines of, "This slow query runs in ~500ms, and we want it to run in ~50." That statement alone is a huge leap over what a lot of RDBMS users say - usually when people refer to a slow query, their unit of measurement is whole seconds, not milliseconds. They were serious about performance from the get-go.

> Luckily they were able to get the Microsoft SQL team engaged to get them through this.

Nothing against Microsoft - I love 'em dearly and make a great living with their tools - but Stack's success is much more due to their own internal team's dedication to performance. When we open bug requests with Microsoft, the time to resolution is typically measured in weeks or months. During that time, Stack Exchange's team has to come up with creative, impressive workarounds. They're the sharpest SQL Server users I know.

> Finally, their team was world class. BrentOzar, SamSaffron, MarcGravel (and others) are highly respected members of the SQL & .NET community.

Awww, shucks, but I'm not brown-nosing Stack when I say that their current team is ridiculously good. Their Site Reliability Engineers know more about SQL Server than most career DBAs I know.

> I'm not sure many teams would pursue this architecture if they knew the effort (and luck) involved.

It sounds like you're implying that other architectures are faster by default and with less effort, and I would disagree there. I haven't seen any big sites (big being top-100-in-the-world[1] type stuff) where the persistence layer was set-it-and-forget-it. Scale - especially millisecond performance at scale - is seriously hard work on all platforms.

[1] https://www.quantcast.com/top-sites

Thanks for the reply, Brent. A few points I should clarify.

> Luckily they were able to get the Microsoft SQL team engaged to get them through this.

Looking at this statement now, I see that it might have appeared that Microsoft 'came to the Stack team's rescue.' My impression was that over time Microsoft alleviated some of the workarounds the SO team was running into.

> I'm not sure many teams would pursue this architecture if they knew the effort (and luck) involved.

I was simply stating that the SO architectural graphic looks deceptively simple, and I can only imagine the amount of drama on hardware alone the team went through to achieve their goals. I do believe there are other OSS-based architectures (with likely more layers) that would require less 'workaround' effort to deliver reasonable performance/reliability, but compounded with SO's likely SLA & perf requirements on a closed-source RDBMS? It wouldn't surprise me to see some go 'good enough' and move on. I doubt anything is Ron Popeil easy when you're trying to shed milliseconds on a persistence layer.

Hope this clears things up a bit.

I wonder how many hotfixes MS ended up creating, for example.

We've only gotten 2 that I can remember in 4 years here. One was a regression in .NET 4.0 with high-load hashtable inserts (which was previously fixed in 3.5 SP1) that they had to re-fix. The other was a SQL Server issue in 2014 CTP1, which we were running, with an availability groups edge case on high-volume/high-latency secondaries.

Unless we're testing beta software, the MS stuff generally works and works very well. We of course work with them on feature requests that will make life better - some happen, some don't. I'm of the opinion you should try to have this relationship with any vendor or open source project. We're trying to make what we use better, and not just for us.

Nick Craver - Stack Exchange SRE

The collective gasp of HN about a very well-tuned, well-performing site on the .NET stack is pretty disturbing.