by mseebach 3394 days ago
20 years ago, sites got "slashdotted" left, right and centre. Cloudflare and friends put an end to that, forever. Sites went offline for days if not weeks because of hardware failure -- some never came back because they didn't back up correctly. S3 went offline for five hours, but didn't lose a single bit of data.

You're looking at the past through rose-tinted glasses. We learn all the time, and we will probably also learn to build some resilience into our systems against these issues. But as someone who's been through a thing or two (see above) on the Internet of yesterday, I like the one we have today.

4 comments

Cloudflare had nothing to do with ending slashdotting. This stopped being a problem years before they existed.

Slashdotting was mostly a problem caused by Apache's incredibly inefficient design. It consumed huge amounts of memory per connection at a time when most of us had very slow connections. A link from Slashdot was, in effect, a Slowloris attack on your server.

The big change was moving from a fork/thread-based webserver (Apache) to an event-based webserver (nginx), which was made even more efficient by kernel features like epoll.

Sorry to be blunt, but this is just incorrect.

The problem with "Slashdotting" was the number of concurrent connections. Heck, a fair portion of the time it was the database that keeled over first, not Apache.

Slowloris attacks send purposefully incomplete requests and hold them open with additional headers. Even with dial-up modems, connections were never slow enough for this to be a problem with actual requests, which are lightweight.

Responses are heavy and can tie up slow connections, especially if they have to go get stuff out of the database. But in that case it's no longer a Slowloris type attack. It's just too many concurrent connections.

The Slashdot effect was solved with static HTML caching, simply because caches are faster and don't touch the DB. Cloudflare is a simple, free example of such a cache, although certainly not the only one.

Bluntness is okay, but aren't you wrong about me being wrong here?

I didn't say it was a Slowloris attack. I said Slashdotting was "in effect" the same thing. Which it is: both problems come down to exhausting limited concurrency.

> The problem with "Slashdotting" was the number of concurrent connections.

Exactly the same problem a Slowloris attack exploits.

> Responses are heavy and can tie up slow connections...

Yes, responses tie up the limited number of available httpd processes.

The problem was that Apache couldn't even serve static files to many clients because of its heavyweight httpd processes and the fact that clients were so slow.

If your web server can only handle 200 concurrent connections, and you want to serve a 500 KB screenshot of your 1337 Linux desktop to clients that download at 3.5 KB/s, each download holds a connection for about 143 seconds, so you can handle only ~1.4 req/s. Doesn't take much to get Slashdotted.
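For anyone who wants to check that arithmetic, here is a quick back-of-the-envelope sketch. The function name is made up for illustration, and it assumes every connection slot stays occupied for the whole download, which is the worst case described above.

```python
def max_requests_per_second(max_connections: int,
                            response_kb: float,
                            client_kb_per_s: float) -> float:
    """Upper bound on throughput when each connection is held for a full download."""
    # Time one slow client keeps a connection (and its httpd process) busy.
    seconds_per_response = response_kb / client_kb_per_s
    # With every slot occupied, this many responses complete per second.
    return max_connections / seconds_per_response


# Numbers from the comment: 200 connections, 500 KB file, 3.5 KB/s clients.
rate = max_requests_per_second(200, 500, 3.5)
print(f"{rate:.1f} req/s")  # prints "1.4 req/s"
```

Raising the connection limit 10x moves the ceiling to about 14 req/s on the same numbers, which is why the per-connection cost mattered so much.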

Whereas, event-based webservers could handle at least 10x more connections on the same hardware even before epoll existed.

I had this problem in 1998 and fixed it with select/poll based servers, and then eventually other epoll-based servers before nginx existed.
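Roughly what "event-based" means here, as a minimal sketch and not anything resembling the servers I actually ran: one process, one loop, and each slow client is just a registered socket instead of a whole process. It uses Python's `selectors` module, which picks epoll, kqueue, poll or select depending on the platform. The helper names (`make_listener`, `poll_once`) and the canned response are made up for illustration.

```python
import selectors
import socket

# Canned reply; a real server would parse the request and stream a file.
RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket (port 0 lets the OS pick one)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    sock.setblocking(False)
    return sock


def poll_once(sel, listener, timeout=0.1):
    """Run one iteration of the event loop; return the number of ready sockets."""
    events = sel.select(timeout)
    for key, _mask in events:
        if key.fileobj is listener:
            # New client: register it and go back to waiting. No fork, no thread.
            conn, _addr = listener.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            conn = key.fileobj
            data = conn.recv(4096)
            if data:
                # Tiny payload, so one send() fits in the socket buffer. A real
                # server would buffer partial writes and wait for EVENT_WRITE.
                conn.send(RESPONSE)
            sel.unregister(conn)
            conn.close()
    return len(events)


# Usage (runs until interrupted):
#   sel = selectors.DefaultSelector()
#   listener = make_listener(port=8080)
#   sel.register(listener, selectors.EVENT_READ)
#   while True:
#       poll_once(sel, listener)
```

The point is that an idle or slow connection costs one file descriptor and a little bookkeeping here, rather than a full httpd process sitting blocked on a write.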

If we go back to 1998, maybe network throughput was the limiting factor that drove up concurrent connections and killed servers. But I also don't think we can say nginx solved that, since nginx didn't start seeing wide usage until about 10 years later.

I guess I wrote what I did because the comparison to Slowloris seemed to over-emphasize the importance of handling high numbers of concurrent connections, since that is the only mitigation for Slowloris.

But, for a flood of real traffic, concurrent connections and throughput are related. The faster your web server can serve responses, the fewer concurrent connections it will need to handle. And as the percentage of dynamic DB-backed sites has increased over time, so has the value of caching. Basic page caching can speed up a WordPress blog by hundreds of times for unauth'd users, for example. For most little sites, implementing caching will get them more than installing nginx.
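To make "basic page caching" concrete, here is a minimal sketch of the idea, not how WordPress or any particular plugin does it. `PageCache` and its `render` callback are hypothetical names; the assumption is that `render(path)` is the slow, DB-backed page build and that the cache is only consulted for logged-out users.

```python
import time


class PageCache:
    """Tiny in-memory full-page cache with a time-to-live."""

    def __init__(self, render, ttl_seconds=60.0, clock=time.monotonic):
        self._render = render      # slow function: path -> HTML (hits the DB)
        self._ttl = ttl_seconds    # how long a cached page stays fresh
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # path -> (expires_at, html)

    def get(self, path):
        """Return cached HTML if still fresh, otherwise render and store it."""
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no DB work at all
        html = self._render(path)  # cache miss: pay the full cost once
        self._entries[path] = (now + self._ttl, html)
        return html
```

With a 60 second TTL, a flood of hits on one linked page costs one render per minute instead of one per request, and the DB never sees the rest.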

And really, what good are valid concurrent connections if the throughput isn't there? For most users, a site that waits 5 minutes on a blank page is no better than a server that's down.

One of the main causes of that had a one-line fix in the Apache config:

 	KeepAlive Off
Yup!
> This stopped being a problem years before they existed.

I still see this issue happening daily here on HN.

And every time it happens 10 people jump to comment about how the site is written poorly and could probably be an entirely static site.
This sounds plausible, but is there any way to validate this with some kind of empirical evidence?
Some website being Slashdotted is not in the same class of issues as users' (possibly encrypted) page content being spread all over the web. Also, the notion that you either need to use some super-centralized platform or be susceptible to the Slashdot effect or losing backups is a false dichotomy.
> Sites went offline for days if not weeks because of hardware failure -- some never came back because they didn't back up correctly. S3 went offline for five hours, but didn't lose a single bit of data.

S3 going offline for five hours probably caused orders of magnitude more damage. The reason is simple: the stakes are higher today than they were 20 years ago. Back then the overwhelming majority of the business in most companies was conducted on paper and in person. Losing a website for a few days, or even a few weeks, wasn't a big deal. Today though? There are so many companies that host business-critical operations and infrastructure in the cloud that it's hard to fathom how they are coping with being taken offline for several hours in the middle of the week.

Wait. How does Cloudflare prevent a site from going offline for weeks because of hardware failure?
S3 is the reference here, not Cloudflare.
Nonetheless, Cloudflare caches sites and may serve a cached version of your site when it's down, assuming it's "static" enough to be readable from cache.