by madrox 1040 days ago
It's important to remember that before you could burst your infrastructure into the cloud, sites simply went offline during events like this. You actively lost revenue in those cases.

Or you could just design your architecture to not perform trillions of database requests for hundreds of millions of sales.

The listing data is almost static and should almost fit in RAM (the hot set probably does). Apparently Amazon has ~350M listings; a server with 24TB of RAM could give ~68kB/listing, and probably only a small fraction is hot. Since you'll need multiple servers anyway, you could shard on products and definitely fit things in RAM. 375 million sales, even if condensed into 1 hour, would only be ~104k/second. A single db server should be able to handle the cart/checkout. Assuming ~10M page views/second, a couple of racks of servers should be able to handle it.
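To make the back-of-envelope numbers above easy to check, here is a minimal Python sketch. The inputs are the figures quoted in the comment (350M listings, 24TB of RAM, 375M sales in one hour, 10M page views/second); treating 24TB as decimal terabytes is an assumption, and the variable names are just for illustration.

```python
# Figures quoted in the comment; none of these are measured values.
LISTINGS = 350_000_000
RAM_BYTES = 24 * 10**12            # 24 TB, assuming decimal terabytes
SALES = 375_000_000
WINDOW_SECONDS = 3600              # all sales condensed into one hour
PAGE_VIEWS_PER_SECOND = 10_000_000

bytes_per_listing = RAM_BYTES / LISTINGS
sales_per_second = SALES / WINDOW_SECONDS
conversion = sales_per_second / PAGE_VIEWS_PER_SECOND

print(f"{bytes_per_listing / 1000:.1f} kB per listing")    # 68.6 kB per listing
print(f"{sales_per_second:,.0f} sales/second")             # 104,167 sales/second
print(f"{conversion:.2%} implied conversion rate")         # 1.04% implied conversion rate
```

The implied conversion rate of roughly 1% is just the ratio of the two assumed rates, not something the estimate derives independently.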

The ad/tracking infrastructure surely can't account for the 1000x disparity in resource usage.

I think you're forgetting that Amazon doesn't have a 100% conversion rate...
I'm not. That's why I threw out 10M page views per 100k purchases. Maybe 1% is an overestimate of conversions, but I imagine a 48x multiplier to average traffic is an overestimate of peak traffic, so it balances out. It would be interesting to know the actual peak number of user actions/second though.
Are you saying sites went offline pre-cloud because engineers were simply bad at design?
More importantly, "pre-cloud" means years ago and therefore older hardware, but also yes, software mostly isn't written to be high performance.

Modern NVMe drives get 1000x the performance of hard disks from 10 years ago. You can buy one that can fit the entire Reddit text database for $150 now. 10 years ago you'd be looking at a high six-figure SAN appliance from IBM or EMC to get the kind of performance my desktop has now. You can have TBs of RAM and 100+ cores in a server now. You can get 400 Gb/s networking now, and some people even have 10 Gb/s home internet. You could basically run some of the biggest sites from 10 years ago out of your closet these days.

Some software has also improved a lot in the last 10 years. Things like io_uring are great. Green threads are great. Postgres is super fast these days, and it keeps getting faster. My old quad-core computer with a SATA drive can already do ~60-70k requests/second with a Scala web app and Postgres. That's without even using graal or loom or trying to screw around with core affinity.

If anything, the cloud scales poorly. In practice, people end up using vastly underpowered VMs and then scaling them horizontally, which introduces a ton of overhead (computationally and management-wise). RDS gets you like 3000 IOPS baseline, and increasing that to the level of a single NVMe drive will cost one employee's firstborn child each month, so people end up with this mistaken belief that RDBMSs are slow or don't scale.

AWS will provide you with reference architectures that use lambdas for web requests and advertise their "scalability" [0], but the API only lets you serve 1 request/lambda at a time, and according to their docs, you can only have up to "tens of thousands" of concurrent lambdas [1]. That would also require tens of thousands of connections to your db, which kills it, and it doesn't let you batch things unless you first put your work onto something like an SQS queue and have a separate db worker lambda pick up batches. That's more infra to manage (and more $$$) instead of writing a dozen lines of code to add an in-memory work queue, and you end up needing to write more code to deal with sending work/status across the system anyway.

So my old i5 with an SSD ends up scaling better than their "well-architected" "scalable" serverless solution. AWS will happily give you plenty of this kind of advice that will lead to a slow, expensive, large system.
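For a sense of what the in-memory work queue mentioned above could look like, here is a rough Python sketch, not anyone's production code. `BatchingQueue`, `flush`, `max_batch`, and `max_wait` are hypothetical names; `flush` stands in for whatever writes a whole batch at once (for example a single multi-row INSERT over one db connection).

```python
import queue
import threading
import time

_STOP = object()  # sentinel that tells the worker to finish up


class BatchingQueue:
    """Collects items in memory and hands them to `flush` in batches."""

    def __init__(self, flush, max_batch=500, max_wait=0.01):
        self._q = queue.Queue()
        self._flush = flush            # hypothetical callback: takes a list of items
        self._max_batch = max_batch    # upper bound on items per batch
        self._max_wait = max_wait      # seconds to wait for a batch to fill
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, item):
        """Called by request handlers; returns immediately."""
        self._q.put(item)

    def close(self):
        """Flush everything submitted so far, then stop the worker."""
        self._q.put(_STOP)
        self._worker.join()

    def _run(self):
        while True:
            first = self._q.get()      # block until there is work
            if first is _STOP:
                return
            batch, stopping = [first], False
            deadline = time.monotonic() + self._max_wait
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    item = self._q.get(timeout=remaining)
                except queue.Empty:
                    break
                if item is _STOP:
                    stopping = True
                    break
                batch.append(item)
            self._flush(batch)         # one write for the whole batch
            if stopping:
                return


# Usage: 1000 "requests" end up as a much smaller number of batched writes.
written = []
bq = BatchingQueue(written.append, max_batch=100, max_wait=0.05)
for i in range(1000):
    bq.submit(i)
bq.close()
print(sum(len(b) for b in written), "items in", len(written), "batches")
```

A single worker thread keeps it to one db connection and preserves submission order; reporting per-item status back to callers would need extra plumbing that this sketch leaves out.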

The one big upside of AWS is that if you do need to manage a lot of servers (like you are in the IoT space and need to handle millions of requests per second), they have good tools for doing that. Multi-region redundancy is also a click of a button if you need that. But they normalize overbuilding (and thus needing that management) way before it's necessary.

[0] https://docs.aws.amazon.com/wellarchitected/latest/serverles...

[1] https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-...

Depending on the margins that could be preferable.