Hacker News
by rurp 1040 days ago
> $102 million in infrastructure spend for an event that brought in over $12.7 billion in sales isn’t the worst return on investment that companies could make — by a landslide!

Well, it's not amazing if your margins are tiny, as they are in many industries (such as retail). Plus this was almost certainly architected by some of the foremost AWS experts in the world. It's verrrry easy to spend vastly more than was strictly necessary in AWS.

I don't mean to be too negative though, it was a really interesting article. Pretty wild to think about spending $100m on infrastructure over two days and still making a bunch of profit.

3 comments

Important to remember that, before you could burst your infrastructure in the cloud, sites simply went offline in events like this. You actively lost revenue in those cases.
Or you could just design your architecture to not perform trillions of database requests for hundreds of millions of sales.

The listing data is almost static and should almost fit in RAM, and the hot set probably does. Apparently Amazon has ~350M listings; a 24TB RAM server could give ~68kB/listing, and probably only a small fraction is hot. Since you'll need multiple servers anyway, you could shard on products and definitely fit things in RAM. 375 million sales, even if condensed into 1 hour, would only be 104k/second. A single db server should be able to handle the cart/checkout. Assuming ~10M page views/second, a couple racks of servers should be able to handle it.
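For anyone who wants to poke at those numbers, here is a quick sketch of the arithmetic. Every input is one of the rough guesses from the comment above (the 1% conversion rate is the assumption that turns ~104k sales/second into ~10M page views/second), so treat it as a sanity check rather than a measurement.

```python
# Back-of-envelope check of the figures in the comment above.
# All inputs are rough guesses taken from the thread, not measured values.

LISTINGS = 350_000_000        # ~350M listings
RAM_BYTES = 24 * 10**12       # one 24 TB server (decimal terabytes)
SALES = 375_000_000           # sales over the whole event
WINDOW_SECONDS = 3600         # pessimistically squeezed into one hour
CONVERSION_RATE = 0.01        # assumed: 1 purchase per 100 page views

bytes_per_listing = RAM_BYTES / LISTINGS
sales_per_second = SALES / WINDOW_SECONDS
page_views_per_second = sales_per_second / CONVERSION_RATE

print(f"{bytes_per_listing / 1000:.1f} kB of RAM per listing")
print(f"{sales_per_second:,.0f} sales/second")
print(f"{page_views_per_second:,.0f} page views/second")
```

Changing CONVERSION_RATE or WINDOW_SECONDS shows how sensitive the page-view estimate is to those two guesses.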

The ad/tracking infrastructure surely can't account for the 1000x disparity in resource usage.

I think you're forgetting that Amazon doesn't have a 100% conversion rate...
I'm not. That's why I threw out 10M page views per 100k purchases. Maybe 1% is an overestimate of conversions, but I imagine a 48x multiplier to average traffic is an overestimate of peak traffic, so it balances out. It would be interesting to know the actual peak number of user actions/second though.
Are you saying the reason why sites went offline pre-cloud was because engineers were simply bad at design?
More importantly, "pre-cloud" means years ago and therefore older hardware, but also yes, software mostly isn't written to be high performance.

Modern NVMe drives get 1000x the performance of hard disks from 10 years ago. You can buy one that can fit the entire reddit text database for $150 now. 10 years ago you'd be looking at a high six-figure SAN appliance from IBM or EMC to get the kind of performance my desktop has now. You can have TBs of RAM and 100+ cores in a server now. You can get 400 Gb/s networking now, and some people even have 10 Gb/s home internet. You could basically run some of the biggest sites from 10 years ago out of your closet these days.

Some software has also improved a lot in the last 10 years. Things like io_uring are great. Green threads are great. Postgres is super fast these days, and it keeps getting faster. My old quad core computer with a SATA drive can already do ~60-70k requests/second with a Scala web app and Postgres. That's without even using GraalVM or Loom or trying to screw around with core affinity.

If anything, the cloud scales poorly. People in practice end up using vastly underpowered VMs, and then horizontally scale them, which introduces a ton of overhead (computationally and management-wise). RDS gets you like 3000 IOPS baseline and increasing that to the level of a single NVMe drive will cost one employee's firstborn child each month, so people end up with this mistaken belief that RDBMSs are slow or don't scale. AWS will provide you with reference architectures to use lambdas for web requests and advertise their "scalability" [0], but the API only lets you serve 1 request/lambda at a time, and according to their docs, you can only have up to "tens of thousands" of concurrent lambdas[1]. That would also require 10s of thousands of connections to your db, which kills it, and doesn't let you batch things unless you first put your work onto something like an SQS queue and have a separate db worker lambda pick up batches. More infra to manage (and more $$$) instead of writing a dozen lines of code to add an in-memory work queue, and you end up needing to write more code to deal with sending work/status across the system anyway. So my old i5 with an SSD ends up scaling better than their "well-architected" "scalable" serverless solution. AWS will happily give you plenty of this kind of advice that will lead to a slow, expensive, large system.
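To make the "dozen lines of code" point concrete, here is a minimal sketch of an in-memory work queue that batches writes inside one process. The names `batch_worker`, `flush`, `max_batch`, and `max_wait` are made up for illustration, and `flush` stands in for whatever would do a single multi-row insert. It is not how any particular framework does it.

```python
import queue
import threading

def batch_worker(work_queue, flush, max_batch=100, max_wait=0.05):
    """Drain work_queue and hand items to flush() in batches.

    A batch is flushed when it reaches max_batch items or when no new
    item arrives within max_wait seconds. Putting None shuts the worker down.
    """
    while True:
        item = work_queue.get()              # block until there is work
        if item is None:                     # sentinel: stop
            return
        batch = [item]
        try:
            while len(batch) < max_batch:
                nxt = work_queue.get(timeout=max_wait)
                if nxt is None:              # flush what we have, then stop
                    flush(batch)
                    return
                batch.append(nxt)
        except queue.Empty:
            pass                             # queue went quiet: flush a partial batch
        flush(batch)

# Usage: request handlers put rows on the queue, one worker owns the db connection.
flushed = []                                 # stand-in for the database
q = queue.Queue()
worker = threading.Thread(target=batch_worker, args=(q, flushed.append))
worker.start()
for i in range(250):
    q.put(i)
q.put(None)
worker.join()
print([len(b) for b in flushed])             # batch sizes depend on timing
```

The design choice is that only the worker touches the database, so the number of db connections stays fixed no matter how many requests are in flight.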

The one big upside of AWS is that if you do need to manage a lot of servers (like you are in the IoT space and need to handle millions of requests per second), they have good tools for doing that. Multi-region redundancy is also a click of a button if you need that. But they normalize overbuilding (and thus needing that management) way before it's necessary.

[0] https://docs.aws.amazon.com/wellarchitected/latest/serverles...

[1] https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-...

Depending on the margins that could be preferable.
Yes, at a 1% margin on those sales, that's more like $127M in profit. It's important to remember that things like Prime Day are basically marketing that results in revenue outside the event.
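The margin argument is easy to check with a few lines. The sales and spend figures are the ones quoted from the article at the top of the thread; the margin values are hypothetical, and "profit" here is assumed to mean profit before the infrastructure bill is subtracted.

```python
SALES = 12_700_000_000       # event sales quoted from the article
INFRA_SPEND = 102_000_000    # infrastructure spend quoted from the article

def infra_share_of_profit(margin):
    """Infra spend as a fraction of profit at a given margin (assumed values)."""
    profit = SALES * margin
    return INFRA_SPEND / profit

for margin in (0.01, 0.02, 0.03):    # hypothetical margins, not Amazon's real ones
    profit = SALES * margin
    print(f"margin {margin:.0%}: profit ${profit / 1e6:,.0f}M, "
          f"infra spend is {infra_share_of_profit(margin):.0%} of that")
```

At a 1% margin the infrastructure bill is roughly 80% of the profit, which is why the thin-margin case looks so different from the headline sales figure.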
>It's important to remember that things like Prime Day are basically marketing

Be it Prime Day or Black Friday/Cyber Monday sales, I've seen the prices before the sale starts, and then once the sales start, it is the same price but with a slashed out higher MSRP type price. It's not any more of a sale during the sale than it was any of the other days.

Yeah, actual profit was likely $100-400 million or so. As such, spending $102 million on a single line item would raise serious questions at most companies.

Of course Amazon is paying itself that premium so they have little incentive to care.