by netwo233gur 1668 days ago
The Cloudflare blog post really only looks at wholesale cost of bandwidth and compares it to the price AWS charges. But I think it's missing a huge component of all of the magic that happens inside AWS between those two things.

I've seen some of the inner workings of the big cloud providers' networking stacks. The networking infrastructure, the software that runs it, the software that exposes it to customers, the thousands of engineers working at any given moment in AWS/GCP/Azure's NOCs to maintain uptime are truly some of the most impressive technical marvels I have ever seen. They aren't as sexy to discuss on HN as something like the managed container services, functions as a service, EC2 etc, but the networking stacks like the VPC, NAT gateways, subnet routing, privatelinks, security groups, ENIs, Nitro cards, etc are pure magic as far as I'm concerned and are so so so much more complicated than a standard data center's networking stack, or even Cloudflare's stack.

To use Cloudflare's "bucket of water" metaphor, AWS isn't even close to just being a dumb bucket of water that you fill with water and then get charged to take out the water. There is so much that happens inside of that bucket to segment your water into different pipes, routing your water in all kinds of customer-customizable ways for many different use cases, mixing/heating/cooling your water as you need, all while guaranteeing things like making sure your water arrives exactly where it is supposed to arrive and doesn't get contaminated or leaked along the way.

Does AWS make a big markup on bandwidth? Yea, surely they do. But is it as simple as Cloudflare says it is? Not even close.

4 comments

Yes, their network stacks are definitely complex and cost a lot to maintain, I'm sure. But that doesn't necessarily make it a good deal if the customer isn't able to derive enough additional value from all that complexity. In fact it makes the offering less attractive if the complexity isn't sufficiently abstracted away and distracts from product work, or if their abstractions are leaky.

Recently I've been working with https://fly.io/ for a new app and it's a breath of fresh air compared to working with the big cloud providers. They offer simple but robust networking primitives built on top of IPv6 and WireGuard and provide a ton of value add on top like global distribution & load balancing, service discovery, TLS termination, all of which just work exactly like I'd expect them to, out of the box without any configuration on my side.

EDIT: Almost forgot to mention: their egress costs are also much more reasonable: https://fly.io/docs/about/pricing/#outbound-data-transfer

I'm watching fly.io with interest, but I want to see how they handle their first major incidents (response time, lessons learnt, transparency) before I trust them with a production site. Most SRE skills related to your own operations are learnt on the battlefield and not via some cliche must-read book from Google engineers afaic.

If it's Linode style (delayed status page updates, sometimes by as much as 15 minutes; zero-detail post-mortems along the lines of "this problem has been fixed by our engineers, thank you, yada yada"; and the same issues repeating six months down the line) then I will be understandably disappointed.

I've only been with them through one major incident so far, and I recall them handling it reasonably well.

You can see them responding to customers and providing updates in real time here: https://community.fly.io/t/there-seems-to-be-an-outage-with-...

And a detailed postmortem here: https://community.fly.io/t/major-outage-portmortem-2021-10-1...

They also update their status page pretty diligently whenever something goes wrong even for things that don't necessarily impact all customers (the only recent item on there that affected my app directly was the Oct 13 one from what I can remember): https://status.flyio.net/history

> But that doesn't necessarily make it a good deal if the customer isn't able to derive enough additional value from all that complexity.

It’s simply obvious that it’s not a good deal if you’re not their target customer with a use case they cater to. However, it could be a good deal if you have a relevant use case. Unless it’s being suggested that AWS caters to everyone in all cases then it adds nothing to the conversation to point it out.

https://www.hetzner.com/cloud gives you 20TB bandwidth for €3.49/mo VMs, which I've essentially regarded as Hetzner giving unlimited free bandwidth for all servers.

Being lynched for egregious egress fees is only something I've experienced when using mega corps' clouds, where economies of scale suggest their vastly larger size should allow them to provide even better value.

But that's in a normal market, not the artificial lock-in mega cloud corps enjoy where they're able to distort customer behavior from artificially high pricing.
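For a sense of scale, the 20TB for €3.49/mo figure above can be turned into an upper bound on the per-GB price. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1000 GB) and charges the whole VM price to bandwidth, which overstates the real cost. The function name is made up for illustration.

```python
def effective_price_per_gb(monthly_price: float, included_tb: float) -> float:
    """Upper bound on per-GB cost if the entire monthly price were bandwidth."""
    if included_tb <= 0:
        raise ValueError("included_tb must be positive")
    included_gb = included_tb * 1000  # assumption: decimal terabytes
    return monthly_price / included_gb


# Figures quoted in the comment above: EUR 3.49/mo with 20 TB included.
hetzner_eur_per_gb = effective_price_per_gb(3.49, 20)
print(f"at most EUR {hetzner_eur_per_gb:.7f} per GB")
```

Any per-GB egress price can be divided by this number to see how many times more expensive it is than the bundled allowance.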

I'm a Hetzner home user and a huge fan, but let's not compare the quality of networking you get for free from them with the networking you get from AWS.

I don't think I've seen a latency spike on AWS in 10 years. On Hetzner, it's often possible to observe latency spikes and drops over 10 minutes (and the situation hasn't changed much in about 10 years).

In all the years I've used Hetzner I've never observed these random 10 minute latency drops you speak of. They've always had much faster internet access than I've ever been able to get from my home broadband so I'll even SSH into & use them for network intensive dev tasks like iterating on a new Docker container since it's able to download & build the image packages in a fraction of the time.

The primary issue I have with them is the latency to their DE/FI data centers from the US. If their US DC offered dedicated servers I would be migrating over to use them instead.

They launched Cloud in the US this month, very likely dedicated will be offered soon enough. The bang for buck on Hetzner is insane, really love them, but have and would rip them out of any business environment I come across, largely due to network quality and attitude to support.

If you haven't experienced using Google Translate on insistently German responses from one of their DC techs you probably haven't been using them for long enough ;)

As for networking, I would encourage installing something like Smokeping.

I've needed to access their tech support once when my HDD failed and a couple of times for new SSL certs before LetsEncrypt, and they were always responsive and supportive. Don't see how derogatory characterizations of their DC techs are in any way necessary.

But I don't really access AWS support either, when something doesn't work I've just killed the VM and started a new one. It's less disposable with bare metal servers, I can physically restart the server from their control panel or if issues are not fixable, reset the server with a new Linux OS image, which granted would be a lot more time consuming.

I will add that whilst I'm not in the business of dictating which cloud services business customers should use, I'd agree that I would recommend AWS over Hetzner to customers who a) are paying for it and would have to administer it themselves and b) would then have access to all the managed services they would ever need in future.

I would still recommend they consider Hetzner for any resource-intensive workloads, where their raw compute is vastly less expensive. I'll also choose the cheaper recurring cost over convenience when I'm able to self-service it.

Hetzner support, when I have needed it, has always been faster and of better quality than AWS or Azure. All emails and talking were in English.

Hetzner has its network hiccups sometimes, but AWS quality may be a joke if you really care about latency tails and even median under any significant load. I didn't analyze the networking itself, but you run in a VM and share the host machine with other clients' VMs, and you just can't get stable latencies this way. It's night and day when you migrate to baremetal Hetzner and observe how latencies change. (Again, it's about dedicated baremetal. I know nothing about Hetzner's cloud.)

It's not really that much magic. It's just a variation of EVPN-VXLAN plus smart NICs that segments and directs the traffic. Then they have normal VM hosts or nowadays devices with ASICs that handle the GW and NAT functionality.

Custom ASICs (Nitro chips) aren't magic? Maybe so, but they cost money to develop.

All of the other networking stuff (Security Groups, NACLs, flow logs, VPCs, subnets, etc.) that you don't directly pay for isn't magic either, but it also costs money.

Nitro is just a fancy converged host adapter with Smart NIC functionality. It's unclear to the industry how much of Nitro is custom, and how much of it is existing IP that is cobbled together (e.g. Graviton and the ARM Neoverse cores).

The ASICs are on the fabric doing the routing and NAT for all the traffic in the AZ. These ASICs are unlikely to be custom. Hyperscale operators typically use open networking hardware with merchant silicon. You can get open networking hardware to do all sorts of packet manipulation, and these devices are cheaper than those from traditional manufacturers, but more powerful as they expose more low-level interfaces.

All those features you talk about are implemented from features that are provided by these hardware platforms.

AWS is just putting a managed service together from them, no different to how they take postgres, do some tweaks and rebrand it as an AWS service.

It's weird to me how people think contrasting a raw pipe billed on 95th percentile to a service like S3 or Cloudflare is in any way a fair comparison.

S3 has its own data retrieval costs, as do several of their managed services.

Those are separate charges from the EC2 costs the Cloudflare blog post discussed.

Egress does not mean S3 or Cloudflare. Egress is the raw pipe billing from AWS to the wider internet. Other services are priced differently.

That’s where folks are revealing how clueless they are.

Raw pipe isn’t priced in GB, it’s priced in capacity. To serve GB on Black Friday you need to provision far, far more capacity.
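For anyone unfamiliar with the "95th percentile" billing mentioned upthread, here is a minimal sketch of the usual burstable-billing convention: throughput is sampled at fixed intervals (commonly every 5 minutes), the top 5% of samples are thrown away, and the highest remaining sample is the billed rate. The function name and the sample data are made up for illustration, and individual carriers may differ in the details.

```python
def billable_mbps(samples_mbps: list[float]) -> float:
    """Burstable billing: discard the top 5% of samples, bill the highest one left."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    n = len(ordered)
    # Integer ceil(0.95 * n) avoids floating-point rounding surprises.
    index = (95 * n + 99) // 100 - 1
    return ordered[index]


# Hypothetical month: steady 100 Mbps, with 5% of samples spiking to 1000 Mbps.
samples = [100.0] * 95 + [1000.0] * 5
billed = billable_mbps(samples)
print(f"billed at {billed} Mbps")
```

Short bursts that fit inside the discarded 5% are free under this scheme, but a sustained peak is not, which is why the capacity you pay for tracks peak traffic rather than total GB transferred.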

Yeah but people can do that math. Peak bandwidth is not 200x higher than average bandwidth.

Cloudflare used a 5x multiplier. How high do you think it needs to be? Does total AWS bandwidth even go up that much on black friday?
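The math being argued about here can be sketched as follows: convert a capacity price (per Mbps per month) into an effective price per GB delivered, given a peak-to-average multiplier. The 5x and 200x multipliers are the ones mentioned in this thread. The 1.00 per Mbps price is a hypothetical placeholder and not a real quote, and the 30-day month and decimal units are assumptions.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def cost_per_gb(price_per_mbps_month: float, peak_to_average: float) -> float:
    """Effective cost per GB delivered when you must pay for peak capacity."""
    # GB moved in one month by 1 Mbps of *average* throughput:
    # megabits -> megabytes (/8) -> gigabytes (/1000, decimal units).
    gb_per_avg_mbps = SECONDS_PER_MONTH / 8 / 1000
    # Each Mbps of average traffic needs peak_to_average Mbps provisioned.
    provisioned_cost = price_per_mbps_month * peak_to_average
    return provisioned_cost / gb_per_avg_mbps


HYPOTHETICAL_PRICE = 1.00  # placeholder price per Mbps per month, not a real quote
for multiplier in (1, 5, 200):
    per_gb = cost_per_gb(HYPOTHETICAL_PRICE, multiplier)
    print(f"peak/average {multiplier:>3}x -> {per_gb:.4f} per GB")
```

The result scales linearly with the multiplier, so the whole disagreement reduces to which multiplier is realistic: 1 Mbps sustained for a month moves roughly 324 GB, and everything else is the ratio of provisioned peak to average use.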