by Ne02ptzero 1835 days ago
> For the monthly expenses, with most of the components running on-premises, there was a 90.61% cost reduction, going from US$ 38,421.25 monthly to US$ 3,608.99, including the AWS services cost.

I might be missing something, but $3k/month (let alone $38k/month) sounds absolutely insane to me for how few metrics they're collecting (4.5k metrics per second, 2.7 TB of data per year). Is the money going to network bandwidth or something along those lines?

2 comments

AWS, for example, charges for cross-AZ data transfers. Naive setups with multiple AZs (us-west-1a, 1b, etc.) and a centralized Prometheus setup will rack up quite the cost.
Reducing cross-AZ data transfer on one service resulted in low six-figure savings per year. It's something we overlooked during initial setup, and now it's something I check for when dealing with AWS networking.
Out of curiosity, what's your plan for recovery during an AZ or regional outage?
If RDS or other “hard to replicate very quickly during disaster” infra is being run, I personally would still have cross-AZ replication at minimum. To reduce network costs, I would configure the “other zone” as a backup replica only and not for performance clustering.

With automation we can spin up full new compute stacks, including load balancers and DNS in about 5-10 minutes per “unique” environment configuration.

While it guarantees we could never have a zero-downtime failover, we’re okay with it and have more than halved our network costs (which admittedly were about number 8 on our AWS bill by cost).

If AWS has a regional outage, this service is so far down the list of services to recover/restore that it probably will be overlooked. We accept the increased risk for the cost savings, since it meets the reliability requirements of the service.
They say they ingest 226 GB per month. The cross-AZ transfer cost is $0.01 per GB. So that should come out to only $2.26/month for them.
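For anyone who wants to redo that back-of-envelope math with their own numbers, here is a rough sketch. The function name and the `directions` parameter are made up for illustration; the $0.01/GB rate and the 226 GB/month figure are the ones quoted above, and counting the charge in both directions is an assumption about how the transfer gets billed, not something taken from the post.

```python
def cross_az_cost(gb_per_month: float, usd_per_gb: float = 0.01, directions: int = 1) -> float:
    """Estimate monthly cross-AZ transfer cost in USD.

    directions=2 models the (assumed) case where the same traffic is
    billed on both the sending and the receiving side.
    """
    return gb_per_month * usd_per_gb * directions


# Figures quoted in the thread: 226 GB ingested per month at $0.01 per GB.
one_way = cross_az_cost(226)
both_ways = cross_az_cost(226, directions=2)

print(f"one direction:   ${one_way:.2f}/month")
print(f"both directions: ${both_ways:.2f}/month")
```

Either way it lands at a few dollars a month, nowhere near the size of the bill being discussed.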
They are also storing the data in S3 (ingest bandwidth, storage). Plus running an RDS instance (instance cost, x2 if replicated, bandwidth again possibly x2 or more, IO cost) as well as local storage (EBS size/IO cost). And the size of the raw metrics might not be the wire size, especially if it’s uncompressed JSON. My guess on reviewing the post is that a lot of their savings were inter-AZ bandwidth conservation. But it’s hard to say without poking around in their AWS console :)
I agree. I'm working with a metrics system that takes just over 1 million metrics a second, and it has a similar run rate of ~$38k a month.