Hacker News
by lovehashbrowns 1011 days ago
That's certainly real and something to consider when provisioning systems. I'm fully on board with that. The problem is when the cost of the cost-saving measure vastly outweighs the cost of over-provisioning infrastructure. Like this Jenkins issue bubbling up ~2-4 times a month vs. just giving the worker nodes more storage space. There have been times when it happened during the night and people got paged.

Or compare the cost of one store not being able to open on time because the RDS database ran out of space. VPs and directors start yelling and there are suddenly like 20+ people involved in figuring out why this one store didn't open on time. What's the cost of that compared to just giving the DB 250GB of space so this never comes up again?
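To make that comparison concrete, here is a rough back-of-the-envelope sketch. Every number in it is a made-up placeholder (the hourly rate, the hours lost per incident, the price per GB-month, the extra 200GB); only the "~2-4 times a month" frequency comes from the comment above. Plug in your own figures.

```python
def monthly_incident_cost(incidents_per_month, people_involved, hours_each, hourly_rate):
    """Rough monthly cost of people's time spent on recurring out-of-space incidents."""
    return incidents_per_month * people_involved * hours_each * hourly_rate


def monthly_storage_cost(extra_gb, price_per_gb_month):
    """Monthly cost of simply provisioning extra storage."""
    return extra_gb * price_per_gb_month


# Hypothetical inputs: 3 incidents a month (middle of the ~2-4 range),
# 2 people involved for 1.5 hours each at an assumed 100/hour.
incident_cost = monthly_incident_cost(3, 2, 1.5, 100.0)

# Hypothetical inputs: 200 GB of extra space at an assumed 0.10 per GB-month.
storage_cost = monthly_storage_cost(200, 0.10)

print(f"incidents: {incident_cost:.2f}/month, extra storage: {storage_cost:.2f}/month")
```

This only counts engineer time. It ignores the harder-to-price parts, like a store opening late or people getting paged at night, which would push the incident side even higher.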

But you are also 100% correct and I've seen that happen here, too. There are some instances I'm responsible for that were using EFS for their local storage, costing thousands of dollars every month for absolutely no reason. I switched those to reasonably sized EBS volumes and that alone was half of my annual savings goal.

I was completely flabbergasted seeing these instances using EFS while others were stuck on 8GB EBS volumes. Backups of the EFS drives had ballooned to many TBs. And the backups were worthless! The instances themselves are ephemeral. They use S3 for long-term storage, and the metadata is in a database. Those are the things that should be backed up, and their cost compared to EFS is minuscule.


Yeah. I suppose the tricky thing is:

> compared to just giving the DB 250GB of space so this never comes up again?

As long as there is reasonable confidence that this is actually the case, then just provision the space and be done with it. That requires a certain understanding of future space requirements/expectations, and anything that is running away or leaking space even ever so slightly will hit any limit given enough time. So, due diligence requires checking whether the space is actually needed.

Yup, I implemented a bunch of graphs and alerts. Right now it's at 100GB of usage, so it's still growing, but at a fairly predictable rate. Another nice thing to know is whether it's possible to reduce that usage. I haven't been able to look into that, but I know one of the causes of the usage increase. The service uses the DB to store some indexing data. There's a team forcing it to re-index and I can tell when they deploy because the storage spikes a little bit every time they do a deployment. Nothing I can do about that, sadly.
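For what it's worth, "growing at a fairly predictable rate" can be turned into a rough days-until-full estimate. The following is only a sketch: the weekly samples are invented for illustration (only the 100GB current usage and the 250GB limit come from this thread), and it assumes growth is roughly linear, which re-index spikes on deploy would blur.

```python
def days_until_full(samples, limit_gb):
    """Estimate days until usage reaches limit_gb from (day, used_gb) samples.

    Fits a least-squares line through the samples and extrapolates from the
    most recent one. Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n

    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one day")

    # Least-squares slope in GB per day.
    slope = sum((day - mean_day) * (used - mean_used) for day, used in samples) / spread
    if slope <= 0:
        return None

    _, latest_used = samples[-1]
    return (limit_gb - latest_used) / slope


# Hypothetical weekly readings ending at the 100GB mentioned above.
usage = [(0, 94.0), (7, 96.0), (14, 98.0), (21, 100.0)]
remaining = days_until_full(usage, limit_gb=250)
print(f"roughly {remaining:.0f} days of headroom at the current rate")
```

An alert on "less than N days of headroom" tends to be more useful than one on a fixed percentage, since it also catches the slow leak case mentioned earlier.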