by synicalx 1514 days ago
Something definitely seems off here, from the fact that RDS chose to scale to the response you got from support (who've always been... mostly ok in my experience).

First of all, I'd like to find out a little more context before I jump to any conclusions:

- Does CloudWatch confirm that RDS "needed" to scale? And if it does, do any other metrics increase simultaneously with an increase in the storage being used?

- Other than the (presumably) one change made to that RDS instance at ~4.30AM, were there any other changes made to it, specifically to its storage, prior to the autoscaling event?

- Had this service been tested on RDS prior to the migration being performed?

- Were any other changes made that may potentially affect your DB? For example, a query being changed or something of that nature.
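For the first question, here is a rough sketch of how the free-space history could be pulled and scanned for a sudden drop. It assumes boto3 is installed with credentials configured, and that the instance identifier you pass in is your own; the helper names `fetch_free_storage` and `steepest_drop` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def fetch_free_storage(instance_id, hours=24, period=300):
    """Fetch the minimum FreeStorageSpace (bytes) per period from CloudWatch."""
    import boto3  # third-party; imported lazily so the helper below works without it

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeStorageSpace",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=period,
        Statistics=["Minimum"],
    )
    # Datapoints are not guaranteed to come back in time order.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Minimum"]) for p in points]


def steepest_drop(points):
    """Return (start, end, bytes_lost) for the largest drop between
    consecutive samples, or None if free space never decreased."""
    worst = None
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        lost = v0 - v1
        if lost > 0 and (worst is None or lost > worst[2]):
            worst = (t0, t1, lost)
    return worst
```

Calling `steepest_drop(fetch_free_storage("my-db-instance"))` (with a placeholder identifier) gives the window where free space fell fastest, which can then be lined up against other metrics and the event log.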

To me, it sounds like support found something that suggested whatever was happening to RDS at the time was "someone else's problem" under their shared responsibility model. Whether or not that's true, who knows, but from how you've described it they definitely seem to be trying to palm you off. It's worth mentioning to your TAM/rep if you have one, because this is pretty poor service.


@synicalx thanks for your feedback.

I will see if there is anything visible in CloudWatch.

> - Other than the (presumably) one change made to that RDS instance at ~4.30AM, were there any other changes made to it, specifically to its storage, prior to the autoscaling event?

At 4.30 the following was logged: "Storage size 999 GB is approaching the maximum storage threshold 1000 GB. Increase the maximum storage threshold."

However the auto scaling event started at 10.17.

> - Had this service been tested on RDS prior to the migration being performed?

We performed a dozen migration simulations in our Sandbox Account over multiple weeks. We developed scripts and automation to perform the actual migration. The only difference in the Sandbox Account is that the RDS database was smaller in CPU and RAM.

> - Were any other changes made that may potentially affect your DB? For example, a query being changed or something of that nature.

I will double check with the team, but the whole migration was fully automated with scripts. No one has reported to me any action being required beyond executing the automation scripts and following the plan.

Yeah, I feel like something was definitely up with RDS, or at least this event is suspicious enough to warrant further investigation. Step one is definitely to get all your "evidence" together - CloudWatch screenshots, logs, maybe CloudTrail etc. - and then double check there's nothing you missed or any boo-boos in the scripts or data.

Either way, definitely mention your support interaction to your account manager if you have one. From how you've described it, this was a pretty poor interaction, especially if you're paying for Business support. If it's definitely not something you caused, I would also ask them to escalate the issue and get you an explanation as to why RDS did what it did.

> Does CloudWatch confirm that RDS "needed" to scale?

Actually, you pointed out a clue that I missed. I should have checked CloudWatch!

The free space graph shows that at some point something consumed the 200 GB of space that we originally assigned.

This is a good clue that I am going to dig into.

Thanks for your feedback!

EDIT: The CloudWatch metrics for RDS were the key to finding the source of the issue.

So what was it? My money would be on (in order): WAL, logs, unexpected index growth.

Things started to go from bad to worse because the work_mem parameter had not been set up correctly.

Some queries that require a large amount of memory to process started to use disk instead.
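For anyone who wants to check for this kind of spill: assuming this is PostgreSQL (work_mem is a PostgreSQL parameter), `EXPLAIN ANALYZE` reports sorts that did not fit in memory with a `Disk:` figure instead of a `Memory:` one. The sketch below just scans saved plan text for those lines; the function name and the sample plan are made up for illustration and are not taken from the incident.

```python
import re

# Matches plan lines such as "Sort Method: external merge  Disk: 204800kB".
# In-memory sorts report "Memory: ..." instead and are ignored.
_SPILL = re.compile(r"Sort Method:\s+(.+?)\s+Disk:\s+(\d+)kB")


def find_disk_sorts(explain_output):
    """Return (sort_method, kilobytes_on_disk) for every sort that spilled to disk."""
    return [(m.group(1), int(m.group(2))) for m in _SPILL.finditer(explain_output)]


# Illustrative plan fragment, not real output from the system discussed here.
sample = """\
Sort  (actual time=812.4..1033.9 rows=1000000 loops=1)
  Sort Method: external merge  Disk: 204800kB
Sort  (actual time=0.1..0.2 rows=50 loops=1)
  Sort Method: quicksort  Memory: 25kB
"""

print(find_disk_sorts(sample))  # [('external merge', 204800)]
```

This only covers sorts; hash operations that exceed work_mem show up differently in the plan, so treat it as a starting point rather than a complete check.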

Once autoscaling kicked in, it wouldn't have helped even if we had noticed it, as you are locked out of the system.

The autoscaling event was triggered before our low-disk-space alerts fired.
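A quick way to sanity-check alert thresholds against autoscaling is sketched below. It assumes, based on my reading of the RDS storage autoscaling documentation, that autoscaling becomes eligible once free space is at or below roughly 10% of allocated storage; verify that figure for your engine and storage type. The function names and the alert values in the usage note are hypothetical.

```python
def autoscale_trigger_gb(allocated_gb, trigger_percent=10):
    """Free space (GB) at or below which autoscaling may start.

    trigger_percent=10 is an assumption taken from the RDS docs, not a
    value confirmed anywhere in this thread.
    """
    return allocated_gb * trigger_percent / 100


def alert_fires_first(allocated_gb, alert_free_gb, trigger_percent=10):
    """True if a free-space alert at alert_free_gb would fire before
    free space reaches the assumed autoscaling trigger point."""
    return alert_free_gb > autoscale_trigger_gb(allocated_gb, trigger_percent)
```

With 200 GB allocated, the assumed trigger point is 20 GB free, so an alert set at 10 GB free would never fire first, while one at 40 GB free would.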

What I still don't understand is the timing: we had been moving services over multiple days, and the service that required the work_mem parameter had been in production on AWS for 48 hours before it started to use disk rather than memory to process its SQL queries.

Interesting problem to come across. It sounds like this is a scenario where RDS and the lack of host access/visibility was a bit of a handicap. Glad you found the issue though, hopefully things go smoothly next time!