by ruben81adelaide 1507 days ago
@synicalx thanks for your feedback.

I will see if there is anything visible in CloudWatch.

> - Other than the (presumably) one change made to that RDS instance at ~4.30AM, were there any other changes made to it, specifically to its storage, prior to the autoscaling event?

At 4.30 the following has been logged: "Storage size 999 GB is approaching the maximum storage threshold 1000 GB. Increase the maximum storage threshold."

However the auto scaling event started at 10.17.

> - Had this service been tested on RDS prior to the migration being performed?

We performed a dozen migration simulations in our Sandbox Account over multiple weeks. We developed scripts and automation to carry out the actual migration. The only difference in the Sandbox Account is that the RDS database had less CPU and RAM.

> - Were any other changes made that may potentially affect your DB? For example a query being changed or something of that nature.

I will double check with the team, but the whole migration was fully automated with scripts. Nobody has reported to me any action beyond executing the automation scripts and following the plan.

2 comments

Yeah I feel like something was definitely up with RDS, or at least this event is suspicious enough to warrant further investigation. Step one is definitely to get all your "evidence" together (CloudWatch screenshots, logs, maybe CloudTrail etc.) and then double check there's nothing you missed or any boo-boos in the scripts or data.

Either way, definitely mention your support interaction to your account manager if you have one, from how you've described it this was a pretty poor interaction especially if you're paying for Business support. If it's definitely not something you caused, I would also ask them to escalate the issue and get you an explanation as to why RDS did what it did.

> Does CloudWatch confirm that RDS "needed" to scale?

Actually you pointed out a clue that I missed. I should have checked CloudWatch!

The free space graph shows that at some point something consumed the 200 GB of space that we originally assigned.

This is a good clue that I am going to dig in.

Thanks for your feedback!

EDIT: The CloudWatch metrics in RDS have been the key to finding the source of the issue.

So what was it? My money would be on (in order): WAL, logs, unexpected index growth.

Things started to go from bad to worse because the work_mem parameter had not been set up correctly.

Some queries that require a large amount of memory to process started to use disk instead.
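
For anyone wanting to spot this kind of spill: PostgreSQL reports it in EXPLAIN (ANALYZE) output, where a sort that does not fit in work_mem shows a "Disk:" figure instead of a "Memory:" one. Below is a minimal Python sketch that scans plan text for such lines. The function name and the sample plan fragment are illustrative assumptions, not taken from our incident.

```python
import re

# Matches the sort summary lines printed by EXPLAIN (ANALYZE), for example
# "Sort Method: external merge  Disk: 102400kB".
SORT_LINE = re.compile(
    r"Sort Method:\s+(?P<method>.+?)\s+(?P<where>Memory|Disk):\s+(?P<kb>\d+)kB"
)


def find_disk_spills(plan_text):
    """Return (sort method, size in kB) for every sort that used disk."""
    spills = []
    for match in SORT_LINE.finditer(plan_text):
        if match.group("where") == "Disk":
            spills.append((match.group("method"), int(match.group("kb"))))
    return spills


# Hypothetical plan fragment, for illustration only.
SAMPLE_PLAN = """
  Sort Key: created_at
  Sort Method: external merge  Disk: 102400kB
  Sort Key: id
  Sort Method: quicksort  Memory: 25kB
"""

print(find_disk_spills(SAMPLE_PLAN))
```

Only the first sort is reported, since the second one fit in memory. Hash operations spill too but print differently, so this only covers sorts.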

Once the autoscaling kicked in, even if we had noticed it, it wouldn't have helped, as you are locked out of the system.

The auto-scaling event was triggered before our low-disk-space alerts fired.
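
A rough way to sanity-check that ordering, as a Python sketch. It assumes, per my reading of the AWS documentation, that RDS storage autoscaling starts once free space is at or below 10% of allocated storage; please verify that against the current docs. The 200 GB figure is our original allocation, and the alert thresholds are hypothetical values.

```python
# Assumption: autoscaling starts when free space is <= 10% of allocated storage.
AUTOSCALE_FREE_PERCENT = 10


def autoscale_trigger_gb(allocated_gb):
    """Free space (GB) at or below which autoscaling is assumed to start."""
    return allocated_gb * AUTOSCALE_FREE_PERCENT / 100


def alert_fires_first(allocated_gb, alert_free_gb):
    """True if the alert threshold is crossed before the autoscaling trigger."""
    return alert_free_gb > autoscale_trigger_gb(allocated_gb)


print(autoscale_trigger_gb(200))    # trigger point for a 200 GB allocation
print(alert_fires_first(200, 10))   # hypothetical alert at 10 GB free
print(alert_fires_first(200, 40))   # hypothetical alert at 40 GB free
```

Under that assumption, an alert set below 20 GB free on a 200 GB volume can never fire before the scaling event starts.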

What I don't understand yet is the timing: we had been moving services for multiple days, and the service that required the work_mem parameter had been in production on AWS for 48 hours before it started to use disk rather than memory to process its SQL queries.

Interesting problem to come across. Sounds like this is a scenario where RDS and the lack of host access/visibility was a bit of a handicap. Glad you found the issue though, hopefully things go smoothly next time!