
by nickmerwin 3750 days ago

This is one of the things about being a web developer and part-time DBA that keeps me up at night (sometimes literally all night).

Around a month ago the source file table on Coveralls.io[0] hit a transaction ID wraparound, which put us in read-only mode for over 24 hours while RDS engineers performed a full vacuum (that's not something that can be done manually as a customer). On a managed database I'm paying thousands a month for, I was hoping there would be some sort of early warning system. Well, apparently there is, but it's buried in the logs and won't trigger any app exceptions, so it went unnoticed.
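
For anyone who wants a warning that isn't buried in the logs, here is a minimal sketch of a check that could feed an alert. It assumes you periodically run `SELECT datname, age(datfrozenxid) FROM pg_database;` and pass the age in. The function name and the 50% / 80% thresholds are hypothetical choices for illustration, not RDS or PostgreSQL defaults.

```python
# Sketch: classify how close a database is to transaction ID wraparound.
# Input is the number PostgreSQL reports for age(datfrozenxid).
# The thresholds are illustrative assumptions; tune them for your workload.

XID_WRAPAROUND_LIMIT = 2**31  # roughly 2.1 billion transactions


def wraparound_status(xid_age, warn_fraction=0.5, critical_fraction=0.8):
    """Return (status, remaining_xids) for a given age(datfrozenxid)."""
    if xid_age < 0:
        raise ValueError("xid_age must be non-negative")
    remaining = XID_WRAPAROUND_LIMIT - xid_age
    used = xid_age / XID_WRAPAROUND_LIMIT
    if used >= critical_fraction:
        status = "critical"
    elif used >= warn_fraction:
        status = "warning"
    else:
        status = "ok"
    return status, remaining
```

Hooking the "warning" result into whatever pages you for app exceptions is the point: the database's own log warnings only help if someone reads them.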

What's worse is there's zero indication of how long a vacuum is going to take, and no progress updates while it's running. So for a production web app with customers, this means damage control language like:

"Our engineers have identified a database performance issue and working to mitigate. Unfortunately we do not have an ETA at this time."

About a week later, more calamity hit: the INT "id" field on the same table exceeded its maximum value (2,147,483,647). My first thought was to change it to a BIGINT, but ~4 hrs into the migration, without any indication of how much longer it would take, I pulled the plug and sharded the table instead.
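
The same kind of early check works for the id column. This is a rough sketch, assuming you can cheaply read the current maximum id and have an estimate of how many rows you insert per day; `int_id_headroom` and its parameters are hypothetical names used only for illustration.

```python
# Sketch: how much room is left in a 4-byte signed INTEGER id column.

INT4_MAX = 2**31 - 1  # 2147483647, the largest PostgreSQL INTEGER value


def int_id_headroom(current_max_id, ids_per_day):
    """Return (remaining_ids, percent_used, days_left) for an INT id column."""
    remaining = INT4_MAX - current_max_id
    percent_used = 100.0 * current_max_id / INT4_MAX
    # With no measurable growth there is no meaningful deadline.
    days_left = remaining / ids_per_day if ids_per_day > 0 else float("inf")
    return remaining, percent_used, days_left
```

At a max id of 2,000,000,000 and a million new rows a day, that leaves about 147 days, which is the kind of number you'd rather see on a dashboard than discover in production.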

Moral of the story is that web devs should be aware of these pitfalls, and that no matter how much trust you put into a managed database service, it still could happen to you (cue ominous background music).

Anyway I'm glad to see this lurking monster in our beloved database tamed, thank you Mr Haas!

[0] https://coveralls.io

3 comments

amitlan 3749 days ago

> What's worse is there's 0 indication of how long a vacuum is going to take, nor progress updates while it's going.

Upcoming 9.6 will help with this to a certain degree: http://www.postgresql.org/docs/devel/static/progress-reporti...
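
As a rough sketch of what that enables: assuming the feature exposes a `pg_stat_progress_vacuum` view with `heap_blks_scanned` and `heap_blks_total` columns, a heap-scan percentage can be derived as below. The helper name is hypothetical, and this is not a true ETA, since time spent vacuuming indexes is not reflected in the heap block counts.

```python
# Sketch: percent of heap blocks scanned by a running VACUUM.
# Inputs are assumed to come from one row of pg_stat_progress_vacuum.


def vacuum_heap_progress(heap_blks_scanned, heap_blks_total):
    """Return percent of the heap scanned, or None if the total is unknown."""
    if heap_blks_total <= 0:
        return None
    return round(100.0 * heap_blks_scanned / heap_blks_total, 1)
```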


dankohn1 3750 days ago

Thanks for the post. I'm a big fan of coveralls (I use it to collate the simplecov reports from 20 CircleCI instances), and I was wondering about the downtime.


merb 3750 days ago

That's why I always use BIGINT for 'most' tables. Only when I know that a table will never outgrow INT will I use INT.
