Hacker News new | ask | show | jobs
by mijoharas 2052 days ago
Might have been nice to see this post a year ago :).

My company migrated our ~2TB postgres db from heroku to AWS (originally aurora, we switched back to RDS postgres). Haven't had time for a blog post, but figure I might post our learnings here, and the differences in our approach.

* We did the second step (ec2 -> aurora) via AWS DMS (database migration service). This service may be a little more of a pain than it's worth, but it's what we settled on.

* we used wal-g rather than wal-e for the initial phase, it has slightly better performance, and we were happy with the tool.

* We wrote ansible playbooks to perform the heroku -> ec2 migration. it's definitely nice and makes you feel a little more secure having run the db migration multiple times over, and it would be a giant PITA to do all that manually. (hadn't used ansible much before, I have to say, I'm a fan! had used chef/puppet in bits and pieces before and was never that keen on them).

* Aurora wasn't the right fit for us. Be careful about reading the pricing. On the AWS pricing calculator it asks you for how many million requests you have per month. This is _not_ how it's charged. If you read the docs, the IO charges apply to AuroraReadIO and AuroraWriteIO, which is not per db request, it's per db storage request, which is number of requests * number of db blocks that each request fetches.

* Our Aurora costs were through the roof, (and variable and spiky, never something you want!). We migrated again to RDS postgres, and bumping the provisioned IOPs up we had equivalent performance (possibly better?) for half the price (and the price was stable). We benchmarked everything else but the cost with the db migration, instead relying on the AWS simple pricing calculator. The difference between predicted and actual storage IO costs was over 10x out if my memory serves (making the entire DB double the price per month). BEWARE!

* Aws DMS is kinda buggy, last I checked the table selection editor UI didn't work when you added/modified multiple fields. You can switch to a json editor, so generate the table selection once, and check it in the json editor in case the UI got borked.

* Speed wasn't great with AWS DMS as well, we had to flip BatchApplyEnabled on, and take a lot of care with it.

* There were definitely some shenanigans to do with LOBs, but I can't fully remember them.

* DMS does not update id sequences, so if you have any tables with incrementing ids, make sure that you set the sequences manually before switching dbs.

* We used debian as our intermediate ec2 instance. We've historically used ubuntu for many ec2 instances, but the default ubuntu AMI image uses MBR for it's partition (!!! in 2020!!!) and while we had some code to swap it to GPT while running, that felt a little too spicy for a db migration, so we used base debian which has a sane GPT partitioned disk for it's AMI. (Note: we needed to create partitions larger than what MBR can support so this was a hard requirement).

* The last learning I should add, is that heroku support were actually really great during the whole process, very knowledgeable and gave some setup example files for the ec2 instance. After that they basically leave it to you, but I can't speak highly enough about them (thanks jessie!)

* One more for you, be careful about the version of libc that you have. even if the postgres versions are identical, subsequent libc versions can have slightly different string sorting, which can corrupt your indexes that are migrated after copying the basebackup. (we needed to rebuild an index or two due to this, not a huge deal, but an interesting gotcha. I swear I checked our locale settings 5 times when it went wrong until I clocked on to it being the libc version.)

There were probably more learnings than that, but that's what comes to mind.

(EDIT: added final bullet point). (EDIT 2: added one more bullet)

2 comments

I originally wanted to use DMS but our DB failed the pre-migration assessment so that forced us to Logical Replication.

We also worked with Jesse. You are right he is amazingly helpful. Unfortunately we got some pretty questionable help from others before he was assigned to our support ticket but once that happened it was infinitely improved.

Ahhhh... well you might've dodged a bullet to be honest! If I were to do it all again, there's a good chance I might stick with the standard tools rather than DMS.

Thanks for the in depth article by the way, very well written!

Keep an eye on your AuroraStorageIO costs going forward, as we said, we halved our costs and increased performance by switching back to RDS and increasing the PIOPs available. I'm sure that'll be workload sensitive though, so YMMV. We did a bunch of benchmarking, and aurora seemed nice apart from this.

We must've lucked out, as we got Jesse at the start.

Fantastic comment, thank you so much. I've been following Aurora for a while but a little curious (and skeptical) about sharp edges vs vanilla PG. If I may ask you a follow-on question, who do you think Aurora would be the right fit for?
Hmmmm... That's a good question.

The performance was great, I can't complain about it, but the unpredictable (and high) price was the issue.

In terms of aurora I think it probably can scale better to much larger workloads. If we put the time in and dug in and optimised any queries that could touch a large number of blocks (we do well, but I'm sure there are queries that have some seq scans somewhere in our codebase) then the price wouldn't have been an issue, and we'd happily have stayed with it.

In addition, we have a single large master and one replica. with my understanding of aurora's architecture, and how the storage is separated from the "postgres" instance, it could possibly do a lot better with drastically different architectures with a lot of different smaller "postgres" instances. We didn't have time to investigate many other architectures and in the end moving to RDS felt safest (didn't want to spend too much time when we could be building core features)

Hope that answers your question! Some of the above is more an educated guess with a smattering of experience so please treat it as such.

On the other hand, vanilla postgres is very powerful, and you shouldn't underestimate how much you can win by cranking up the PIOPs for an RDS postgres instance.

Remember, always benchmark representative workloads.