Hacker News new | ask | show | jobs
by jeltz 70 days ago
Turns out the amd machine had huge tables enabled and after disabling those the regression was there on and too. So arm vs amd was a red herring.

Of course not a nice regression but you should not run PostgreSQL on large servers without huge pages enabled so thud regression will only hurt people who have a bad configuration. That said I think these bad configurations are common out there, especially in containerized environments where the one running PostgreSQL may not have the ability to enable huge pages.

2 comments

Still that huge a regression that affects multiple platforms doesn't sound too neat, did they narrow down the root cause?
That should be obvious to anyone who read the initial message. The regression was caused by a configuration change that changed the default from PREEMPT_NONE to PREEMT_LAZY. If you don’t know what those options do, use the source. (<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...>)
Yes, I had a good laugh at that. It might technically be a regression, but not one that most people will see in practice. Pretty weird that someone at Amazon is bothering to run those tests without hugepages.
I doubt they explicitly said "I'll run without huge pages, which is an important AWS configuration". They probably just forgot a step. And "someone at Amazon" describes a lot of people; multiply your mental probability tables accordingly.
The number of people at Amazon is pretty much irrelevant; the org is going to ensure that someone is keeping an eye on kernel performance, but also that the work isn’t duplicative.

Surely they would be testing the configuration(s) that they use in production? They’re not running RDS without hugepages turned on, right?

> The number of people at Amazon is pretty much irrelevant; the org is going to ensure that someone is keeping an eye on kernel performance, but also that the work isn’t duplicative.

I'd guess they have dozens of people across say a Linux kernel team, a Graviton hardware integration team, an EC2 team, and a Amazon RDS for PostgreSQL team who might at one point or another run a benchmark like this. They probably coordinate to an extent, but not so much that only one person would ever run this test. So yes it is duplicative. And they're likely intending to test the configurations they use in production, yes, but people just make mistakes.

True; to err is human. But it is weird that they didn’t just fire up a standard RDS instance of one or more sizes and test those. After all, it’s already automated; two clicks on the website gets you a standard configuration and a couple more get you a 96c graviton cpu. I just wonder how the mistake happened.
You're assuming that they ran the workload with huge-pages disabled unintentionally.