> If this somehow does end up being a reproducible performance issue (I still suspect something more complicated is going on), I don't see how userspace could be expected to mitigate a substantial perf regression in 7.0 that can only be mitigated by a default-off non-trivial functionality also introduced in 7.0.
> Maybe we should, but requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great.
Completely right. This sounds like a communication failure. Maybe Linux maintainers should pick a few applications that have "priority support" and problems with these applications are also problems with Linux itself. Breaking Postgres is a serious regression.
Reminds me of a situation where Fedora couldn't be updated if you had Wine installed and one side of the argument was "user applications are user problem" while the other was "it's Wine, like come on".
Performance regressions are different from ABI incompatibilities. If the kernel refused to do any work that slowed down any userspace program, the pace would go a lot slower.
Or be a lot uglier. See: Microsoft replacing its own API surfaces with binary-compatible representations to work around companies like Adobe adding perf improvements like bypassing the kernel-provided kernel object constructors, because it saved them a few cycles to just hard-code the objects they wanted and memcpy them into existence.
I’m absolutely flabbergasted by the performance left on the table; even by myself - just yesterday I learned Gentoo’s emerge can use git and be a billion times faster.
The time spent by emerge is utterly dwarfed by the time spent to build the packages, so who cares? Maybe it's different if installing a binary system, but I don't think most people are doing that.
If you can emerge in 2.86s user you can do it right before you emerge world, meaning it's all "done in one interaction" (even if the actual emerge takes an hour, you don't have to look at it).
Whereas if emerge is taking 5-10 minutes, you have to remember to come back to it, or script it.
That's really not universally true. Building can be parallelized on modern multi-core CPUs (minus configure), emerge cannot and portage is really really slow.
Bad because as of Splunk 10.x, Splunk bundles postgres to integrate with their SOAR platform. Parenthetically, this practice of bundling stuff with Splunk is making vuln remediation a real pain. Splunk bundles its own python, mongod, and now postgres, instead of doing dependency checking. They're going to have to keep doing it as long as they release a .tgz and not just an RPM. The most recent postgres vuln is not fixed in Splunk.
1) That is about transparent huge pages which is a different thing and 2) it is always clear cut for PostgreSQL. If you can you should always use huge pages (the non-transparent kind).
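For anyone wanting to check which kind of huge pages a box actually has, here is a minimal sketch that pulls the relevant counters out of `/proc/meminfo`-style text. The function names are made up for illustration. It only reads the standard `HugePages_Total`, `HugePages_Free`, `Hugepagesize` and `AnonHugePages` fields, and it says nothing about whether PostgreSQL itself is using the reserved pages (that depends on its `huge_pages` setting).

```python
def parse_hugepage_info(meminfo_text):
    """Extract huge-page related counters from /proc/meminfo-style text.

    HugePages_* are page counts for explicit (preallocated) huge pages;
    Hugepagesize and AnonHugePages (transparent huge pages) are in kB.
    """
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize", "AnonHugePages"}
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        key = key.strip()
        if not sep or key not in wanted:
            continue
        # The first token after the colon is the number; a "kB" unit may follow.
        info[key] = int(rest.split()[0])
    return info


def explicit_hugepages_reserved(meminfo_text):
    """True if the kernel has any explicit huge pages reserved."""
    return parse_hugepage_info(meminfo_text).get("HugePages_Total", 0) > 0
```

On a Linux machine you would feed it `open("/proc/meminfo").read()`. A non-zero `AnonHugePages` with `HugePages_Total` at 0 would suggest only the transparent kind is in play.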
Java can work with transparent hugepages (in addition to preallocated hugepages), but you just use -XX:+AlwaysPreTouch to map them in during startup so that at runtime there won't be any delays or jitter.
Redis should add a similar option
AIUI in that thread they're saying "0.51x" the perf on a 96-core arm64 machine and they're also saying they cannot reproduce it on a 96-core amd64 machine.
So it's not going to affect everybody both running PostgreSQL and upgrading to the latest kernel. Conditions seem to be: arm64, shitloads of cores, kernel 7.0, current version of PostgreSQL.
That is not going to be 100% of the installed PostgreSQL DBs out there in the wild when 7.0 lands in a few weeks.
It's a huge issue of ARM based systems, that hardly anyone uses or tests things on them (in production).
Yes, Macs going ARM has been a huge boon, but I've also seen crazy regressions on AWS Graviton (compared to how it's supposed to perform), on .NET (and node as well), which frankly I have no expertise or time to dig into.
Which was the main reason we ultimately cancelled our migration.
I'm sure this is the same reason why it's important to AWS.
Macs are actually part of the pain point with ARM64 Linux, because Linux ARM servers tend to use 64 kB pages while the Mac supports only 4 and 16, and it causes non-trivial bugs at times (funnily enough, I first encountered that in a database company...)
Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does.
With what we know so far, I expect that there are just about no real world workloads that aren't already completely falling over that will be affected.
So why does it happen only without hugepages? Is the extra overhead / TLB pressure enough to trigger the issue in some way? Or is it because the regular pages get swapped out (which hugepages can't be)?
I don't fully know, but I suspect it's just that due to the minor faults and TLB misses there is terrible contention with the spinlock, regardless of PREEMPT_LAZY, when using 4k pages (that's easily reproducible). Which is then made worse by preempting more with the lock held.
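To put rough numbers on the TLB pressure point, here is a back-of-the-envelope sketch. The 100 GiB region size is a made-up figure, not something from the thread, and 2 MiB is assumed as the huge page size (it differs on some configurations).

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes, page_bytes):
    """Number of pages required to map a region, rounded up."""
    return -(-region_bytes // page_bytes)  # ceiling division


# Hypothetical shared memory region, chosen only for illustration.
region = 100 * GIB

small_pages = pages_needed(region, 4 * KIB)  # 26,214,400 mappings
huge_pages = pages_needed(region, 2 * MIB)   # 51,200 mappings
print(small_pages, huge_pages, small_pages // huge_pages)
```

That is 512 times more mappings to fault in and keep in the TLB with 4k pages, which is at least consistent with minor faults and TLB misses dominating in the no-huge-pages case.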
So perhaps this is a regression specifically in the arm64 code, or said differently maybe it’s a performance bug that has been there for a long time but covered up by the scheduler part that was removed?
Turns out the amd machine had huge pages enabled and after disabling those the regression was there on amd too. So arm vs amd was a red herring.
Of course not a nice regression, but you should not run PostgreSQL on large servers without huge pages enabled, so this regression will only hurt people who have a bad configuration. That said, I think these bad configurations are common out there, especially in containerized environments where the one running PostgreSQL may not have the ability to enable huge pages.
That should be obvious to anyone who read the initial message. The regression was caused by a configuration change that changed the default from PREEMPT_NONE to PREEMPT_LAZY. If you don’t know what those options do, use the source. (<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...>)
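If you want to see which preemption model a given kernel was built with, a small sketch like the one below can scan the text of a kernel config file. The function name is hypothetical, and where the config lives is an assumption that varies by distro (often `/boot/config-<release>`, sometimes `/proc/config.gz`). It also only reflects the build-time choice; on kernels built with dynamic preemption the model in effect at runtime may differ.

```python
# Kconfig symbols for the preemption models this sketch looks for.
PREEMPT_OPTIONS = (
    "CONFIG_PREEMPT_NONE",
    "CONFIG_PREEMPT_VOLUNTARY",
    "CONFIG_PREEMPT",
    "CONFIG_PREEMPT_LAZY",
    "CONFIG_PREEMPT_RT",
)


def enabled_preempt_options(config_text):
    """Return the preemption model symbols set to 'y' in kernel config text."""
    enabled = []
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name in PREEMPT_OPTIONS and value == "y":
            enabled.append(name)
    return enabled
```

Calling it on the contents of the config file for a kernel with the new default would be expected to return `["CONFIG_PREEMPT_LAZY"]`.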
Yes, I had a good laugh at that. It might technically be a regression, but not one that most people will see in practice. Pretty weird that someone at Amazon is bothering to run those tests without hugepages.
I doubt they explicitly said "I'll run without huge pages, which is an important AWS configuration". They probably just forgot a step. And "someone at Amazon" describes a lot of people; multiply your mental probability tables accordingly.
For production Postgres, I would assume it’s close to no effect?
If someone is running Postgres in a serious backend environment, I doubt they are using Ubuntu or even touching 7.x for months (or years). It’ll be some flavor of Debian or Red Hat still on 6.x (maybe even 5?). Those same users won’t touch 7.x until there have been months of testing by distros.
Ubuntu is used in many serious backend environments. Heroku runs tens of thousands (if not more) instances of Ubuntu on its fleet. Or at least it did through the teens and early 2020s.
and they are right, this is because a lot of junior sysadmins believe that newer = better.
But the reality:
a) may get irreversible upgrades (e.g. new underlying database structure)
b) permanent worse performance / regression (e.g. iOS 26)
c) added instability
d) new security issues (litellm)
e) time wasted migrating / debugging
f) may need rewrite of consumers / users of APIs / sys calls
g) potential new IP or licensing issues
etc.
A couple of the few reasons to upgrade something are:
a) new features provide genuine comfort or performance upgrade (or... some revert)
b) there is an extremely critical security issue
c) you do not care about stability because reverting is uneventful and production impact is nil (e.g. Claude Code)
but 99% of the time, if it ain't broke, don't fix it.
I’ve seen more 5k+-core fleets running Ubuntu in prod than not, in my career. Industries include healthcare, US government, US government contractor, marketing, finance.
.. which confirms all of my stereotypes. Looks like the AWS engineer who reported it used a m8g.24xlarge instance with 384 GB of RAM, but somehow didn't know or care to enable huge pages. And once enabling them, the performance regression disappears.
Honest question: what's the value of running the benchmark and reporting a performance regression if the author is not familiar with basic operation of the software? I'd argue that not understanding those settings disqualifies you from making statements about it.
The performance was reduced without a settings change. That is still a regression even if huge pages mitigates the problem.
I'd be curious to know if there's still a regression with hugepages turned on in older kernels.
If you are benchmarking something and the only changed variable between benchmarks is the kernel, that is useful information. Even if your environment isn't correctly set up.
Yet we're talking about postgres, specifically. The whole point is that benchmarks about postgres had better know how to configure postgres, or their conclusions will be irrelevant at best. What does redis have to do with this discussion?