Hacker News new | ask | show | jobs
by pgaddict 1859 days ago
It isn't solved, and no one claimed it to be solved. The scalability improvement is related to how we build MVCC snapshots (i.e. information which transactions are visible to a session). That may reduce the memory usage a bit, but it's more about CPU I think.

As for the per-connection memory usage, the big question is whether there really is a problem (and perhaps if there's a reasonable workaround). It's not quite clear to me why you think the issues in your case are are due to idle connections, but OK.

There are two things to consider:

1) The fixed per-connection memory (tracking state, locks, ..., a couple kBs or so). You'll pay this even for unused connections.

2) Per-process memory (each connection is handled by a separate thread).

It's difficult to significantly reduce (1) because that state would no matter what the architecture is, mostly. Dealing with (2) would probably require abandoning the current architecture (process per connection) and switching to threads. IMO that's unlikely to happen, because:

(a) the process isolation actually a nice thing from the developer perspective (less locking, fewer data races, ...)

(b) processes work quite fine for reasonable number of long-lived connections, and for connection pools address a lot of the other cases

(c) PostgreSQL supports a lot of platforms, some of which may not may not have very good multi-threading support (and supporting both architectures would be quite a burden)

But that's just my assessment, of course.

2 comments

I wonder if the amount of RAM used by a new process can be reduced. Code and other RO segments are shared anyway, so it's only basically the new heap and various buffers.

Reducing this amount would also run Postgres in more constrained environments.

There are two parts of this - the memory allocated by OS and internally.

At the OS level, we can't really do much, I'm afraid :-( I don't think we're wasting too much memory there, exactly because a lot of the memory is shared between processes. Which also makes it difficult to determine how much memory is actually used by the processes (the sharing makes the various metrics in ps/top are rather tricky to interpret).

As for the internal memory, it's a bit more complicated. We need a little bit of "per process" memory (per-backend entries in various internal data structures, etc.) - a couple dozen/hundred kBs, perhaps. It's hard to give a clear figure, because it depends on max_locks_per_transaction etc. This is unlikely to go away even if we switched to threads, because it's really "per session" state.

But then there are the various caches the processes keep, memory used to run queries etc. Those may be arbitrarily large, of course. The caches (with metadata about relations, indexes etc.) are usually a couple MBs at most, but yes, we might share them between threads and save some of this memory. The price for that would be the need for additional synchronization / locking, etc. The memory used to run queries (i.e. work_mem) is impossible to share between threads, of course.

There's a blog post by Andres Freund with more details: https://www.citusdata.com/blog/2020/10/08/analyzing-connecti...

Overall, there's very little chance PostgreSQL switch to threads (difficulty of such project, various drawbacks, ...). But I do agree having to run a separate connection pool may be cumbersome, etc. There was a proposal to implement integrated connection pool, which would address at least some of those problems, and I wouldn't be surprised if it happened in foreseeable future.

And this right here is why PostgreSQL will never overtake MySQL and its forks. The entire industry is sick of these excuses regarding process-per-client instead of a proper multi-threaded model. There may have been a valid argument for this 15 years ago, but not anymore.

Your definition of "reasonable number of long-lived connections" is anything but reasonable. Then "connection pools address a lot of the other cases", when a connection pool/bouncer is unwanted, unwarranted, and just adds another point of failure that needs to be deployed and maintained.

I disagree, for a number of reasons.

Firstly, it's not the goal of the PostgreSQL project to overtake MySQL or other databases, but to serve the existing/new users. This also means we're investing the development effort in a the highest benefit / effort ratio. Even if switching from process-based to thread-based model improved the per-connection overhead, the amount of work needed is so huge the benefit / effort ratio is so utterly awful no one is going to do it. There are always better ways to invest the time / effort. Especially when there are practical solution / workarounds like connection pools.

Secondly, every architecture has pros/cons, and switching from processes to threads might help in this respect but there are other consequences where the process model is superior (some of which were already mentioned). Focusing on just this particular bit while ignoring the other trade-offs is rather misleading.

And no, the arguments did not really disappear. To some extent this is about the programming model (locking etc.), and that did not really change over time. Also, PostgreSQL supports platforms, some of which may not have particularly great threading support.

I'm not claiming there are no workloads / systems that actually need that many long-lived connections without a connection pool. In my experience it's usually "We don't want to change the app, you have to change the DB!" but fine - then maybe PostgreSQL is not the right match for that application.

> Even if switching from process-based to thread-based model improved the per-connection overhead, the amount of work needed is so huge the benefit / effort ratio is so utterly awful no one is going to do it.

Then other products will emerge and overtake some of PostreSQL's marketshare in the long run. It's already happening in fact. Just like more efficient and easier to configure webservers like nginx and caddy are gaining marketshare over Apache httpd.

I love PostgreSQL and don't want to see it becoming the next Apache httpd, slowly but surely fading. Perhaps FAANGs could fund such refactor.

Perhaps a cheaper solution was to incorporate pgBouncer inside PostgreSQL so it would naturally sit in front of PostreSQL in the default installation without extra configuration.

> Then other products will emerge and overtake some of PostreSQL's marketshare in the long run. It's already happening in fact. Just like more efficient and easier to configure webservers like nginx and caddy are gaining marketshare over Apache httpd.

Maybe, we'll see.

It however assumes the other (thread-based) architecture is somewhat universally better, and I doubt that's how it works. It might help the workloads actually requiring many connections to some extent, but it's also likely to hurt other workloads for which the current architecture works just fine.

But let's assume we decide to do that - such change would be a massive shift in programming paradigm (both internally and for extensions developed by 3rd parties) and would probably require multiple years. That's a huge investment of time/effort, with a lot of complexity, risks and very limited benefits until it's done. I'd bet there'll always be a feature with better cost/benefit ratio.

So reworking the other architecture might actually gain us some users but loose others, and drain insane amount of development resources.

> Perhaps a cheaper solution was to incorporate pgBouncer inside PostgreSQL so it would naturally sit in front of PostreSQL in the default installation without extra configuration.

Yes, I already mentioned that's quite likely to happen. There has already been a patch / project to do exactly that, but it didn't make it into PG14.

> There has already been a patch / project to do exactly that, but it didn't make it into PG14.

That's great to hear!