Hacker News new | ask | show | jobs
by chrstr 1349 days ago
> Queries using SELECT DISTINCT can now be executed in parallel.

This sounds quite interesting, but I would assume it does not always work? I didn't see this mentioned in the linked documentation, does someone know when/how the parallel distinct works?

2 comments

Couldn't tell you the when, but I can tell you the how is likely how you'd expect.

Generally speaking, to do distinct you need a dictionary to look up previously seen values. To do it in parallel you need to make that dictionary thread safe.

For Java, such a thread safe dictionary is made by segmenting the table and synchronizing on the segments. So you'd hash your values, figure out which segment that targets, lock that segment, and then read/update that segment to contain the new value.

I'd assume that postgres is doing a fairly similar trick, The only additional synchronization would be on a linked list of found values. In that case, you could either lock the list and update as new values come in, you could sort those values after the fact, or you could employ a lock free algorithm to add nodes to the list (see lock free queue implementations).

Most parallel operations in PG are implemented by simple merge the dataset, work independently and merge the results. I expect the new distinct to behave the same and not work on a shared data structure.
Thanks! May be helpful to include this in the documentation, since I guess it will then often depend on the numDistinctRows estimate [1] if the parallel plan is used.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...

"select distinct" is likely now syntax sugar around "select ... group by 1", which worked in parallel for a while.