| HN Mirror



	by chrstr 1349 days ago
	> Queries using SELECT DISTINCT can now be executed in parallel.

	This sounds quite interesting, but I would assume it does not always work? I didn't see this mentioned in the linked documentation. Does anyone know when/how the parallel distinct works?

2 comments

cogman10 1349 days ago

Couldn't tell you the when, but I can tell you the how is likely how you'd expect.

Generally speaking, to do distinct you need a dictionary to look up previously seen values. To do it in parallel you need to make that dictionary thread-safe.

For Java, such a thread-safe dictionary is made by segmenting the table and synchronizing on the segments. So you'd hash your values, figure out which segment the hash targets, lock that segment, and then read/update that segment to contain the new value.

I'd assume that Postgres is doing a fairly similar trick. The only additional synchronization would be on a linked list of found values. For that, you could lock the list and update it as new values come in, sort the values after the fact, or employ a lock-free algorithm to add nodes to the list (see lock-free queue implementations).
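To make the shared-table idea concrete, here is a minimal sketch in Python, not PostgreSQL's (or Java's) actual code. The names `SegmentedSet`, `add_if_absent`, and `shared_table_distinct` are hypothetical, and the output list is guarded by a plain lock, the simplest of the three options mentioned.

```python
import threading


class SegmentedSet:
    """Hypothetical lock-striped set: one lock per segment, not one global lock."""

    def __init__(self, num_segments=16):
        self._segments = [set() for _ in range(num_segments)]
        self._locks = [threading.Lock() for _ in range(num_segments)]

    def add_if_absent(self, value):
        """Insert value; return True only if it had not been seen before."""
        # Hash the value to pick a segment, then lock only that segment.
        idx = hash(value) % len(self._segments)
        with self._locks[idx]:
            segment = self._segments[idx]
            if value in segment:
                return False
            segment.add(value)
            return True


def shared_table_distinct(rows, num_workers=4):
    """Distinct over rows using worker threads that share one segmented set."""
    seen = SegmentedSet()
    found = []                      # shared list of found values
    found_lock = threading.Lock()   # assumption: lock the list on every append

    def worker(chunk):
        for row in chunk:
            if seen.add_if_absent(row):
                with found_lock:
                    found.append(row)

    # Give each worker an interleaved slice of the input.
    chunks = [rows[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found
```

The order of `found` depends on thread scheduling, which is why sorting after the fact is one of the options.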


tpetry 1349 days ago

Most parallel operations in PG are implemented by simply splitting the dataset, working on each part independently, and merging the results. I expect the new DISTINCT to behave the same way and not work on a shared data structure.
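A rough sketch of that split / work independently / merge pattern, written in Python purely for illustration and not taken from the PostgreSQL source. The function names `local_distinct` and `split_merge_distinct` are made up, and threads stand in for parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor


def local_distinct(chunk):
    """Each worker deduplicates only its own chunk; no shared state, no locks."""
    return set(chunk)


def split_merge_distinct(rows, num_workers=4):
    """Split the input, deduplicate each part independently, then merge."""
    # Split: give every worker an interleaved slice of the input.
    chunks = [rows[i::num_workers] for i in range(num_workers)]

    # Work independently: each chunk gets its own partial distinct set.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(local_distinct, chunks))

    # Merge: a final single-threaded pass removes duplicates that ended up
    # in different chunks.
    result = set()
    for partial in partials:
        result |= partial
    return result
```

The final merge still has to deduplicate, because the same value can appear in several chunks. The parallel phase only shrinks the amount of data that reaches it.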


jeltz 1349 days ago

You are correct: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...


chrstr 1349 days ago

Thanks! It may be helpful to include this in the documentation, since I guess whether the parallel plan is used will then often depend on the numDistinctRows estimate [1].

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...


riku_iki 1349 days ago

"select distinct" is likely now syntactic sugar for "select ... group by 1", which has worked in parallel for a while.
