| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by gtrubetskoy 2244 days ago
	One problem with this article is the number of times the solution involves COUNT(DISTINCT). One of the best SQL interview questions is "Explain what is wrong with DISTINCT and how to work around it".

2 comments

0az 2244 days ago

What is wrong with DISTINCT?

link

barbegal 2244 days ago

DISTINCT generally requires the results to be sorted which has O(n^2) worst performance so it can have a big performance hit on a query. It is best to make your database structure such that queries only return distinct data. E.g. by disallowing duplicates

link

namibj 2244 days ago

If your sorting algorithm degrades to anywhere near O(n^2) in pathological cases, you're doing something wrong. And even if it's just a kind of timeout/operations-limit to detect pathological cases and just run an in-place mergesort instead. Tail latency/containing pathological data is quite important if there's any interactivity.

link

marcosdumay 2244 days ago

Nearly every time, it's a symptom of bad data normalization.

But every time, it interferes badly with any kind of locking (that's DBMS dependent, of course), and imposes a high performance penalty (on every DBMS).

link

philshem 2244 days ago

“Think before you DISTINCT”

link

barrkel 2244 days ago

In order to determine the distinct items, the items need to be deduplicated. Generally that's done in only two ways: a hash table that skips items already seen, or a sort followed by a scan that skips over duplicates. The hash table is O(1), but the sort is easier to make parallel without sharing mutable state and has more established algorithms to use when spilling to disk.

link

branko_d 2244 days ago

There is a third way: keep the data pre-sorted in the database (via an index).

link

GlennS 2244 days ago

It covers up bad queries, so you may not see an underlying data duplication problem.

Often better to group explicitly so you know what's actually going on.

link

matwood 2244 days ago

> It covers up bad queries,

Bingo. I used to work with a guy who would see duplicate results and just throw a distinct on his query. I had to keep on him to fix his queries or explain why distinct was correct in this case. My default is that distinct is almost always not the solution.

link

cultofmetatron 2244 days ago

seriously.. I'm building out some functionality using plpgsql and have used it. This is going to be haunting my dreams

link

wesd 2244 days ago

there is possible performance hit [1]. Also, it could mean that the data granularity has not been modeled well if there are duplicate rows.

https://sqlperformance.com/2017/01/t-sql-queries/surprises-a...

link

yread 2244 days ago

Huh? If you have a table with attributeid, sampleid and value how would you count how many samples have a value in any attribute? Exists subquery?

link

matwood 2244 days ago

In your example you must also have another table, 'sample' with all the samples. So yes, you would use an exists or in subquery with the table you suggested.

link