Hacker News new | ask | show | jobs
by gtrubetskoy 2244 days ago
One problem with this article is the number of times the solution involves COUNT(DISTINCT).

One of the best SQL interview questions is "Explain what is wrong with DISTINCT and how to work around it".

2 comments

What is wrong with DISTINCT?
DISTINCT generally requires the results to be sorted which has O(n^2) worst performance so it can have a big performance hit on a query. It is best to make your database structure such that queries only return distinct data. E.g. by disallowing duplicates
If your sorting algorithm degrades to anywhere near O(n^2) in pathological cases, you're doing something wrong. And even if it's just a kind of timeout/operations-limit to detect pathological cases and just run an in-place mergesort instead. Tail latency/containing pathological data is quite important if there's any interactivity.
Nearly every time, it's a symptom of bad data normalization.

But every time, it interferes badly with any kind of locking (that's DBMS dependent, of course), and imposes a high performance penalty (on every DBMS).

“Think before you DISTINCT”
In order to determine the distinct items, the items need to be deduplicated. Generally that's done in only two ways: a hash table that skips items already seen, or a sort followed by a scan that skips over duplicates. The hash table is O(1), but the sort is easier to make parallel without sharing mutable state and has more established algorithms to use when spilling to disk.
There is a third way: keep the data pre-sorted in the database (via an index).
It covers up bad queries, so you may not see an underlying data duplication problem.

Often better to group explicitly so you know what's actually going on.

> It covers up bad queries,

Bingo. I used to work with a guy who would see duplicate results and just throw a distinct on his query. I had to keep on him to fix his queries or explain why distinct was correct in this case. My default is that distinct is almost always not the solution.

seriously.. I'm building out some functionality using plpgsql and have used it. This is going to be haunting my dreams
there is possible performance hit [1]. Also, it could mean that the data granularity has not been modeled well if there are duplicate rows.

https://sqlperformance.com/2017/01/t-sql-queries/surprises-a...

Huh? If you have a table with attributeid, sampleid and value how would you count how many samples have a value in any attribute? Exists subquery?
In your example you must also have another table, 'sample' with all the samples. So yes, you would use an exists or in subquery with the table you suggested.