Hacker News new | ask | show | jobs
On “order by” optimization (dom.as)
33 points by 719Ben 3972 days ago
4 comments

In short, if I understood, the author used

SELECT c,d FROM t WHERE b=X ORDER BY a DESC LIMIT 1

on MySQL and was surprised to find that MySQL was using index over a and in his opinion fully ignored the existence of the index over (b,c) to discover where b is equal to X? I admit I didn't understand his explanation why, especially not why it would maybe do the right thing with the LIMIT 2. I can't imagine that any serious SQL engine would ignore the possibility to use the index in such a case.

LIMIT 2 is also exposed to this issue, at different data ratios (in my tests up to LIMIT 142, I went through the math).
Can you please explain: Does MySQL use index over (b,c) or not where evaluating "where b=X" or not? As I've said, I'd expect any engine which has an index to use it.
Its usually better to be explicit and use aggregates like MAX or MIN instead of a LIMIT of 1.
Can you elaborate on that with an example? I can't imagine when running MAX() or MIN() could be better than a static digit, so an example would help me understand your comment!
As in SELECT MAX(a), c, d FROM t WHERE b=x?

MySQL will return a random c and d, not necessarily the c and d from the row with maximum a...

Goldenkey actually posted

SELECT c,d FROM t WHERE a=(SELECT MAX(a) WHERE b=X from t)

But that post got flagged because he also gave his opinion on the level of understanding the article writer has. The query you posted is actually the one by the article writer.

I see. Thanks.
Right. MySQL's optimizer knows how many records each table has, but not much else. It doesn't do statistical tests on the tables. Depending on table contents, "WHERE b=X" can select any number of records, from zero to the whole database. Whether the optimal strategy is to do the "WHERE b=X" first, possibly getting a large number of hits, or simply making a sequential pass, is not decidable from the database size only.

An alternative would be for the optimizer to do an indexed SELECT on b for "WHERE b=X", and if the number of hits exceeded some threshold, abandon that effort as inefficient and start over using a sequential pass. But MySQL doesn't abandon strategies once selected, or keep such statistics as a hint for future queries. As someone pointed out, Oracle's flagship database product does do that. Anyone know about Postgres?

>>Unfortunately, optimizer doesn’t understand, that there’re humans who are building these systems, and humans have their own rationale and thinking. One of the ideas that a human would have is that if you have 100GB table, you better understand things like data locality and other sorts of efficiencies.

I have not used MySql and cannot comment for that but with my experience with Oracle, i believe its histogram generation process is actually trying to solve the same problem.It has a binning strategy for histogram generation that tries to help with data patterns.

http://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm

Great read!