On “order by” optimization

Y	Hacker News new \| ask \| show \| jobs

	On “order by” optimization (dom.as)
	33 points by 719Ben 4018 days ago

4 comments

acqq 4018 days ago

In short, if I understood, the author used

SELECT c,d FROM t WHERE b=X ORDER BY a DESC LIMIT 1

on MySQL and was surprised to find that MySQL was using index over a and in his opinion fully ignored the existence of the index over (b,c) to discover where b is equal to X? I admit I didn't understand his explanation why, especially not why it would maybe do the right thing with the LIMIT 2. I can't imagine that any serious SQL engine would ignore the possibility to use the index in such a case.

link

midom 4017 days ago

LIMIT 2 is also exposed to this issue, at different data ratios (in my tests up to LIMIT 142, I went through the math).

link

acqq 4016 days ago

Can you please explain: Does MySQL use index over (b,c) or not where evaluating "where b=X" or not? As I've said, I'd expect any engine which has an index to use it.

link

goldenkey 4018 days ago

Its usually better to be explicit and use aggregates like MAX or MIN instead of a LIMIT of 1.

link

degenerate 4018 days ago

Can you elaborate on that with an example? I can't imagine when running MAX() or MIN() could be better than a static digit, so an example would help me understand your comment!

link

crishoj 4018 days ago

As in SELECT MAX(a), c, d FROM t WHERE b=x?

MySQL will return a random c and d, not necessarily the c and d from the row with maximum a...

link

acqq 4018 days ago

Goldenkey actually posted

SELECT c,d FROM t WHERE a=(SELECT MAX(a) WHERE b=X from t)

But that post got flagged because he also gave his opinion on the level of understanding the article writer has. The query you posted is actually the one by the article writer.

link

crishoj 4017 days ago

I see. Thanks.

link

Animats 4018 days ago

Right. MySQL's optimizer knows how many records each table has, but not much else. It doesn't do statistical tests on the tables. Depending on table contents, "WHERE b=X" can select any number of records, from zero to the whole database. Whether the optimal strategy is to do the "WHERE b=X" first, possibly getting a large number of hits, or simply making a sequential pass, is not decidable from the database size only.

An alternative would be for the optimizer to do an indexed SELECT on b for "WHERE b=X", and if the number of hits exceeded some threshold, abandon that effort as inefficient and start over using a sequential pass. But MySQL doesn't abandon strategies once selected, or keep such statistics as a hint for future queries. As someone pointed out, Oracle's flagship database product does do that. Anyone know about Postgres?

link

th3iedkid 4018 days ago

>>Unfortunately, optimizer doesn’t understand, that there’re humans who are building these systems, and humans have their own rationale and thinking. One of the ideas that a human would have is that if you have 100GB table, you better understand things like data locality and other sorts of efficiencies.

I have not used MySql and cannot comment for that but with my experience with Oracle, i believe its histogram generation process is actually trying to solve the same problem.It has a binning strategy for histogram generation that tries to help with data patterns.

http://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm

link

zapt02 4018 days ago

Great read!

link