| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by quassnoi 5883 days ago

> He's cheating a bit. He created two tables each with 1 million rows, offset by 999,000. (i.e. 1 - 1,000,000 and 999,000 - 1,999,000)

Of course this is cheating, but only to demonstrate the point :)

> But this is very unlikely in the real world. In the real world you use sequential IDs, and almost all the IDs exist. So seek is not useful.

Yes, and that's why the HASH JOIN is almost always used instead by the real optimizers :)

However, there are situations when the gaps are expected. Say, the right recordset is small compared to the left one. Usually, a NESTED LOOPS is used in this case with the right recordset leading, but each loop uses a full-fledged seek from the top of the B+Tree which can be inefficient when the right recordset is not so small.

Using a statistics-assisted MERGE JOIN in this case would help to mix the seeks and the scans.

Say, if the statistics say "current right is 500, current left is 300, there are 40000 records from 0 to 1000 in the left recordset", we know that most probably we'll need to skip 8000 records to reach the right pointer and we better seek.

But if on next step it will say "current right is 501, current left is 500, there are 40000 records from 0 to 1000 in the left recordset", we know that there are only 40 records left which are most probably in one page and we can scan instead.

It can be thought of as an improved NESTED LOOPS which caches the current position.