Hacker News new | ask | show | jobs
by MSM 4012 days ago
There shouldn't be a situation where reading the data + creating the index would be less intensive than just reading the data.

This is from the SQL server perspective so I apologize if all this doesn't translate 100%. Let's say the column you're filtering on is currently not in any index. That field then needs to be pulled out of the clustered index's leaf pages. Since it's unsorted, we'd need to read the entire table- a full scan.

To build the temporary index we'd also need to read in all of the data, so the same amount of work from that perspective, but then we'd have to sort it all, which is very intensive.

If there's a situation where we're doing something analogous to a inner loop join in SQL Server, but it's doing full scans for each iteration, that's a problem with the query optimizer.

5 comments

Temporary index and Hypothetical index are two different things.

Temporary index is a real index that's usually dynamically created by the optimizer to speed up a query. But it's kind of useless because most of the time the cost of creating a one-time index exceed the cost of not using it. Afaik only DB2 as this feature

http://www.centerfieldtechnology.com/PDFs/DB2%20Temp%20Index...

An hypotetical index is just a way to "trick" the planner into thinking that there's an index and see what the query plan would be if that index really existed. The cost of creating this is close to zero.

If you do N self-joins in a row, creating the index on the fly would be a good move, because they'd have to read through the same data multiple times. The real world example that springs to mind is comments that can nest up to e.g. 5 layers, where it kinda makes sense to grab it all in one query like that.

Obvious caveats: I would just have a 'top-level-post-id' so I could grab everything I need and use the foreign keys to arrange them in memory, not for discovering the rows I need, also wowzers doing _any_ joins without an index is silly, forget multiple in one query.

Very true. That said, there are almost always situations where reading the data + creating the index + using it N times would be less intensive than just reading the data N times, for very small N.

I'd love to see Postgres or MongoDB with an index-suggesting logger - I'd pay the cost of the logging to have it run on every query, then a background thread would figure out what the "hot spot" types of queries are, figure out what indices to create concurrently, and show it to me in a web interface where I can click a button and create the indices I need concurrently during a low-traffic period of time.

Does anything out there like this exist?

SQL Server has all this information saved (I'm not paid by Microsoft, I swear). I would suspect something is available for other engines as well.

For example, I can easily grab (what it thinks are) the top 10 most impactful missing indexes and create them. The big issue is that, well, they can be really dumb when looking at a macro level. Do I often query a 10 column table for 8 of the columns? It will likely recommend I index those 8 columns. In a vacuum that might be a great index for this specific query, but now I'm basically maintaining two copies of the full table.

To be able to weigh these pros and cons would be a really nice feat, then DBAs will have to abandon the "It Depends" mantra!

I think indexing is ultimately a trade-off design; the end effect of it is a decision which belongs to a human - what do we make faster at the expense of what?

Probably the best thing a "machine" can do is to give tools to easily perform a reasonably accurate and extensive analysis, but I think any major RDBMS has "good enough" tools to do so.

Interestingly, I think it's important to think who is the target of such tools - the casual developer? the novice dba? the expert dba who works on large datasets? I think even theoretically advanced tools would help only the first case; the other two need to know very well what they're working with.

The PoWA project (https://dalibo.github.io/powa/) + pg_qualstats extension is doing most of this work.

The pg_qualstats extension will gather statistics on WHERE clauses (number of execution, selectivity...) on the fly, and the powa UI can show you which are the most expensive queries, and suggest index creation based on your real workload. It's then your choice to create the indexes.

hypopg should be added soon to compare query EXPLAIN with and without the hypothetical indexes.

There's a python package called dex (https://github.com/mongolab/dex) that helps assist with index recommendations by looking at mongo's oplog and query response times. It's not perfect but better than nothing for alerting people to a missing index. Unfortunately I think it is not under active development anymore and beginning to fall behind with changes in mongodb.
Your scenario is valid for "index organized tables", where the data for a given row is stored in one of the index's leaf pages. Synthesizing an index on-demand for those involves an extraordinary amount of random IO.

In postgres, a table's data is stored in the "heap" (a disk file representing the table, itself; indexes are separate disk files). Consequently, you can do a sequential pass through the table, and then calculate/build the index in RAM (assuming you have sufficient "maintenance_work_mem", in postgres configuration terms, otherwise you'll spill to disk at this step, too), and then write it out sequentially as well — and only then walk the index depth-first to get to the leaf page that points to the disk page in the heap that contains the row you're selecting.

(And even that random depth-first traversal of the index will probably actually happen in RAM, because postgres significantly defers to the kernel's buffer cache to mitigate IO costs, and the index you're traversing is pretty much guaranteed to be cached at this point.)

EDIT: It would apply to MySQL using InnoDB, however, as IIRC that does store table data on the leave pages of the primary key index. Can someone corroborate that?

I think the parent refers to a specific optimization that MySQL doesn't perform:

SELECT ... FROM ... WHERE field IN ( SELECT field2 FROM ... )

MySQL doesn't acceptably optimize such queries (at v5.6, as far as I remember).

This is a more complex case than the one you're pointing - the field2 records may be an intermediate result, logically and/or physically.

For example of the latter, in a typical mysql execution with a relatively large number of records, the subquery records are likely to reside on a temporary swapped table. In this case, the optimizer would need to be able to create the intermediate table with optimal data structure[s], which (if I remember correctly) version 5.6 is not capable of.