Hacker News new | ask | show | jobs
by danielbarla 2481 days ago
In terms of performance, the primary rationale for only retrieving columns you need would be to allow the database engine to select a proper index to use, specifically, one that covers all the columns you are making use of. This can actually have a massive impact on the execution plan, far more than simply fetching say 2x more data, as you point out.

In fact, I'd go so far as to say that the practice of selecting all columns makes it practically impossible to make use of covering indexes for high impact queries.

3 comments

It's not about choosing the right index. The point here is that the database can only read the index (without going into the table) if all the columns you need and use in where clause are in the index. Usually index segments are smaller, so it could be faster. I am talking from the Oracle perspective, but I believe PostgreSQL would work in a similar way.
Yes, of course, indexes work like that with where clauses. However here I was referring to covering indexes, where the columns in your _select_ are also present (after the ones in the where clause / join / predicate). This prevents having to do extra seeks into the clustered index to fetch that data, and can make very significant differences, especially if row counts are high.

Clearly, this type of index is both expensive to maintain, and the benefits are lessened if you include every column in a table in the covering index. So, to actually have real benefits, you need to have carefully crafted queries which only select what they need, and these being matched to carefully crafted covering indexes. Basically, you cannot have these - which are one of the better weapons you have for performance - if you always select everything.

I am talking about INDEX FAST FULL SCAN (Oracle) for example. If you do 'select distinct last_name from users;' and you have an index on users (last_name) you do not need to read the table at all. It is enough to read the index.

Otherwise it does not matter how many columns you select, because you will have to read all the columns for every row anyway.

> Otherwise it does not matter how many columns you select, because you will have to read all the columns for every row anyway.

It does matter. An index can actually contain a cache of values for some columns. This is done with INCLUDE statement.

     CREATE INDEX idx
         ON book ( author_id )
         INCLUDE ( book_title )

This select will use only index:

       select author_id, book_title
       from books
       where author_id = 123

This select will use index, and then have to follow and fetch data out of rows:

       select *
       from books
       where author_id = 123
https://use-the-index-luke.com/blog/2019-04/include-columns-...
I fully agree. With both comments, both yours and branko_d's.

The way I read the article, it seems to suggest that by removing rows from the select list would somehow always result in better performance. In my opinion, in the majority of cases there will be no measurable impact on performance.

In some specific cases, however, yes, it can make a big difference. In postgres it also seems like selecting all columns explicitly, but out of order can be detrimental to performance:

https://www.postgresql.org/message-id/5562FE06.9030903@lab.n...

So, it depends :)

PostgreSQL can use multiple indexes. So if you have millions of rows, and 3 indexes covering the table, on 3 different columns. PostgreSQL can use any & all of those indexes to filter the data.

Selecting only the columns you need just prevents the database from having to go back to the clustered index to retrieve the remaining information that wasn't included on the index.

Selecting * from a table has no impact on the indexes used. But it can be expensive to need to go back to the clustered index to get the additional information.