Hacker News new | ask | show | jobs
by craig_asp 2481 days ago
"Pull out the names of only the columns you need instead of using SELECT * to speed things up further."

This is not performance-related in most cases. Unless the bottleneck is in the amount of data being transferred to a client over a network connection and there is a large amount of columns, or you have an index which matches the limited column list query exactly, there would be no performance difference in SELECT * vs SELECT <column list>.

In columnar dbs it does matter because the less columns you select the less data gets accessed on disk. However, this does not hold true for row stores because data is stored in such a way that the whole row gets accessed no matter how many columns get specified in the query.

There are many other good reasons why SELECT * is acceptable only in development queries, but performance is not one of them.

7 comments

Some joins may be completely eliminated if the fields coming from them are omitted from the SELECT list. This happens when the DBMS can prove (through FOREIGN KEYS) that the joined row always exists, without having to ever physically retrieve that row.

This is especially important when querying views (or inline functions, if your DBMS supports them) built on top of several other layers of views.

In terms of performance, the primary rationale for only retrieving columns you need would be to allow the database engine to select a proper index to use, specifically, one that covers all the columns you are making use of. This can actually have a massive impact on the execution plan, far more than simply fetching say 2x more data, as you point out.

In fact, I'd go so far as to say that the practice of selecting all columns makes it practically impossible to make use of covering indexes for high impact queries.

It's not about choosing the right index. The point here is that the database can only read the index (without going into the table) if all the columns you need and use in where clause are in the index. Usually index segments are smaller, so it could be faster. I am talking from the Oracle perspective, but I believe PostgreSQL would work in a similar way.
Yes, of course, indexes work like that with where clauses. However here I was referring to covering indexes, where the columns in your _select_ are also present (after the ones in the where clause / join / predicate). This prevents having to do extra seeks into the clustered index to fetch that data, and can make very significant differences, especially if row counts are high.

Clearly, this type of index is both expensive to maintain, and the benefits are lessened if you include every column in a table in the covering index. So, to actually have real benefits, you need to have carefully crafted queries which only select what they need, and these being matched to carefully crafted covering indexes. Basically, you cannot have these - which are one of the better weapons you have for performance - if you always select everything.

I am talking about INDEX FAST FULL SCAN (Oracle) for example. If you do 'select distinct last_name from users;' and you have an index on users (last_name) you do not need to read the table at all. It is enough to read the index.

Otherwise it does not matter how many columns you select, because you will have to read all the columns for every row anyway.

> Otherwise it does not matter how many columns you select, because you will have to read all the columns for every row anyway.

It does matter. An index can actually contain a cache of values for some columns. This is done with INCLUDE statement.

     CREATE INDEX idx
         ON book ( author_id )
         INCLUDE ( book_title )

This select will use only index:

       select author_id, book_title
       from books
       where author_id = 123

This select will use index, and then have to follow and fetch data out of rows:

       select *
       from books
       where author_id = 123
https://use-the-index-luke.com/blog/2019-04/include-columns-...
I fully agree. With both comments, both yours and branko_d's.

The way I read the article, it seems to suggest that by removing rows from the select list would somehow always result in better performance. In my opinion, in the majority of cases there will be no measurable impact on performance.

In some specific cases, however, yes, it can make a big difference. In postgres it also seems like selecting all columns explicitly, but out of order can be detrimental to performance:

https://www.postgresql.org/message-id/5562FE06.9030903@lab.n...

So, it depends :)

PostgreSQL can use multiple indexes. So if you have millions of rows, and 3 indexes covering the table, on 3 different columns. PostgreSQL can use any & all of those indexes to filter the data.

Selecting only the columns you need just prevents the database from having to go back to the clustered index to retrieve the remaining information that wasn't included on the index.

Selecting * from a table has no impact on the indexes used. But it can be expensive to need to go back to the clustered index to get the additional information.

> not true for row stores because data is stored in such a way that the whole row gets accessed no matter how many columns get specified in the query.

This is not true, at least not with the likes of MS SQL Server / postgres / Oracle / etc. If the query can be satisfied with just what is in an index the clustered index or heap (where all the row data is together) does not need to be referenced. Worse: if you have off-page data (NVARCHAR(MAX) columns with values that aren't very short, to give one common example in the case of SQL Server) then not only is the engine reading the CI/heap unnecessarily to serve your query but it is potentially making extra unnecessary per-value page reads on top of that.

Also for more complex queries that cause the building of intermediate results in memory or worse on disk, anything where you see a sort step in the query plan for instance, you are increasing the amount of data that needs to be spooled in this way. In some circumstances this might happen to the date more than once during a single run of your query.

Or if columns are actually the result of UDFs - they will be getting run per value per row even though the result is not needed, and that can also affect choices the query planner makes wrt parallelism options. You might not know you query is "more complex" either: do you know for sure that you are referencing a simple table or a view that is doing some jiggery-pokery to derive values? It might be a simple table now but that may change in later versions of the model/application, and now you have that extra computation happening for values you might not actually care about. It doesn't have to be something as complex as a table being replaced by a view that is doing extra work: the entity might later have a free-form note record attached to it in the form of a VARHCAR(MAX) column and those extra off-page reads are hitting your query's performance server-side (not just in terms of what travels over the network) when you don't actually need the results of those reads.

> There are many other^H^H^H^H^H good reasons why SELECT is acceptable only in development queries*

I'd say "to be avoided where possible outside of dev/test queries" as there are always exceptions where it isn't possible due to [bad] design elsewhere, but yes.

> but performance is not one of them.

This I disagree with. It can definitely affect performance. And anyway, it might not affect performance on this particular query, at least not by a measurable amount, being selective about what you select is a good habit to train into yourself.

Nobody is saying that it's not a good habit to form. It would be better to advise to review your column lists and remove redundant columns. Firstly because schemas change, and then because yes, you might hit an index (especially if you know it's there, as you should) and get that speed-up as well.

My issue with this advice is that it enforces the idea that "less columns in select list" = "less data accessed" _in all cases_, which as we all agree is not true. Even more so if you have a relatively well designed database, with no crazy amount of views on views (or any for that matter), UDFs, huge columns with binary data, etc, etc.

PS: "...result of UDFs - they will be getting run per value per row..." -> not always.. Inline tvfs get expanded, so you should probably be quite careful with all other user defined functions in SQL Server anyway.

> it enforces the idea that "less columns in select list" = "less data accessed" _in all cases_

I for one try to avoid absolutes like that. "less columns in select list = potentially less data accessed & processed" is a better wording. Along with "and there are no cases, unless there is a QP bug around the matter than I'm unaware of, where it changes things for the worse" if I'm feeling more wordy.

The problem with that though is the some (I've worked with some difficult people!) see the "potentially" as a reason to just drop in the * "for now" - their reasoning being it is a premature optimisation. Why type the extra if it won't definitely improve performance, despite the warnings that while it might not affect performance now it may in future, possibly due to changes elsewhere making this a trickier fix point to find, and even ignoring the performance issue it is good practise for other reasons.

> more so if you have a relatively well designed database

Oh, how I long for such nirvana!

> Inline tvfs get expanded

But scalar functions do not, well not until you are using 2019 which is due for release soon, and even then only some can be. And even if your SVF doesn't result in any reads, so inlining in this sense doesn't matter, there is extra CPU work going on.

(assuming SQL Server of course, details will of course differ elsewhere)

For query plans using a sort node, there can be a major difference in performance depending on the row width.
I just reject SELECT * queries by CI/CD because schema changes will change what is returned by such queries.
If you sort the result of a SELECT in such a way that can't use an index (i.e. a file-sort), you will get significant performance gains from MySQL by minimizing the columns accessed. MySQL materializes the rows then sorts them. Wider rows, more data shuffling.
This largely depends on the RDBMS. When generating estimates for memory grants, Sql Server will look at the row size as well expected counts. I’ve had particularly bad ORM queries throw together products of joined tables with all possible columns in the projection. Between the insane memory grant request and the IO throughput for retrieval, you’d be surprised at how much SELECT * can affect query performance