Hacker News new | ask | show | jobs
by dspillett 2481 days ago
> not true for row stores because data is stored in such a way that the whole row gets accessed no matter how many columns get specified in the query.

This is not true, at least not with the likes of MS SQL Server / postgres / Oracle / etc. If the query can be satisfied with just what is in an index the clustered index or heap (where all the row data is together) does not need to be referenced. Worse: if you have off-page data (NVARCHAR(MAX) columns with values that aren't very short, to give one common example in the case of SQL Server) then not only is the engine reading the CI/heap unnecessarily to serve your query but it is potentially making extra unnecessary per-value page reads on top of that.

Also for more complex queries that cause the building of intermediate results in memory or worse on disk, anything where you see a sort step in the query plan for instance, you are increasing the amount of data that needs to be spooled in this way. In some circumstances this might happen to the date more than once during a single run of your query.

Or if columns are actually the result of UDFs - they will be getting run per value per row even though the result is not needed, and that can also affect choices the query planner makes wrt parallelism options. You might not know you query is "more complex" either: do you know for sure that you are referencing a simple table or a view that is doing some jiggery-pokery to derive values? It might be a simple table now but that may change in later versions of the model/application, and now you have that extra computation happening for values you might not actually care about. It doesn't have to be something as complex as a table being replaced by a view that is doing extra work: the entity might later have a free-form note record attached to it in the form of a VARHCAR(MAX) column and those extra off-page reads are hitting your query's performance server-side (not just in terms of what travels over the network) when you don't actually need the results of those reads.

> There are many other^H^H^H^H^H good reasons why SELECT is acceptable only in development queries*

I'd say "to be avoided where possible outside of dev/test queries" as there are always exceptions where it isn't possible due to [bad] design elsewhere, but yes.

> but performance is not one of them.

This I disagree with. It can definitely affect performance. And anyway, it might not affect performance on this particular query, at least not by a measurable amount, being selective about what you select is a good habit to train into yourself.

1 comments

Nobody is saying that it's not a good habit to form. It would be better to advise to review your column lists and remove redundant columns. Firstly because schemas change, and then because yes, you might hit an index (especially if you know it's there, as you should) and get that speed-up as well.

My issue with this advice is that it enforces the idea that "less columns in select list" = "less data accessed" _in all cases_, which as we all agree is not true. Even more so if you have a relatively well designed database, with no crazy amount of views on views (or any for that matter), UDFs, huge columns with binary data, etc, etc.

PS: "...result of UDFs - they will be getting run per value per row..." -> not always.. Inline tvfs get expanded, so you should probably be quite careful with all other user defined functions in SQL Server anyway.

> it enforces the idea that "less columns in select list" = "less data accessed" _in all cases_

I for one try to avoid absolutes like that. "less columns in select list = potentially less data accessed & processed" is a better wording. Along with "and there are no cases, unless there is a QP bug around the matter than I'm unaware of, where it changes things for the worse" if I'm feeling more wordy.

The problem with that though is the some (I've worked with some difficult people!) see the "potentially" as a reason to just drop in the * "for now" - their reasoning being it is a premature optimisation. Why type the extra if it won't definitely improve performance, despite the warnings that while it might not affect performance now it may in future, possibly due to changes elsewhere making this a trickier fix point to find, and even ignoring the performance issue it is good practise for other reasons.

> more so if you have a relatively well designed database

Oh, how I long for such nirvana!

> Inline tvfs get expanded

But scalar functions do not, well not until you are using 2019 which is due for release soon, and even then only some can be. And even if your SVF doesn't result in any reads, so inlining in this sense doesn't matter, there is extra CPU work going on.

(assuming SQL Server of course, details will of course differ elsewhere)