Hacker News new | ask | show | jobs
by tremon 1903 days ago
Avoid functions in WHERE clauses

Avoid them on the column-side of expressions. This is called sargability [1], and refers to the ability of the query engine to limit the search to a specific index entry or data range. For example, WHERE SUBSTRING(field, 1, 1) = "A" will still cause a full table scan and the SUBSTRING function will be evaluated for every row, while WHERE field LIKE "A%" can use a partial index scan, provided an index on the field column exists.

Prefer = to LIKE

And therefore this advice is wrong. As long as your LIKE expression doesn't start with a wildcard, LIKE can use an index just fine.

Filter with WHERE before HAVING

This usually isn't an issue, because the search terms you would use under HAVING can't be used in the WHERE clause. But yes, the other way around is possible, so the rule of thumb is: if the condition can be evaluated in the WHERE clause, it should be.

WITH

Be aware that not all database engines perform predicate propagation across CTE boundaries. That is, a query like this:

  WITH allRows AS (
    SELECT id,
           result = difficult_calculation(col)
    FROM table)
  SELECT result
  FROM allRows
  WHERE id = 15;
might cause the database engine to perform difficult_calculation() on all rows, not just row 15. All big databases support this nowadays, but it's not a given.

[1] https://en.wikipedia.org/wiki/Sargable

5 comments

Also, some databases allow you to index the result of a function. Oracle calls them "function-based indexes". PostgreSQL seems to call them "indexes on expressions".

And MySQL seems to support "generated columns" which can be "virtual" and can have indexes. (Although in that case the expression lives in the column definition, so it's not actually in a where clause.)

Also, I guess some databases probably let you have an index on a view, which could be another way.

So if you really need a function in your where clause, there may very well be a way to do it efficiently. Of course, the usual caveat applies that it requires more I/O to maintain more indexes.

MySQL 8.0 actually added support for functional indexes, but I found out the hard way that they don't work the same way as indexing generated columns. For me the biggest issue were the caveats around using the indexed column in conditions, which resulted in the index not being used in surprising situations. Anyways, I had to revert to generated columns which was a shame because it was a feature I had looked forward to using.

I found this article on the topic to be helpful:

https://saveriomiroddi.github.io/An-introduction-to-function...

Advice would have to be tailored to specific database technologies and probably specific versions.

For example, in Apache Impala and Spark, "Prefer = to LIKE" is good advice, especially in join conditions, where an equijoin would allow the query planner to use a Hash Join, whereas a non equijoin limits the query planner to a Nested Loop join.

This is ultimately my problem with databases. We use the term as a catchall, but every implementation is different and is unified only in that they store tables and can respond to SQL.

People treat deciding your app will have a database as a design decision when in reality it is only about 10% of a design decision.

That's a challenge with meatspace infrastructure too. You can have a standardized design for (e.g.) an airport, but "commercial jets" actually represents a wide variety of vehicles with different needs and tolerances, so all designs have to be adapted to the specific circumstances.
Postgres and the likes allows trigram indexes for like queries, and expression based indexes just fine. This argument doesn't really pass the smell test.
I learned about trigrams when working on a personal project and trying to figure out how to do full text searches on postgres.

Good blog post about it : https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...

The docs: https://www.postgresql.org/docs/current/pgtrgm.html

An argument for using the where syntax in simplish adhoc queries: It is a lot shorter and has simpler syntax. Speed is more important than maintainability in this context and it shouldn't result in errors.

Basically, instead of writing

"inner join table2 on"

you can just write "and" and put it after your other where clauses.

It doesn't result in errors because the query will fail if referring to fields from the second table when there is no join.

Could even put the join where clauses on a separate line to split them out from the other where clauses.

It does require reworking the query if not doing an inner join but that is what at least I usually want when doing ad hoc queries. Agree that the join syntax should be used in production code.

Prefer EXISTS to IN is also a bit odd, as the latter is trivially transformed to the former. The DBMS I work with does it universally.
There is a semantic difference, in the context of handling NULLs: EXISTS only checks the size of the result set, so even a subquery that returns (NULL, NULL, NULL) will satisfy the EXISTS condition. IN, on the other hand, performs data equality checks on the results, so NULL rows will never yield a positive match.

Usually, this doesn't make a difference as an EXISTS check is mostly used on the primary key (or business key) of a table, which is (hopefully) non-nullable. But it can sometimes give surprising results when using EXISTS on a nullable column, or for worse results, NOT IN.

I’ve had them get treated equally and differently both in SQL Server.