Hacker News new | ask | show | jobs
by agconti 3089 days ago
Absolutely.

- Using a `SearchVectorField` is a must after 500K rows.

- Make keeping this field up to date easy for yourself by populating it using `SeachVector` with a Django pre_save signal or PostgreSQL trigger. This reduces CPU utilization significantly as the parsing and tokenization of the field your searching on is done a head of time.

- Adding a GIN Index on your `SearchVectorField` column will improve performance dramatically.

- You should specify your language configuration for postgres FTS parser. The default parser doesn't do much. It just removes spaces and normalize case. Specifying a language lets the parser make heavier optimizations that noticeably improve performance and the quality of results. If you need support for more then one langue, Django already makes it easy for this configuration to be dynamic.

2 comments

Why not just have a GIN index on the expression to_tsvector(body, 'english') or whatever? Then you don't need to maintain a separate column.
Not the OP, and I haven't used PG regularly for a few years, but back then I vaguely recollect the query optimizer not behaving consistently based on whether you had a full column with a GIN (or GIST) index and a (potentially partial) index on an expression for some reason. In a nutshell it preferred using the full column with an index rather than the expression index.

Even more importantly in some circumstances, having the full column allows the optimizer to pick another index when it's totally relevant, and filter the relevant rows without needing to recompute the TSV one by one for the subset.

If your "document" is based also on columns on other table, as in the example on my article, you can't have a GIN index on your expression, but you can have a GIN index on your specified column.
The Postgres docs suggest using another column. My guess is that an expression index would be too large if it held the tokenized value of all of your FTS documents. These things can and often are entire written documents. Imagine the index size for 2 Million rows of tokenized documents at 2,000 words each.

You might be able to get away with if if you were indexing a less then large amount of very small documents.

I like your expression index idea a lot.

But if you're maintaining a separate ts_vector column, which you then index, you're creating the exact same amount of index data.

Unless you're saying that you would populate the field only on some rows and not all of them, and control this from the app. But you could do that with an expression index, too, assuming the rule is a simple, pure function:

  CREATE INDEX index_posts_on_body
  ON posts (to_tsvector(body, 'english')) 
  WHERE published = true;
or similar.
Very good tips ;-)