Hacker News new | ask | show | jobs
by atombender 3082 days ago
Why not just have a GIN index on the expression to_tsvector(body, 'english') or whatever? Then you don't need to maintain a separate column.
3 comments

Not the OP, and I haven't used PG regularly for a few years, but back then I vaguely recollect the query optimizer not behaving consistently based on whether you had a full column with a GIN (or GIST) index and a (potentially partial) index on an expression for some reason. In a nutshell it preferred using the full column with an index rather than the expression index.

Even more importantly in some circumstances, having the full column allows the optimizer to pick another index when it's totally relevant, and filter the relevant rows without needing to recompute the TSV one by one for the subset.

If your "document" is based also on columns on other table, as in the example on my article, you can't have a GIN index on your expression, but you can have a GIN index on your specified column.
The Postgres docs suggest using another column. My guess is that an expression index would be too large if it held the tokenized value of all of your FTS documents. These things can and often are entire written documents. Imagine the index size for 2 Million rows of tokenized documents at 2,000 words each.

You might be able to get away with if if you were indexing a less then large amount of very small documents.

I like your expression index idea a lot.

But if you're maintaining a separate ts_vector column, which you then index, you're creating the exact same amount of index data.

Unless you're saying that you would populate the field only on some rows and not all of them, and control this from the app. But you could do that with an expression index, too, assuming the rule is a simple, pure function:

  CREATE INDEX index_posts_on_body
  ON posts (to_tsvector(body, 'english')) 
  WHERE published = true;
or similar.