Hacker News new | ask | show | jobs
by danpalmer 433 days ago
> Mistake #1: Calculating tsvector On-the-Fly (Major issue)

I'm shocked that the original post being referred to made this mistake. I recently implemented Postgres FTS in a personal project, and did so by just reading the Postgres documentation on FTS following the instructions. The docs lead you through the process of creating the base unoptimized case, and then optimising it, explaining the purpose of each step and why it's faster. It's really clear that is what it's doing, and I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.

5 comments

This is not my area of expertise so take this with a grain of salt, but my initial instinct was to question why you would need to store the tsvector both in the table and in the index (because the tsvector values will in fact be stored losslessly in a GIN index).

The PG docs make it clear that this only affects row rechecks, so this would only affect performance on matching rows when you need to verify information not stored in the index, e.g. queries with weighted text or queries against a lossy GiST index. It's going to be use-case dependent but I would check if your queries need this before using up the additional disk space.

If only Postgres had Virtual Generated Columns. Not being snarky; MySQL has had them for ages, and they are a perfect fit for this: takes up essentially zero disk space, but you can index it (which is of course stored).

It is, in my mind, the single biggest remaining advantage MySQL has. I used to say that MySQL’s (really, InnoDB) clustering index was its superpower when yielded correctly, but I’ve done some recent benchmarks, and even when designing schema to exploit a clustered index, Postgres was able to keep up in performance.

EDIT: the other thing MySQL does much better than Postgres is “just working” for people who are neither familiar with nor wish to learn RDBMS care and feeding. Contrary to what the hyperscalers will tell you, DBs are special snowflakes, they have a million knobs to turn, and they require you to know what you’re doing to some extent. Postgres especially has the problem of table bloat and txid buildup from its MVCC implementation, combined with inadequate autovacuum. I feel like the docs should scream at you to tune your autovacuum settings on a per-table basis once you get to a certain scale (not even that big; a few hundred GB on a write-heavy table will do). MySQL does not have this problem, and will happily go years on stock settings without really needing much from you. It won’t run optimally, but it’ll run. I wouldn’t say the same about Postgres.

Virtual generated columns are not required to allow an index to be used in this case without incurring the cost of materializing `to_tsvector('english', message)`. Postgres supports indexing expressions and the query planner is smart enough to identify candidate on exact matches.

I'm not sure why the author doesn't use them but it's clearly pointed out in the documentation (https://www.postgresql.org/docs/current/textsearch-tables.ht...).

In other words, I believe they didn't need a `message_tsvector` column and creating an index of the form

  CREATE INDEX idx_gin_logs_message_tsvector
  ON benchmark_logs USING GIN (to_tsvector('english', message))
  WITH (fastupdate = off);
would have allowed queries of the form

  WHERE to_tsvector('english', message) @@ to_tsquery('english', 'research')
to use the `idx_gin_logs_message_tsvector` index without materializing `to_tsvector('english', message)` on disk outside of the index.

Here's a fiddle supporting it https://dbfiddle.uk/aSFjXJWz

You are correct, I missed that. In MySQL, functional indices are implemented as invisible generated virtual columns (and there is no vector index type supported yet that I'm aware of), but Postgres has a more capable approach.
TIL I wasn't aware MySQL functional indices were implemented using virtual columns [0]

[0] https://dev.mysql.com/doc/refman/8.4/en/create-index.html#cr...

I had the same question when reading the article, why not just index the expression?
Yes (very exciting!), but you won’t be able to index them, and that’s really where they shine, IMO.

Still, I’m sure they’ll get there. Maybe they’ll also eventually get invisible columns, though tbf that’s less of a problem for Postgres as it is for MySQL, given the latter’s limited data types.

You can index arbitrary expressions, though, including indexing the same expression used to define the invisible column, right?
I hope OrioleDB will succeed in replacing Postgres' high maintenance storage engine with something that just works.
MySQL logical replication isn’t quite foolproof but it’s vastly easier than anything PostgreSQL offers out of the box. (I hope I’m wrong!)
I think they’re about the same in complexity, other than that Postgres offers more options. MySQL did have logical replication long before Postgres, so I’ll give it that.

Postgres has one option for replication that is a godsend, though: copy_data. This lets you stand up a new replica without having to first do a dump / restore (assuming your tables are small enough / your disk is large enough, since the primary will be holding WAL during the initial sync). Tbf, MySQL doesn’t need that as much, because it offers parallel dump and restore, even on a single table.

I mean, technically any database with triggers can have generated columns, but PostgreSQL has had generated columns since version 13. Current version is 17.

https://www.postgresql.org/docs/current/ddl-generated-column...

I can’t think of any advantage of a virtual generated column over a generated column for something like a search index where calculating on read would be very slow.

Postgres has been able to create indexes based on the output of functions forever though, which does the job here too.

The advantage is when you want to store something for ease of use, but don’t want the disk (and memory, since pages read are loaded into the buffer pool) hit. So here, you could precompute the vector and index it, while not taking the double hit on size.
That’s the same benefit in Postgres as creating an index with the result of function.
That’s just syntax sugar for a trigger. Not really a big advantage.
It's not only an additional disk space, it also a need to sync it with main column, using triggers or whatever, and have much bigger backups. Why "initial instinct was to question", though? I don't see downsides still, unless yeah, weighted queries or need to search in joined tables etc.
It's a generated column, so there's no overhead to keep it up to date, but all generated columns in PG are stored. The corpus for text search will be stored 1x in the source column, 1x as a tsvector in the index, and an additional 1x in the generated column if you do so. That's a 50% increase in disk space.
There is still the overhead of updating that extra space.
That is definitely an issue. And that seems like a win for pg_search. And as siblings note PG 18 will have virtual generated, indexable columns, so that advantage for pg_search will go away.
I’ve been a Postgres FTS advocate for over a decade since replacing a Solr search with it and getting easier maintenance, more flexibility with queries and virtually no difference in speed.

It’s pretty great.

Elastic is on a different level for a lot of use cases, but pg is more than enough for the vast majority of workloads.

What’s the biggest scale you’ve used Postgres search for?
A site with about 300,000 users where we were still scaling it vertically.
> I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.

Small writing note, I probably would've swapped the order of those. Hanlon's Razor and all. :)

Perhaps the most generous interpretation is that the authors were writing an article for people who do the naïve thing without reading the docs. There are quite a few people in that category.
> I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.

vibe sysadminning, bro