| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by alexpeattie 3752 days ago

I thought this was a great article, and trigram indexes are definitely an amazing tool to have in the arsenal.

I think PostgreSQL fulltext search (with tsvector, tsquery etc.) is given an unfairly bad rap though. I'm not sure that "full text search works best when the text vectors are stored in physical columns with an index" is true - in my experience there's no performance penalty to just indexing the tsvector expression - no need to worry about additional columns or triggers.

I also think the assertion that the key problem is that "full text search is that words are broken up according to the rules defined by the language of the text" is very context-dependent. In many situations that's the most awesome feature of fulltext search. Usually when I search for "cat" I'm not interested in results for catacombs or categories, but when I search for restaurants, results matching restaurant (singular) are relevant too.

I definitely see that for a use-case like GitLab's, where the data includes code, full text search's stemming would be a hindrance rather than a help.

2 comments

brightball 3752 days ago

One of the perks of using columns is that you can setup multiple columns with different dictionaries and specify which one you're searching against.

I'm not sure how easy it would be to ensure that the index would get picked up if searching on a weighted field with multiple columns specified or if it would require that the program know in advance what those columns were. Would the order matter for the index to be used?

Additionally, a column can also easily included relevant search data from related records in other tables or even datasources if desired.

link

alexpeattie 3752 days ago

There's nothing stopping you indexing multiple columns with different dictionaries either :). And indexing works beautifully with multiple weighted columns.

The related records point is true though - indexes don't help if you're trying to implement multi-table search. Although then I'd argue a materialized view (refreshed with triggers) might be better than putting more columns into one of the data tables.

link

brightball 3752 days ago

I will have to tinker with that then...

link

YorickPeterse 3752 days ago

> I'm not sure that "full text search works best when the text vectors are stored in physical columns with an index" is true

It's mostly based on my past experiences. At my previous gig we indexed quite a bit of data using PostgreSQL's full text search system and we noticed significantly improved performance when using physical columns containing text vectors over just using GIN indexes. It's been a while so the details are a bit fuzzy.

link

alexpeattie 3752 days ago

Ah ok, interesting :). We were weighing up exactly these kind of tradeoffs a few weeks ago, and the benchmarks we did put the 2 approaches more or less neck and neck, so maybe PG has managed to optimize things a bit? Or, our search is over hundreds of thousands of rows, so maybe things only slow down in the millions and above...

link

YorickPeterse 3752 days ago

If I remember correctly the tables we ran full text search on didn't have millions of rows, and we used PostgreSQL's simple/basic language (whichever one of the two it was). We did also run a similarity filter (using the pg_trgm similarity() function), maybe we also had trigram indexes on top. I can't really remember.

link