Hacker News new | ask | show | jobs
by samokhvalov 982 days ago
I checked the benchmarks and was surprised to see that native search is (a) so slow (seconds), and (b) demonstrating O(N) behavior – with indexing, it should not happen at all.

Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks index for the native case, leading to a false statement the that native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.

https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e... – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);

This mistake could be avoided if bencharks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that for the "native" case, we're dealing with SeqScan, not IndexScan.

GINs are very fast. They are designed to be very fast for search – but they have a problem with slower UPDATEs in some cases.

Another point, fuzzy search also exists, via pg_trgm. Of course, dealing with these things require understanding, tuning, and usually a "lego game" to be played – building products out of the existing (or new) "bricks" totally makes sense to me.

4 comments

One of the ParadeDB authors here, hey! Thanks for pointing this out, you're completely right. That's an oversight on our end. We'll update the benchmarks and re-run them to correct this :)
Great to hear, a benchmark against trigram searching with gin index would also be great. There are multiple ways to do full text search with postgres and they’re all insanely fast and memory efficient. Benchmarking various methods for comparison would be helpful.

https://www.crunchydata.com/blog/postgres-full-text-search-a...

Thanks for sharing, will look to add a benchmark for that as well
I learned the hard way that Gin updates are too slow, and in my case it was not even 100 updates per seconds on average, but could peak to 1000.

How does Pg_bm25 compare here with maintaining the index & performance?

If I am understanding your experience correctly the colloquial wisdom here is to use GIN on static data and GIST on dynamic data.

> In choosing which index type to use, GiST or GIN, consider these performance differences:

> GIN index lookups are about three times faster than GiST

> GIN indexes take about three times longer to build than GiST

> GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 54.3.1 for details)

> GIN indexes are two-to-three times larger than GiST indexes

> As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.

https://www.postgresql.org/docs/9.1/textsearch-indexes.html

This sort of thing is more common with postgres than you'd think. I interviewed a candidate once whose company completely replaced querying in their postgres with elasticsearch because they could not figure out how to speed up certain text search queries. Nothing they tried would use the index.
"Preferred Index Types for Text Search" https://www.postgresql.org/docs/current/textsearch-indexes.h... :

> There are two kinds of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.

I had same thought as soon as I read the article, with a gin index the benchmarks would be wildly different and not sure why they didn’t compare against that. Of course a non indexed search is going to be slow.

I was looking for comparison against a gin index specifically, without it pros/cons unclear.

I still can't figure out how pg_trgm is supposed to work for multi-term searches and how to ensure the dictionary table it needs stays up-to-date. Is there a good writeup somewhere?