Hacker News new | ask | show | jobs
by pilif 3554 days ago
While it's ok for our purposes, I would wish for a bit better customisability of the text parser and it definitely needs better support for compound words to be perfect.

The first issue is with relation to https://www.postgresql.org/docs/9.6/static/textsearch-parser...: The documentation says

> At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications

and it really means it - changing the behaviour of this component is not possible unless you write a completely different parser in C which, while possible is no fun experience.

We're using the full text feature over product data and we're having to work around the parser sometimes too eagerly detecting email addresses and URLs which messes with properly detecting brand names which might contain some of these special characters.

The other problem is the compound support. A lot of our data is in German which like other languages likes to concatenate nouns.

For example, you'd absolutely want to find the term "Weisswürste" for the query "wurst" (note the concatenation and the added umlaut for the plural in wurst).

Traditionally, you do this using a dictionary and while Postgres has support for ispell and hunspell dictionaries, only hunspell has acceptable compound support, which in turn isn't supported by Postgres.

So we've ended up using a hacked ispell dictionary where we have to mark all known compounds which is annoying and error-prone.

Also, once you have to use a dictionary, you end up with a further issue: Loading the dictionary takes time and due to the way how Postgres currently works, it has to happen per connection. In our case, with the 20MB hacked german ispell dictionary, this takes ~0.5s which is way too long.

The solution for this is to use a connection pooler in front of Postgres. This works fine but, of course, adds more overhead.

The other solution is http://pgxn.org/dist/shared_ispell/, but I've had multiple postmaster crashes due to corrupted shared memory (thank you, Postgres, for crashing instead of corrupting data) related to that extension, so I would not recommend this for production use.

Lucene and by extension ElasticSearch has much better built-in text analysis features so we could probably fix the parser and compound issue, but that would of course mean even more additional infrastructure, plus, probably some performance issues as we, unfortunately, absolutely cannot return all the FTS matches but instead have to check them for other reasons why they must not be shown which, of course, uses the database again and I'm wary of putting all that logic somehow into ES as well.

This is why we currently deal with the postgres tsearch limitations. But sooner or later, we'd probably want to bite the bullet and go dedicated solution.

2 comments

Do you rank full-text search results using something like ts_rank? If yes, do you suffer from slow queries?
We use it, but we don't suffer slow queries in our case.
I'd like to know more about your case, because my own experience is that ordering by ts_rank causes a big slowdown.

PostgreSQL documentation says: "Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."

Some PostgreSQL developers are working on improving this by using indexes only to compute the ranking, but the related patches are not done yet.

What is the size of your data set (number of rows and size on disk) and the average response time?

I do some searching with pgsql with tiny datasets and ts_rank, on a 10GB dataset of 11 million rows (mostly chat data), and get response times for ranking over all of it around 10-100ms on a cheap OVH 5€ VPS.

Query: https://paste.kde.org/pcxyg0fay | Explain: https://explain.depesz.com/s/jN3V (101ms)

Sometimes queries end up even a lot faster, for example the same as above, but searching for "c plus plus", runs in this plan + runtime: https://explain.depesz.com/s/NPOc (11ms)

Thanks a lot for sharing this!

Last time I tried, it was on a machine with a spinning disk... It looks like I should try again with a SSD, which are a lot better with regard to random access.

Your search term is "Quassel". What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.

> What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.

That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).

It’s far more problematic to actually load that data from disk.

Ordering can get expensive no matter what just base on how many things you're actually sorting. Ideally, if you can find a way to limit the size of the data set before the ranking sort you'll see a big improvement.
Currently, the only way to make ranking tolerable is to limit the size of the data set before the ranking sort. This is what I did in my tests.

But it looks like it could be possible to massively improve ranking performance by storing all necessary information directly in the GIN index, as proposed here:

https://wiki.postgresql.org/images/2/25/Full-text_search_in_...

Someone here mentioned ZomboDB[1]. Would that help you?

[1] https://github.com/zombodb/zombodb