Hacker News new | ask | show | jobs
by ngrilly 3550 days ago
I'd like to know more about your case, because my own experience is that ordering by ts_rank causes a big slowdown.

PostgreSQL documentation says: "Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."

Some PostgreSQL developers are working on improving this by using indexes only to compute the ranking, but the related patches are not done yet.

What is the size of your data set (number of rows and size on disk) and the average response time?

2 comments

I do some searching with pgsql with tiny datasets and ts_rank, on a 10GB dataset of 11 million rows (mostly chat data), and get response times for ranking over all of it around 10-100ms on a cheap OVH 5€ VPS.

Query: https://paste.kde.org/pcxyg0fay | Explain: https://explain.depesz.com/s/jN3V (101ms)

Sometimes queries end up even a lot faster, for example the same as above, but searching for "c plus plus", runs in this plan + runtime: https://explain.depesz.com/s/NPOc (11ms)

Thanks a lot for sharing this!

Last time I tried, it was on a machine with a spinning disk... It looks like I should try again with a SSD, which are a lot better with regard to random access.

Your search term is "Quassel". What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.

> What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.

That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).

It’s far more problematic to actually load that data from disk.

> That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).

Yes, it works, but it is slow because the tsvector is usually big enough to be stored in a TOAST table, and this produces a lot of random access reads.

This is why there is a project of storing additional information in the GIN (term positions) in order for the index to contain all necessary information for the ranking:

https://wiki.postgresql.org/images/2/25/Full-text_search_in_...

> usually big enough to be stored in a TOAST table

Ah, luckily, in my case, that can’t happen – each row’s message contains one IRC message, so at most 512 bytes. That also automatically ensures we’ll never run into TOAST issues.

Now I understand why you didn't suffer from the slow ranking issue. In my test case, text is longer and triggers the TOAST management code.
Ordering can get expensive no matter what just base on how many things you're actually sorting. Ideally, if you can find a way to limit the size of the data set before the ranking sort you'll see a big improvement.
Currently, the only way to make ranking tolerable is to limit the size of the data set before the ranking sort. This is what I did in my tests.

But it looks like it could be possible to massively improve ranking performance by storing all necessary information directly in the GIN index, as proposed here:

https://wiki.postgresql.org/images/2/25/Full-text_search_in_...