I do some searching with pgsql with tiny datasets and ts_rank, on a 10GB dataset of 11 million rows (mostly chat data), and get response times for ranking over all of it around 10-100ms on a cheap OVH 5€ VPS.
Sometimes queries end up even a lot faster, for example the same as above, but searching for "c plus plus", runs in this plan + runtime: https://explain.depesz.com/s/NPOc (11ms)
Last time I tried, it was on a machine with a spinning disk... It looks like I should try again with a SSD, which are a lot better with regard to random access.
Your search term is "Quassel". What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.
> What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.
That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).
It’s far more problematic to actually load that data from disk.
> That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).
Yes, it works, but it is slow because the tsvector is usually big enough to be stored in a TOAST table, and this produces a lot of random access reads.
This is why there is a project of storing additional information in the GIN (term positions) in order for the index to contain all necessary information for the ranking:
> usually big enough to be stored in a TOAST table
Ah, luckily, in my case, that can’t happen – each row’s message contains one IRC message, so at most 512 bytes. That also automatically ensures we’ll never run into TOAST issues.
Yeah, if your vectors are in TOAST, you really have a huge issue with ranking. There’s no simple way to get around that, except with customized solutions like Lucene/Solr/ES
Last time I tried, it was on a machine with a spinning disk... It looks like I should try again with a SSD, which are a lot better with regard to random access.
Your search term is "Quassel". What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.