| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by kuschku 3550 days ago

I do some searching with pgsql with tiny datasets and ts_rank, on a 10GB dataset of 11 million rows (mostly chat data), and get response times for ranking over all of it around 10-100ms on a cheap OVH 5€ VPS.

Query: https://paste.kde.org/pcxyg0fay | Explain: https://explain.depesz.com/s/jN3V (101ms)

Sometimes queries end up even a lot faster, for example the same as above, but searching for "c plus plus", runs in this plan + runtime: https://explain.depesz.com/s/NPOc (11ms)

1 comments

ngrilly 3550 days ago

Thanks a lot for sharing this!

Last time I tried, it was on a machine with a spinning disk... It looks like I should try again with a SSD, which are a lot better with regard to random access.

Your search term is "Quassel". What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.

link

kuschku 3550 days ago

> What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.

That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).

It’s far more problematic to actually load that data from disk.

link

ngrilly 3550 days ago

> That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).

Yes, it works, but it is slow because the tsvector is usually big enough to be stored in a TOAST table, and this produces a lot of random access reads.

This is why there is a project of storing additional information in the GIN (term positions) in order for the index to contain all necessary information for the ranking:

https://wiki.postgresql.org/images/2/25/Full-text_search_in_...

link

kuschku 3550 days ago

> usually big enough to be stored in a TOAST table

Ah, luckily, in my case, that can’t happen – each row’s message contains one IRC message, so at most 512 bytes. That also automatically ensures we’ll never run into TOAST issues.

link

ngrilly 3550 days ago

Now I understand why you didn't suffer from the slow ranking issue. In my test case, text is longer and triggers the TOAST management code.

link

kuschku 3550 days ago

Yeah, if your vectors are in TOAST, you really have a huge issue with ranking. There’s no simple way to get around that, except with customized solutions like Lucene/Solr/ES

link