Hacker News new | ask | show | jobs
by talolard 1972 days ago
I have a few suggestions regarding search performance.

As others have mentioned, there isn't one "right answer" but there are a few nifty tricks you can use to boost performance.

You could try a trigram index[0] as the primary search index. You'd loose stemming, but gain language support and indexed regular expression search which is a nice power user feature.

Looking at your where clause, did you index language and textsearchable_index_col together ? If not, you are giving the planner the opportunity to skip your full text index.

if you order by ts_rank_cd then the query will sort all rows, including those that don't match. A better pattern is to take the ranking score as a column and then sort it in a subquery. [1] from stack overlfow has an example. (As an aside, from pg2, CTEs are no longer an optimization fence, so you can write the query out more cleanly with a CTE and still get the desired performance).

You should experiment with GIN vs GIST indices. GIN are larger, so on a large dataset the index might not fit in memory. You could pay more to have more memory, but worth trying a GIST index to see if that makes things faster just because it fits in memory.

A final frontend comment, I'm a fan of infinite scroll for this kind of stuff. You already have pagination effectively set up, you could spoil me as a user with infinite scroll. react-virtualized[2] is a great library for that.

Hope that helps.

[0] https://www.postgresql.org/docs/9.6/pgtrgm.html [1] https://dba.stackexchange.com/questions/4054/ts-rank-massive... [2] https://github.com/bvaughn/react-virtualized

1 comments

Thanks for the suggestions! I will try them next.

> Looking at your where clause, did you index language and textsearchable_index_col together ? If not, you are giving the planner the opportunity to skip your full text index.

Here is the index:

  update gutenberg.paragraphs
  set textsearchable_index_col = to_tsvector(b.cfgname::regconfig, coalesce(paragraph, ' '))
  from
  gutenberg.all_data a 
  inner join pg_ts_config b on lower(a.language) = b.cfgname 
  where gutenberg.paragraphs.num = a.num;
  create index textsearch_paragraph_idx on gutenberg.paragraphs using gin (textsearchable_index_col);
> if you order by ts_rank_cd then the query will sort all rows, including those that don't match. A better pattern is to take the ranking score as a column and then sort it in a subquery. [1] from stack overlfow has an example. (As an aside, from pg2, CTEs are no longer an optimization fence, so you can write the query out more cleanly with a CTE and still get the desired performance).

Ranking was already done as a column in a CTE and ranked further down:

  with paragraphs as (
                select 
                num
                , paragraph
                , ts_rank_cd(textsearchable_index_col, phraseto_tsquery(%s::regconfig, %s), 32) as rank 
                , ts_headline(%s, paragraph, phraseto_tsquery(%s::regconfig, %s), 'MaxFragments=1000, StartSel=**, StopSel=**') as highlighted_result 
                from gutenberg.paragraphs 
                where language = %s::regconfig and textsearchable_index_col @@ phraseto_tsquery(%s::regconfig, %s)
            ) 
I did rewrite the query as per your [1] but EXPLAIN ANALYZE found no difference.

Pagination is a tricky one. I was in a rush and hit an issue where building large Dash tables was very expensive and prone to crashing. I initially set up pagination as per the Dash examples, but that didn't play nice with markdown which I needed for the URLs and highlighting (giving these up for plain text tables made in HN-fast, but it confused my trial users). So the quick and dirty solution "to ship now" was to set up custom pagination in the SQL and show the results.

I think that if there is enough interest in this project, the front end will have to move off Dash.