Hacker News new | ask | show | jobs
by agconti 3081 days ago
I can deeply vouch for the performance and joy of implementing Full-Text Search with PostgreSQL and Django.

I just finished a a project where we chose Postgres's FTS over using Elastic Search. At the beginning, I was worried about what performance we'd see since we choose to not use ES. But after slight performance tweaking, we had even our least performing queries under 50ms.

5 comments

You need to look at other stuff than performance - relevancy is probably the biggest thing when implementing search. Is it more relevant than what you experienced with ES?
I'd argue that relevancy is more your application application's design then the underlying system retrieving the results. For example, putting the same dataset in Postgres or ES wouldn't make one deliver more relevant results given equal configurations.

You could lean on the relevancy strategies built in to ES, but in my experience you're better off understanding what relevancy means for your dataset and implementing a strategy yourself. Your millage may vary though, I'd never advocate reimplenting something that's already provided by your chosen tool.

The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases. If you're interested you can check out Postgres' search rank and query weighting configurations.

I disagree.

I've had to build complex queries against ElasticSearch and it is specifically designed for things like this. We had custom weightings so when you searched for certain natural keys associated with each item they would rank above everything else, and that is easily do-able with ES. Simultaneously, we would weigh results according to various metadata we had attached to each entry (audio stream languages, subtitles, content owner name, genre, etc.). And finally, if you searched for the name of the media (a movie or an episode in a TV show) the user would see all the matches ranked accordingly, but again weighed according to the content owner and various language features of that media file.

You can probably hack that together with PostgreSQL, but is basically one big query in ES. PG's FTS is still great; but its use-case is slightly different.

That’s cool and I hear you; that’s a complex relevancy definition.

( Maybe surprisingly? ) This is type of query is natively supported by Postgres. That support is robust and mature, you don’t have to hack it together.

ES is a great tool and it’s clear your a fan of it. If you’re interested, I’d recommend you look into Postgres capabilities. It’s not a replacement for ES by any means ( or even a competitor to in my opinion; Postgres isn’t even distributed ). But for specific use cases, you might find that Postgres capabilities surprise you!

(I did leave out some bits that made it more complicated than I indicated.) Also, this was years ago; I know PG has improved its FTS a lot since then. ES is just a useful tool. If you can express your problem in relational terms then a RDBMS is almost always the right choice.

By the way, I am a huge fan of PG and relational databases in general; PG, especially, is a great database, and the first tool I reach for when it comes to data storage. However, we had other requirements (aside from the complexity I left out) to do with versioning and so forth that swung in favour of ES. Ultimately the problem with FTS in RDBMS, for me, boils down to doing FTS across disparate -- let's call them 'documents' -- stored across multiple tables. Basically you have to use materialised views (with manual refreshing) or complex join mechanics that affect performance. Perhaps PG 10 has improved in this area also?

In my article the "document" contain data from different table and i stored it in specific column and it's very fast with a GIN index on it.
I was about to give him the same answer but you beat me to it. It’s not jus throwing stuff in a field in PG. You can weight multiple bits of data. You can even define multiple different search vectors in separate columns if you want to use different search styles in different situations.
Thanks! I replied to the OP; but weighing, although critical to our needs, was a small part of a larger problem space we had to solve. I hate introducing new technology unless it's strictly required, but we ran into limitations that forced us to down the road of using ES.
This is true.
Thanks for your reply , my opinion is th same
> You could lean on the relevancy strategies built in to ES, but in my experience you're better off understanding what relevancy means for your dataset and implementing a strategy yourself.

Ranking is hard. You SHOULD lean on the tools available in Lucene/Solr/ES. PG's ranking tools are a joke in comparison.

> The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases.

That simply isn't true.

I don't the sense of the article is that PG FTS is better than ES , but in some situation, as the one I illustrated in my article, you can implement a the same search function with both of them, but with if have PG already in your stack configuring and using it with Django is very simple and convenient.
> For example, putting the same dataset in Postgres or ES wouldn't make one deliver more relevant results given equal configurations.

This is not the case.

The OOTB search capabilities in ElasticSearch (even by default) far, far exceed what you get in PostgreSQL FTS.

Also you're completely contradicting yourself. You say you don't advocate reimplmenting something provided by the tool but then suggest doing exactly that.

But in many situation you don't need all the ES features and you can implement a quite good and fast FTS function directly with your PostgreSQL database if you already use it in your stack.
> The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases.

Maybe in very very limited scenarios, but in general, they aren't even close. PostgreSQL doesn't take corpus frequencies into account, which makes it pretty difficult to come anywhere near the relevance ranking quality of Elasticsearch (or any proper search engine).

In order to tell whether PG vs Elastic is appropriate for your use case, you need to do an evaluation. See: https://en.wikipedia.org/wiki/Text_Retrieval_Conference

I second this. You can't let your queries take unusable amounts of time, but below a certain threshold relevancy is infinitely more important.

I'm putting together a product which has a search feature and that uses Django + MySQL and I'm struggling with relevancy. I'd happily accept 500ms queries if that guaranteed me the relevant hit would be on the first page. That's FAR more usable than 50ms queries and then the relevant hit is on page 5.

Full text search in MySQL isn’t in the same ballpark as PG. Thats not a dig at MySQL, just praise for the quality of what you get from the PG implementation.
Why mysql? Search in pg is waaay better.
Yes I think you need both of them of course and I found it on my project with Django and PostgreSQL.
Second this, even if you are a PG fanboy and a search newbie, you need to pay attention to:

1. issues with i18n and l10n tokenization. Does PG support other languages?

2. At minimum you need to support tf-idf (or something better), it doesn't look like PG supports this either.

3. For extremely dumb ranking, you can have a render/engaged column in PG. For decent production stuff you need a decision tree ranker (or GBDT).

All in all, none of these are there in PG, I'm not familiar with Solr/Lucene either, but please educate yourselves before expressing such strong opinions marketed as the absolute truth.

PG FTS support other languages https://www.postgresql.org/docs/current/static/textsearch-ps... Anyway the point of my article is not that PG FST is better than ES, but that for a quite good and fast FTS function you can use only Django and PostgreSQL and most of the time you don't need all the other ES features and at the same time your stack will be easier to build and maintain.
I think search relevancy is very important, and I wrote in my article start using PG FTS had permitted to work on search relevancy because I had more time which I used before in ES configuration and maintain another layer in my stack.
Having implemented this for a client in the past I have to agree that it is a cheaper option than ElasticSearch, especially for smaller projects with a lower number of records to index.

ElastiSearch easily gets expensive and the search suggestion is pretty bad.

What does this mean ?

ElasticSearch starts off as a small Java application that wraps the Lucene library.

Obviously heap will increase with usage and number of documents but I am still confused how it is in any way "expensive".

The stack I proposed in my article is pretty simple: Django + PostgreSQL (DB + FTS).

In other project I used Elastic for the search function: Django + PostgreSQL (DB) + Haystack + ES (FTS).

Is obvious that the second solution is more expensive.

Thanks for your feedback, I obviously agree with you, but I'm starting to plan to use PG FTS with Django also in some bigger project. I hope to write another article about it in new future.
Depends on the data right. I've seen good performance on larger tables after using GIN indexing with records that rarely needs to be updated and simple queries. I'm not a expert in PostgreSQL by any means, but reducing cost and learning something never hurts.
I totally agree with you
Any tips on what is worth tweaking?
Absolutely.

- Using a `SearchVectorField` is a must after 500K rows.

- Make keeping this field up to date easy for yourself by populating it using `SeachVector` with a Django pre_save signal or PostgreSQL trigger. This reduces CPU utilization significantly as the parsing and tokenization of the field your searching on is done a head of time.

- Adding a GIN Index on your `SearchVectorField` column will improve performance dramatically.

- You should specify your language configuration for postgres FTS parser. The default parser doesn't do much. It just removes spaces and normalize case. Specifying a language lets the parser make heavier optimizations that noticeably improve performance and the quality of results. If you need support for more then one langue, Django already makes it easy for this configuration to be dynamic.

Why not just have a GIN index on the expression to_tsvector(body, 'english') or whatever? Then you don't need to maintain a separate column.
Not the OP, and I haven't used PG regularly for a few years, but back then I vaguely recollect the query optimizer not behaving consistently based on whether you had a full column with a GIN (or GIST) index and a (potentially partial) index on an expression for some reason. In a nutshell it preferred using the full column with an index rather than the expression index.

Even more importantly in some circumstances, having the full column allows the optimizer to pick another index when it's totally relevant, and filter the relevant rows without needing to recompute the TSV one by one for the subset.

If your "document" is based also on columns on other table, as in the example on my article, you can't have a GIN index on your expression, but you can have a GIN index on your specified column.
The Postgres docs suggest using another column. My guess is that an expression index would be too large if it held the tokenized value of all of your FTS documents. These things can and often are entire written documents. Imagine the index size for 2 Million rows of tokenized documents at 2,000 words each.

You might be able to get away with if if you were indexing a less then large amount of very small documents.

I like your expression index idea a lot.

But if you're maintaining a separate ts_vector column, which you then index, you're creating the exact same amount of index data.

Unless you're saying that you would populate the field only on some rows and not all of them, and control this from the app. But you could do that with an expression index, too, assuming the rule is a simple, pure function:

  CREATE INDEX index_posts_on_body
  ON posts (to_tsvector(body, 'english')) 
  WHERE published = true;
or similar.
Very good tips ;-)
I can report the same with both Rails and Elixir. People reach for outside search tools far too quickly.
You are right
Thanks, I'm happy to read similar experience from other developers.