Hacker News new | ask | show | jobs
by fabian2k 3551 days ago
Just from reading the documentation, the full text search features on Postgres already look pretty powerful. And it is encouraging that they are actively being worked on. I'm wondering how this compares to a dedicated search engine like Solr or Elasticsearch.

Are there huge differences in performance, features or search quality? At which scale does using Postgres for full text search still make sense?

14 comments

Having used all 3, Postgres search is my go to for most use case simply because I don't have to deal with managing deltas to an outside system and keeping things in sync. The search features are powerful and fast and PG's ability to combine multiple indexes in search results make it trivially easy to include a bit of full text search in a query right next to geographic distance filters or other conditions. You can also combine multiple types of searches on the fly if you're feeling whimsical.

IMO, the only time to reach for an outside system is when the data isn't being written to PG first (like log ingestion with elastic search) or when search is such a central part of your app that it mandates a separate dedicated system.

Are there any good options to support logic (and/or) and facets/fields with Postgres? We started using ES basically just for the "free" query language. (Obviously we would want something that is safe from sql injection.)
SQL supports logic. Either escape manually, use the templating in your driver or use an ORM.
So, "No".
I'm confused. Seems the answer is "yes"?
What about when having different PG database instances that has data you want to join on? Would you still use PG as an aggregated read-only copy of the databases or would you use for example ES?
It's situational. When you need a search across multiple data sources, standing up a dedicated search engine makes a lot more sense. Then again...PG Foreign Data wrappers would make that scenario pretty simple without the need for an aggregate.

I can't speak to performance in that situation though.

I used it at 9.4 for a document management system with thousands, not millions, of PDFs that got indexed on upload, and it worked extremely well at that scale--fast, and with all the basic text search features well-covered (tokenization, stemming, etc.). A big win for me was that doing it well in Postgres meant the site could stay a simple Django site rather than adding another service.
Did you store the plain text of each PDF in PostgreSQL or just the ts_vector resulting from the plain text?
IIRC, I stored the plain text too because the engine can return contextually marked up plaintext after finding it in the ts_vector.
You're right, PostgreSQL needs the plain text to highlight it with ts_headline. It's similar to Elasticsearch keeping the original document in the _source attribute. Thanks!
Curious to know since you mentioned that it was fast for thousands of PDFs... any rough timing information on some of your queries for that kind of dataset?
I'm really reaching here to recall, but the short version is that actual searches never took more than a second. All I really cared about was how noticeable a delay to expect, and it was never more than that.

On a bulk import of 1,000+, it took a couple minutes to ingest them. This was all on a $20/month VPS.

I have found Postgres to be good enough for search.

As in... it works well enough, and the advantage of not having to add other tech makes it a no-brainer, I've had zero support issues or customer complaints and most of my applications use full text search heavily.

The big advantage over other approaches, because it's SQL and it's there in my database where I also store users and permissions knowledge... I can permission-limit my fulltext searches.

We have been using Postgres Full Text Search for about 3 years now in production. The app is an analytics dashboard, over a set of structured and unstructured data. We have about 20M documents, with hierarchies, dimensions, but also free text elements. It does work extremely well, and having the possibility to group by as one would do in SQL is a god send for tabular or graph based data. Performance are really good, in particular due to the parallel aggregations.

We tested recently to load our index to an Elasticsearch index for one particular use case (a weighted sum of the 20M rows based on a FTS critera) where postgres was underperforming in our opinion. On the same hardware, using all available RAM and CPUs, ES took 6s and PG took 0.7s.

So far, on the 30+ queries of our dashboard tool, we have yet to find a use case that Postgres didn't handle better than Lucene based solutions.

Mind sharing a table structure from your db? I'm using ES for a project and would prefer to keep things simple (already use postgres in another part of the system).
We use Xapian to search over millions of documents. We are thinking of switching to PostgreSQL built-in FTS to simplify our system. We ran an internal benchmark which showed that PostgresSQL can be competitive with Xapian, except when you need to rank results (in that case the performance is bad).
You'll be interested in ongoing work in this area, then. Oleg & Teodor are working on a new index type (RUM indexes, no less) which will speed up ranking operations conserably.

https://lwn.net/Articles/689387/

Hi Josh. Yes, I'm aware of this new type of index. Oleg was even kind enough to answer my email asking a few question about it :-)
Does anyone have experience with ZomboDB?

"ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. In order to achieve this, ZomboDB implements Postgres' Access Method API.

In practical terms, a ZomboDB index appears to Postgres as no different than a standard btree index. As such, standard SQL commands are fully supported, including SELECT, BEGIN, COMMIT, ABORT, INSERT, UPDATE, DELETE, COPY, and VACUUM."

https://github.com/zombodb/zombodb

It looked promising when I was evaluating it a few months ago, but it's limited to use with Elasticsearch 1.x, which was not going to work for us.
As the developer, I do!

Feel free to email the mailing list (zombodb@googlegroups.com). I'd be happy to help answer any questions you might have

I had used PostgreSQL for a decade, including full-text search, but just within apps that were already storing their data in Postgres.

The time came to replace our website search (tens of thousands of pages), and we decided to try rolling our own. Someone suggested ElasticSearch, and as I read through it, it seemed to do less than PostgreSQL. I still had the hard problems of (1) spidering the site and (2) converting all the file formats (.doc, .xls, .pdf, etc.).

I ended up just putting wget on a daily cron job to spider the site. Then I ran the saved files through a hodgepodge of scripts to extract the plain text and put it into PostgreSQL.

Once it's there, it's far easier to do the rest. Postgres has its own functions to search for matches, rank the matches, give you snippets, and even highlight the search words in the snippets. It's amazing.

Searches run in a split second. Well, at first, when I was testing, they often took a few seconds. But the weird thing is that after go-live it ran faster. My best guess is that so many users caused Postgres to cache more and more of itself into RAM. The whole server is still using less than 1 GB though, and it's running Apache and Postgres for the website and all its apps.

What is the on-disk size of the table storing the plain text?
42 MB for the table, which has columns for the address, title, plain-text body, and computed text vectors for 43,000 pages (web pages and office documents of average length). Then another 100 MB for the GIN index on the text-vector column.
That's only 977 bytes per page (42 MB / 43,000 pages). Are you sure about the numbers? Maybe the plain-text body is stored in a TOAST table?

Anyway, it looks like the whole dataset can fit in RAM, which explains the excellent performance, even with relevance ranking.

Whoops! Yes, there is a corresponding TOAST table that I had to track down.

  Table: 341 MB (main table + TOAST table)
  Index: 100 MB
Yes there are huge differences between quality and performance. Apache Lucene (powers Solr and ES) is still the best by far. However if Postgres search works well enough for your use case then great. As others have said, it is one less dependency to manage.
This has been around for a while but it's a great summary of what Postgres can do: http://rachbelaid.com/postgres-full-text-search-is-good-enou...

In my experience performance is great if you're just doing text search, but if you combine that with other operators in the same SELECT it can be much slower than Elasticsearch since in many of those cases Postgres needs to fall back to a full table scan.

If you create indexes for all the columns used as filters, which is somehow what Elasticsearch does, then PostgreSQL is able to combine them (generating a bitmap for each used index and ANDing them) and you should get decent performance, don't you?
While it's ok for our purposes, I would wish for a bit better customisability of the text parser and it definitely needs better support for compound words to be perfect.

The first issue is with relation to https://www.postgresql.org/docs/9.6/static/textsearch-parser...: The documentation says

> At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications

and it really means it - changing the behaviour of this component is not possible unless you write a completely different parser in C which, while possible is no fun experience.

We're using the full text feature over product data and we're having to work around the parser sometimes too eagerly detecting email addresses and URLs which messes with properly detecting brand names which might contain some of these special characters.

The other problem is the compound support. A lot of our data is in German which like other languages likes to concatenate nouns.

For example, you'd absolutely want to find the term "Weisswürste" for the query "wurst" (note the concatenation and the added umlaut for the plural in wurst).

Traditionally, you do this using a dictionary and while Postgres has support for ispell and hunspell dictionaries, only hunspell has acceptable compound support, which in turn isn't supported by Postgres.

So we've ended up using a hacked ispell dictionary where we have to mark all known compounds which is annoying and error-prone.

Also, once you have to use a dictionary, you end up with a further issue: Loading the dictionary takes time and due to the way how Postgres currently works, it has to happen per connection. In our case, with the 20MB hacked german ispell dictionary, this takes ~0.5s which is way too long.

The solution for this is to use a connection pooler in front of Postgres. This works fine but, of course, adds more overhead.

The other solution is http://pgxn.org/dist/shared_ispell/, but I've had multiple postmaster crashes due to corrupted shared memory (thank you, Postgres, for crashing instead of corrupting data) related to that extension, so I would not recommend this for production use.

Lucene and by extension ElasticSearch has much better built-in text analysis features so we could probably fix the parser and compound issue, but that would of course mean even more additional infrastructure, plus, probably some performance issues as we, unfortunately, absolutely cannot return all the FTS matches but instead have to check them for other reasons why they must not be shown which, of course, uses the database again and I'm wary of putting all that logic somehow into ES as well.

This is why we currently deal with the postgres tsearch limitations. But sooner or later, we'd probably want to bite the bullet and go dedicated solution.

Do you rank full-text search results using something like ts_rank? If yes, do you suffer from slow queries?
We use it, but we don't suffer slow queries in our case.
I'd like to know more about your case, because my own experience is that ordering by ts_rank causes a big slowdown.

PostgreSQL documentation says: "Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."

Some PostgreSQL developers are working on improving this by using indexes only to compute the ranking, but the related patches are not done yet.

What is the size of your data set (number of rows and size on disk) and the average response time?

I do some searching with pgsql with tiny datasets and ts_rank, on a 10GB dataset of 11 million rows (mostly chat data), and get response times for ranking over all of it around 10-100ms on a cheap OVH 5€ VPS.

Query: https://paste.kde.org/pcxyg0fay | Explain: https://explain.depesz.com/s/jN3V (101ms)

Sometimes queries end up even a lot faster, for example the same as above, but searching for "c plus plus", runs in this plan + runtime: https://explain.depesz.com/s/NPOc (11ms)

Thanks a lot for sharing this!

Last time I tried, it was on a machine with a spinning disk... It looks like I should try again with a SSD, which are a lot better with regard to random access.

Your search term is "Quassel". What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.

Ordering can get expensive no matter what just base on how many things you're actually sorting. Ideally, if you can find a way to limit the size of the data set before the ranking sort you'll see a big improvement.
Currently, the only way to make ranking tolerable is to limit the size of the data set before the ranking sort. This is what I did in my tests.

But it looks like it could be possible to massively improve ranking performance by storing all necessary information directly in the GIN index, as proposed here:

https://wiki.postgresql.org/images/2/25/Full-text_search_in_...

Someone here mentioned ZomboDB[1]. Would that help you?

[1] https://github.com/zombodb/zombodb

Usually an RDBMS like PostgreSQL is used in an environment that has a different usage pattern than search. SOLR can take advantage of specializing for that type of usage. However, Russia's largest search engine, Yandex, seems to like PG http://momjian.us/main/blogs/pgblog/2016.html#September_28_2...
I doubt they use it for actual search, though. Given that core company product for end-users is Search Engine, Yandex probably has some in-house system, that Mail team can leverage for their searching needs.
For the small data use cases I've seen, Solr always returns results quickly. You're limited in how you can query it - it's modeled as one giant table, and the query syntax is much more idiosyncratic than SQL.

Postgres is really reliable, and I think a lot of the performance difference comes from robust transactions. For some use cases you can use both and replicate data or query one + the other in sequence.

It looks like Postgres Professional is working on improving FTS. Here is a relevant presentation from Oleg Bartunov about the new RUM index and its benefits for FTS:

http://www.sai.msu.su/~megera/postgres/talks/pgopen-2016-rum...

Can pgsql fts do stemming or more complex lemmatisation for languages other than English? Or ranking of results based on Okapi BM25 or similar? I was looking into this about two years ago and those were the features in favor of Lucene (basis of ES and Solr).
Pgsql can do stemming and everything it can do in English also in several dozen languages, including German, French, Spanish, and any for which you install dictionaries. It’s quite useful
As far as I am aware, it does not support BM25. See pilif's comment about multiple language support.