| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by combatentropy 3551 days ago

I had used PostgreSQL for a decade, including full-text search, but just within apps that were already storing their data in Postgres.

The time came to replace our website search (tens of thousands of pages), and we decided to try rolling our own. Someone suggested ElasticSearch, and as I read through it, it seemed to do less than PostgreSQL. I still had the hard problems of (1) spidering the site and (2) converting all the file formats (.doc, .xls, .pdf, etc.).

I ended up just putting wget on a daily cron job to spider the site. Then I ran the saved files through a hodgepodge of scripts to extract the plain text and put it into PostgreSQL.

Once it's there, it's far easier to do the rest. Postgres has its own functions to search for matches, rank the matches, give you snippets, and even highlight the search words in the snippets. It's amazing.

Searches run in a split second. Well, at first, when I was testing, they often took a few seconds. But the weird thing is that after go-live it ran faster. My best guess is that so many users caused Postgres to cache more and more of itself into RAM. The whole server is still using less than 1 GB though, and it's running Apache and Postgres for the website and all its apps.

1 comments

ngrilly 3551 days ago

What is the on-disk size of the table storing the plain text?

link

combatentropy 3551 days ago

42 MB for the table, which has columns for the address, title, plain-text body, and computed text vectors for 43,000 pages (web pages and office documents of average length). Then another 100 MB for the GIN index on the text-vector column.

link

ngrilly 3551 days ago

That's only 977 bytes per page (42 MB / 43,000 pages). Are you sure about the numbers? Maybe the plain-text body is stored in a TOAST table?

Anyway, it looks like the whole dataset can fit in RAM, which explains the excellent performance, even with relevance ranking.

link

combatentropy 3551 days ago

Whoops! Yes, there is a corresponding TOAST table that I had to track down.

  Table: 341 MB (main table + TOAST table)
  Index: 100 MB

link