by fatbird 3552 days ago
I used it on PostgreSQL 9.4 for a document management system with thousands, not millions, of PDFs that got indexed on upload, and it worked extremely well at that scale--fast, and with all the basic text search features well covered (tokenization, stemming, etc.). A big win for me was that doing it well in Postgres meant the site could stay a simple Django site rather than adding another service.
2 comments

Did you store the plain text of each PDF in PostgreSQL or just the tsvector resulting from the plain text?
IIRC, I stored the plain text too, because the engine can return contextually marked-up plain text after finding a match in the tsvector.
You're right, PostgreSQL needs the plain text to highlight it with ts_headline. It's similar to Elasticsearch keeping the original document in the _source field. Thanks!
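For anyone wanting to see the shape of this, here is a minimal sketch of keeping both the plain text and the tsvector, then using ts_headline for highlighted snippets. It is not the schema from the site described above: the table and column names (documents, body, body_tsv) and the helper functions are hypothetical, the 'english' configuration is an assumption, and the %s placeholders assume a psycopg2-style DB-API cursor.

```python
# Sketch only. Table/column names and the 'english' config are assumptions,
# not details taken from the thread.
SCHEMA_STATEMENTS = (
    """
    CREATE TABLE documents (
        id       serial PRIMARY KEY,
        title    text NOT NULL,
        body     text NOT NULL,     -- extracted plain text, kept for ts_headline
        body_tsv tsvector NOT NULL  -- search vector built from body at upload
    )
    """,
    # A GIN index makes the @@ match below fast.
    "CREATE INDEX documents_body_tsv_idx ON documents USING gin (body_tsv)",
)

INSERT_SQL = """
INSERT INTO documents (title, body, body_tsv)
VALUES (%s, %s, to_tsvector('english', %s))
"""

SEARCH_SQL = """
SELECT d.id, d.title,
       ts_headline('english', d.body, query) AS snippet
FROM documents AS d,
     plainto_tsquery('english', %s) AS query
WHERE d.body_tsv @@ query
ORDER BY ts_rank(d.body_tsv, query) DESC
LIMIT %s
"""


def create_schema(cur):
    """Create the table and index (run once)."""
    for statement in SCHEMA_STATEMENTS:
        cur.execute(statement)


def index_document(cur, title, body):
    """Store the plain text and its tsvector in one insert."""
    cur.execute(INSERT_SQL, (title, body, body))


def search(cur, terms, limit=10):
    """Return (id, title, snippet) rows, best-ranked first."""
    cur.execute(SEARCH_SQL, (terms, limit))
    return cur.fetchall()
```

The match and ranking run against body_tsv, while ts_headline reads the stored body, which is why both columns are kept. By default ts_headline wraps matched words in <b> tags, so the snippet should be treated as markup when rendering.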
Curious to know since you mentioned that it was fast for thousands of PDFs... any rough timing information on some of your queries for that kind of dataset?
I'm really reaching here to recall, but the short version is that actual searches never took more than a second. All I really cared about was how noticeable a delay to expect, and it was never more than that.

A bulk import of 1,000+ PDFs took a couple of minutes to ingest. This was all on a $20/month VPS.