| HN Mirror



	by fatbird 3552 days ago
	I used it on Postgres 9.4 for a document management system with thousands, not millions, of PDFs that got indexed on upload, and it worked extremely well at that scale--fast, and with all the basic text search features well covered (tokenization, stemming, etc.). A big win for me was that doing it well in Postgres meant the site could stay a simple Django site rather than adding another service.

2 comments

ngrilly 3552 days ago

Did you store the plain text of each PDF in PostgreSQL or just the tsvector resulting from the plain text?


fatbird 3552 days ago

IIRC, I stored the plain text too, because the engine can return contextually marked-up plain text after finding a match in the tsvector.


ngrilly 3552 days ago

You're right, PostgreSQL needs the plain text to highlight it with ts_headline. It's similar to Elasticsearch keeping the original document in the _source attribute. Thanks!
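
For readers who want to see the pattern discussed above, here is a minimal sketch, not the poster's actual code. It assumes a psycopg2-style DB-API cursor (`%s` placeholders); the table and column names (`docs`, `body`, `body_tsv`) and the helper functions are hypothetical. The plain text and the tsvector are stored side by side: the tsvector is used for matching and ranking, and the plain text is what `ts_headline` marks up.

```python
# Sketch only: docs, body, body_tsv and both helpers are hypothetical names.
# The SQL functions used (to_tsvector, plainto_tsquery, ts_rank, ts_headline)
# are standard PostgreSQL full-text search functions.

SCHEMA_SQL = """
CREATE TABLE docs (
    id       serial PRIMARY KEY,
    body     text NOT NULL,      -- plain text extracted from the PDF
    body_tsv tsvector NOT NULL   -- search vector derived from body
);
CREATE INDEX docs_body_tsv_idx ON docs USING gin (body_tsv);
"""

INSERT_SQL = """
INSERT INTO docs (body, body_tsv)
VALUES (%s, to_tsvector('english', %s))
"""

# ts_headline needs the original text, so it runs only on the rows that
# survive the inner LIMIT rather than on every matching document.
SEARCH_SQL = """
SELECT id, ts_headline('english', body, query) AS snippet, rank
FROM (
    SELECT id, body, query, ts_rank(body_tsv, query) AS rank
    FROM docs, plainto_tsquery('english', %s) AS query
    WHERE body_tsv @@ query
    ORDER BY rank DESC
    LIMIT %s
) AS hits
ORDER BY rank DESC
"""


def index_document(cursor, body):
    """Store the plain text and its search vector in one statement."""
    cursor.execute(INSERT_SQL, (body, body))


def search(cursor, terms, limit=10):
    """Return (id, highlighted snippet, rank) rows for a free-text query."""
    cursor.execute(SEARCH_SQL, (terms, limit))
    return cursor.fetchall()
```

Dropping the `body` column would still allow matching against `body_tsv`, but there would be nothing left for `ts_headline` to highlight, which is the trade-off the comments above describe.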


pumainmotion 3552 days ago

Curious to know since you mentioned that it was fast for thousands of PDFs... any rough timing information on some of your queries for that kind of dataset?


fatbird 3552 days ago

I'm really reaching here to recall, but the short version is that actual searches never took more than a second. All I really cared about was how noticeable a delay to expect, and it was never more than that.

A bulk import of 1,000+ PDFs took a couple of minutes to ingest. This was all on a $20/month VPS.
