Hacker News new | ask | show | jobs
by shubham_saboo 1430 days ago
Wao, this is a really cool way to build full fledged search that too in a notebook!

Does it work end-to-end with PDF as a data structure or do we have to use OCR and parse the text first to be able to search it, really curious?

4 comments

The version in the notebook is just for simple text-based PDFs. I wrote some posts on our company blog[1] about the sheer agonies of dealing with PDF as a data format, so wanted to stick with as simple as possible for now.

That said, I'm planning future notebooks where you can perform text-to-image or image-to-image search, integrate OCR, scale it up, serve it, deploy it, etc.

[1] https://medium.com/jina-ai

Awesome, will be on the lookout for that!
We've got quite a few other notebooks for other kinds of search on the blog. Would love to hear your thoughts!
Under the hood, it uses https://github.com/pdfminer/pdfminer.six which expects the text to be stored as text.
You mean the PDFSegmenter Executor in the notebook?
Yes
PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline
"PDF as a data structure"

Don't. PDF is a terrible format for storing machine readable data. You lose a ton of Information while you create the PDF which you then painstakingly have to get back later (if that's even possible)

I may have misworded it (if I wrote those words - PDF rots the brain and my memory likewise).

Agreed on the rest. PDFs don't store machine-readable data. Often just pixelated scanned hot garbage dumpster fire text.

I hate PDFs but have to work with the satanforesaken things. Hence the notebook. It's my little way of trying to give my little PDF-bespoked-hellscape a tiny little glow-up.

I probably didn’t read your comment closely enough. When I hear about PDF parsing or PDF as data I immediately get flashbacks from a project years ago where I had to parse PDF files. I think I am still traumatized by this experience so whenever I hear somebody wants to do this I just want to scream “Nooo. Don’t do this”
I think you and I should start a support group!
Incidentally Jina Hub [0] has a few OCR Executors [1][2] you could integrate into my notebook (though you'd have to do some rewiring to take images into account since it's a text-based notebook)

[0] https://hub.jina.ai/

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai/executor/78yp7etm