	by shubham_saboo 1430 days ago
	Wow, this is a really cool way to build full-fledged search, and in a notebook too! Does it work end-to-end with PDF as a data structure, or do we have to run OCR and parse the text first to be able to search it? Really curious.

4 comments

alexcg1 1430 days ago

The version in the notebook is just for simple text-based PDFs. I wrote some posts on our company blog[1] about the sheer agonies of dealing with PDF as a data format, so I wanted to keep things as simple as possible for now.

That said, I'm planning future notebooks where you can perform text-to-image or image-to-image search, integrate OCR, scale it up, serve it, deploy it, etc.

[1] https://medium.com/jina-ai


shubham_saboo 1430 days ago

Awesome, will be on the lookout for that!


alexcg1 1430 days ago

We've got quite a few other notebooks for other kinds of search on the blog. Would love to hear your thoughts!


rahimnathwani 1430 days ago

Under the hood, it uses https://github.com/pdfminer/pdfminer.six which expects the text to be stored as text.
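A minimal sketch of what "expects the text to be stored as text" means in practice, assuming pdfminer.six is installed. `extract_text` is pdfminer.six's high-level entry point; the helper names and the 20-character threshold are hypothetical choices for illustration, not anything taken from the notebook.

```python
def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic: enough non-whitespace characters means the PDF stores real text."""
    return sum(not ch.isspace() for ch in extracted) >= min_chars


def extract_pdf_text(path: str) -> str:
    # Imported lazily so the heuristic above can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(path)


def needs_ocr(path: str) -> bool:
    # A scanned PDF usually yields little or no text, so it would need OCR first.
    return not has_text_layer(extract_pdf_text(path))
```

Calling `needs_ocr("some_file.pdf")` on a scanned document would typically return True, since pdfminer.six finds little or no stored text to extract.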


alexcg1 1430 days ago

You mean the PDFSegmenter Executor in the notebook?


rahimnathwani 1430 days ago

Yes


alexcg1 1430 days ago

PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline
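A rough sketch of that two-step idea in plain Python rather than actual Jina code: text chunks pass straight through, and image chunks go to an OCR function first. The `Chunk` class and `collect_searchable_text` are hypothetical stand-ins, and the `ocr` callable is whatever OCR backend you plug in.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Chunk:
    # Hypothetical stand-in for one piece a segmenter emits from a PDF.
    text: Optional[str] = None
    image: Optional[bytes] = None


def collect_searchable_text(
    chunks: Iterable[Chunk], ocr: Callable[[bytes], str]
) -> List[str]:
    """Keep extracted text as-is and run OCR only on image chunks."""
    texts: List[str] = []
    for chunk in chunks:
        if chunk.text and chunk.text.strip():
            texts.append(chunk.text.strip())
        elif chunk.image is not None:
            recognized = ocr(chunk.image).strip()
            if recognized:  # skip images where OCR found nothing
                texts.append(recognized)
    return texts
```

Passing the OCR step in as a function keeps the routing logic independent of any particular OCR library.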


spaetzleesser 1430 days ago

"PDF as a data structure"

Don't. PDF is a terrible format for storing machine-readable data. You lose a ton of information when you create the PDF, which you then have to painstakingly recover later (if that's even possible).


alexcg1 1430 days ago

I may have misworded it (if I wrote those words - PDF rots the brain and my memory likewise).

Agreed on the rest. PDFs don't store machine-readable data. Often it's just pixelated, scanned, hot-garbage dumpster-fire text.

I hate PDFs but have to work with the Satan-forsaken things. Hence the notebook. It's my little way of trying to give my little PDF-bespoked hellscape a tiny little glow-up.


spaetzleesser 1430 days ago

I probably didn’t read your comment closely enough. When I hear about PDF parsing or PDF as data, I immediately get flashbacks to a project years ago where I had to parse PDF files. I think I’m still traumatized by the experience, so whenever I hear that somebody wants to do this I just want to scream “Nooo. Don’t do this.”


alexcg1 1429 days ago

I think you and I should start a support group!


alexcg1 1430 days ago

Incidentally, Jina Hub [0] has a few OCR Executors [1][2] you could integrate into my notebook (though you'd have to do some rewiring to take images into account, since it's a text-based notebook).

[0] https://hub.jina.ai/

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai/executor/78yp7etm
