HN Mirror


	by rahimnathwani 1430 days ago
	Under the hood, it uses https://github.com/pdfminer/pdfminer.six, which expects the PDF to contain an actual text layer rather than scanned page images.
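
To illustrate the point, here is a minimal sketch, assuming pdfminer.six is installed. `extract_text` from `pdfminer.high_level` reads the text layer, so a scanned PDF typically yields little or nothing. The helper names `has_text_layer` and `pdf_has_text_layer`, and the `min_chars` threshold, are hypothetical and only for illustration; they are not part of pdfminer.six or the notebook.

```python
def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not a pdfminer.six feature): treat the PDF
    as text-based if enough non-whitespace characters were extracted."""
    visible = sum(1 for ch in extracted if not ch.isspace())
    return visible >= min_chars


def pdf_has_text_layer(path: str) -> bool:
    """Run pdfminer.six on a file and apply the heuristic above."""
    # Imported lazily so has_text_layer() can be used without pdfminer.six.
    from pdfminer.high_level import extract_text

    return has_text_layer(extract_text(path))
```

If `pdf_has_text_layer` returns `False`, the file is probably a scan and would need OCR instead.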

You mean the PDFSegmenter Executor in the notebook?

Yes

PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline.
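
A rough sketch of that two-step flow, not the actual PDFSegmenter API: the function `segment_then_ocr` and its parameters are hypothetical, and the `ocr` callable stands in for whatever OCR engine the pipeline uses.

```python
from typing import Callable, Iterable, List


def segment_then_ocr(
    text_chunks: Iterable[str],
    images: Iterable[bytes],
    ocr: Callable[[bytes], str],
) -> List[str]:
    """Combine embedded text with OCR output from the extracted images."""
    # Step 1: keep whatever text the segmenter found in the text layer.
    results = [chunk for chunk in text_chunks if chunk.strip()]
    # Step 2: run OCR on each extracted image and keep non-empty results.
    for image in images:
        recognized = ocr(image)
        if recognized.strip():
            results.append(recognized)
    return results
```

For a fully scanned PDF, `text_chunks` would be empty and all of the output would come from the OCR step.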