by rahimnathwani 1430 days ago
Under the hood, it uses https://github.com/pdfminer/pdfminer.six, which expects the PDF to contain an actual text layer (text stored as text), not text that only exists as pixels in scanned page images.
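A rough sketch of how you might check whether a given PDF has a usable text layer before relying on pdfminer.six. `extract_text` is the real pdfminer.six high-level call; the helper names `has_text_layer` and `pdf_needs_ocr`, and the 20-character threshold, are my own assumptions for illustration and not anything the notebook does.

```python
def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the extracted string holds a meaningful amount of text.

    min_chars is an arbitrary, assumed threshold; tune it for your documents.
    """
    # Drop all whitespace (including the form feeds pdfminer emits between
    # pages) so that an empty scan does not count as text.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def pdf_needs_ocr(path: str) -> bool:
    """Hypothetical helper: True if the PDF at `path` looks image-only."""
    # Imported lazily so the pure helper above works without the dependency.
    # Requires: pip install pdfminer.six
    from pdfminer.high_level import extract_text

    return not has_text_layer(extract_text(path))
```

If `pdf_needs_ocr("scan.pdf")` comes back True, pdfminer.six found little or no text, so the pages are probably scanned images and would need OCR instead.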

You mean the PDFSegmenter Executor in the notebook?
Yes
PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline.