Hacker News
by
rahimnathwani
1430 days ago
Under the hood, it uses pdfminer.six (https://github.com/pdfminer/pdfminer.six), which expects the PDF's text to be stored as actual text rather than as scanned page images.
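To illustrate the point, here is a rough sketch (not the notebook's code) of how one might check whether a PDF has a usable text layer. It calls `extract_text` from `pdfminer.high_level`. The `looks_scanned` helper and its `min_chars` threshold are my own assumptions, just a crude heuristic for spotting image-only PDFs.

```python
def extract_pdf_text(path: str) -> str:
    """Return whatever text layer pdfminer.six finds in the PDF at `path`."""
    # Imported lazily so the heuristic below can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def looks_scanned(text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: almost no extracted text suggests an image-only PDF."""
    return len(text.strip()) < min_chars


if __name__ == "__main__":
    import sys

    text = extract_pdf_text(sys.argv[1])
    if looks_scanned(text):
        print("Little or no text layer found; this PDF probably needs OCR.")
    else:
        print(text[:500])
```

If the PDF is a scan, the extracted string will usually be empty or just whitespace, which is why a separate OCR step is needed for those files.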
alexcg1
1430 days ago
You mean the PDFSegmenter Executor in the notebook?
rahimnathwani
1430 days ago
Yes
alexcg1
1430 days ago
PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline.