| HN Mirror

by rajaravivarma_r 780 days ago

Is it possible to extract different patterns of text from a PDF document?

For example: paragraphs, code blocks, code inlined in paragraphs, etc.

I tried Tesseract, but it recognises code blocks as tables.

There are also edge cases: for instance, paragraphs that start with an indentation are hard to differentiate from those that don't.

Appreciate any help.