| HN Mirror



	by NewDimension 2373 days ago
	Somewhat off-topic: do you know of a library that would let me select an area of a PDF through a GUI and read only the text within those coordinates?

7 comments

ncallaway 2373 days ago

The tesseract CLI (and so I'm sure the library also) will give you hOCR output, which is an HTML-based format that gives you the text, with bounding boxes around paragraphs, lines, and individual words.

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line...

It's not quite what you want, but I think you could probably filter the output based on the selected region and pretty quickly get what you want.
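A minimal sketch of that filtering step in Python, assuming the hOCR came from something like `tesseract page.png out hocr` and that words are marked up as `ocrx_word` spans carrying a `bbox x0 y0 x1 y1` entry in their `title` attribute. The `SAMPLE` markup, the `WordCollector` class, and `words_in_region` are made up for illustration; coordinates are pixels in the OCR'd image.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (bbox, text) pairs for every ocrx_word span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of ((x0, y0, x1, y1), text)
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []
        self._depth = 0    # spans nested inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1  # e.g. per-character spans inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append((self._bbox, "".join(self._text).strip()))
        self._bbox = None


def words_in_region(hocr, region):
    """Return the words whose bbox centre lies inside region = (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    selected = []
    for (x0, y0, x1, y1), text in parser.words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if text and rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            selected.append(text)
    return selected


# Hand-written hOCR fragment, not real tesseract output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 400'>
 <span class='ocr_line' title='bbox 10 10 300 40'>
  <span class='ocrx_word' title='bbox 10 10 80 40; x_wconf 96'>Invoice</span>
  <span class='ocrx_word' title='bbox 90 10 160 40; x_wconf 93'>number</span>
 </span>
 <span class='ocr_line' title='bbox 10 200 300 230'>
  <span class='ocrx_word' title='bbox 10 200 90 230; x_wconf 91'>Total</span>
 </span>
</div>
"""

print(words_in_region(SAMPLE, (0, 0, 200, 50)))
```

Testing the centre of each word box, rather than requiring full containment, is just one choice here; it tolerates a selection that clips a word slightly.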


narayanans 2372 days ago

Try tabula[0]

It is open source and runs on Java. You can also extract the areas of interest in the PDF and run it via the command line[1]. You can get more details on my blog[2] if required.

[0] https://tabula.technology/

[1] https://github.com/tabulapdf/tabula-java/wiki/Using-the-comm...

[2] https://narayanansiyer.com/Tabula/tabula/
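For the command-line route, here is a hedged sketch in Python that only builds the tabula-java invocation. As far as I recall from the tabula-java README, `--area` takes `top,left,bottom,right` in PDF points and `--pages` takes a page list or range; the jar path, the PDF name, and the helper names are placeholders, not anything from the posts above.

```python
import subprocess


def tabula_command(jar, pdf, area, pages="all", fmt="CSV"):
    """Build the argv for a tabula-java run restricted to one page area.

    area is (top, left, bottom, right), assumed to be in PDF points.
    """
    top, left, bottom, right = area
    return [
        "java", "-jar", jar,
        "--area", f"{top},{left},{bottom},{right}",
        "--pages", str(pages),
        "--format", fmt,
        pdf,
    ]


def run_tabula(jar, pdf, area, pages="all"):
    """Run tabula-java and return its stdout (requires Java and the jar)."""
    result = subprocess.run(
        tabula_command(jar, pdf, area, pages),
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Hypothetical file names; this only prints the command, it does not run it.
print(" ".join(tabula_command("tabula.jar", "report.pdf", (50, 100, 75, 300), pages="1-5")))
```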


sailfast 2372 days ago

I think the Project Naptha extension by the folks that wrote this library will do that, no? https://projectnaptha.com/

Not sure if it only reads at those coordinates vs. OCRing the whole thing (for example if you were legally prohibited from OCRing content outside a certain coordinate space), but it is selectable.


severine 2372 days ago

You could simply pipe an area screenshot to tesseract, discard the input image, and keep the tesseract output. Am I wrong?


NewDimension 2372 days ago

That sounds like a valid approach. Any idea what tools I could use to define the area and get the screenshot?


severine 2372 days ago

You possibly have one installed. Mine comes with my desktop (Xfce), and gives me a GUI and a CLI to take screenshots of the full desktop, any window, or a particular area defined by crosshairs.

There's a very popular and minimalist CLI called scrot that I think would be ideal... well, scratch that, I did a search and our question has already been asked and answered:

https://askubuntu.com/questions/280475/how-can-instantaneous...

https://stackoverflow.com/questions/21497447/ocr-on-a-screen...


mkl 2372 days ago

If I remember correctly, I did it with the ImageMagick "import" command. I found I had to add a wide white border, as Tesseract got confused near the edges of the image (this was over 10 years ago though).
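A rough sketch of that pipeline in Python, assuming ImageMagick's `import` and `convert` plus `tesseract` are on the PATH (ImageMagick 7 may want these prefixed with `magick`). The 20-pixel border is an arbitrary guess, not a value from the comment above, and the function names are made up for illustration.

```python
import subprocess


def build_pipeline(border=20):
    """Return the three commands: grab a region, pad it with white, OCR it."""
    return [
        # Interactive: drag a rectangle with the mouse; PNG goes to stdout.
        ["import", "png:-"],
        # Read the PNG from stdin and add a white border on every side.
        ["convert", "png:-", "-bordercolor", "white",
         "-border", f"{border}x{border}", "png:-"],
        # tesseract reads the image from stdin and writes text to stdout.
        ["tesseract", "stdin", "stdout"],
    ]


def run_pipeline(commands):
    """Run the commands one after another, feeding each output to the next."""
    data = None
    for argv in commands:
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8", errors="replace")


for argv in build_pipeline():
    print(" ".join(argv))
# To actually capture and OCR a region: print(run_pipeline(build_pipeline()))
```

Nothing touches the disk here, so the screenshot is discarded as soon as the text comes back.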


mdtusz 2372 days ago

I'm not sure if there's a non-GUI interface for it, but zathura does this for PDFs.


jjohansson 2373 days ago

Commercial or open source? PDFTron can do it, but they’re not an open source project.


NewDimension 2372 days ago

I prefer something I can install locally (doesn't need to be open source). I'm trying to extract text from a PDF at a certain position. The PDF is actual text, not an image, so OCR isn't strictly needed.

The goal is to draw a box using a GUI, then use those coordinates to extract text from several homogeneous pages.
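One way that workflow could look, sketched in Python under some assumptions: the box is drawn on a page image rendered at a known DPI, with the origin at the top-left, and the extraction is handed to poppler's `pdftotext`, whose `-x`, `-y`, `-W`, `-H` crop options and `-f`/`-l` page range work at 72 DPI by default (so one unit is one PDF point). The file name and function names are hypothetical.

```python
def pixels_to_points(box_px, dpi):
    """Convert (x, y, width, height) in rendered pixels to PDF points (1/72 in)."""
    scale = 72.0 / dpi
    return tuple(v * scale for v in box_px)


def pdftotext_command(pdf, box_pt, first_page, last_page):
    """Build a pdftotext argv that crops every page in the range to box_pt."""
    x, y, w, h = (round(v) for v in box_pt)  # pdftotext takes whole numbers
    return [
        "pdftotext",
        "-f", str(first_page), "-l", str(last_page),
        "-x", str(x), "-y", str(y), "-W", str(w), "-H", str(h),
        "-layout",
        pdf,
        "-",  # write the text to stdout
    ]


# Box dragged on a page rendered at 144 DPI (hypothetical numbers).
box = pixels_to_points((200, 100, 400, 50), dpi=144)
print(" ".join(pdftotext_command("report.pdf", box, 1, 5)))
```

Because the pages are homogeneous, the same box is reused for the whole page range; pages with a different layout would need their own box.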

I also have a different goal: interpreting the structure of a PDF that has visual structure (headers, sections, and subsections, all numbered). But that seems to lend itself to some sort of text parsing.
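A small sketch of the text-parsing idea in Python, assuming the text has already been extracted line by line and that headings start with a dotted number such as `2.1.3`. The regex is a guess at a typical numbering scheme, and it will also match body lines that happen to start with a number, so real documents would likely need extra checks (font size, position, a known table of contents).

```python
import re

# A dotted section number, an optional trailing dot, then the heading title.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def parse_outline(lines):
    """Return (level, number, title) for each line that looks like a heading."""
    outline = []
    for line in lines:
        match = HEADING_RE.match(line.strip())
        if match:
            number, title = match.groups()
            level = number.count(".") + 1  # "1" is level 1, "1.1" is level 2
            outline.append((level, number, title))
    return outline


# Made-up sample text.
sample = [
    "1 Introduction",
    "Some body text.",
    "1.1 Scope",
    "1.1.1. Details",
    "2 Methods",
]

for level, number, title in parse_outline(sample):
    print("  " * (level - 1) + f"{number} {title}")
```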


severine 2372 days ago

Some reading here: https://stackoverflow.com/questions/53219016/detecting-secti...


jjohansson 2372 days ago

PDFTron provides an SDK and isn't really meant as a plug-and-play end-user application. But it can accomplish what you're looking for.

Here's how to extract text from a PDF based on coordinates (this explains how to do it on the web, but it's also possible on other platforms):

https://groups.google.com/d/msg/pdfnet-webviewer/h2W3VksbQUI...

Here's how to extract a PDF's logical structure:

https://www.pdftron.com/documentation/samples/#logicalstruct...


pierre 2372 days ago

PDF.js and filtering the output. Or Par.sr with the right input module configuration.
