Hacker News
by curioussavage 3104 days ago
Any good open source desktop software with Linux support to do this? I don't see why I would personally want a web app for this.
5 comments

It's a little clunky but here's the one I found best that just worked on Ubuntu: http://gscan2pdf.sourceforge.net/ . It can combine some of the best tools for OCR/cleanup/etc.

My main gripe is that I have a document feeder, and manually shift-selecting pages to combine into a single document and then clicking "Save as" is far too much of a hassle. There needs to be a better flow for that.

I wrote a collection of bash scripts for that. https://github.com/coaxial/insaned-config

It was initially meant for use with insaned, but I later came up with a script to tie it all together (scan.sh) because it's faster than jamming the scan button and waiting for insaned to register. And with the script, I can queue commands, provided I'm fast enough to swap the physical pages in the flatbed scanner.

It also uses the excellent textcleaner ImageMagick script to clean up the scans and make them more OCR-friendly.

The readme isn't totally up to date: parallel isn't required anymore, and there is no mention of the scan.sh script. But when you run it, it prompts for commands. You might need to edit the scripts to set your own output directories and textcleaner location.

I haven't tried this yet, but - https://openpaper.work/

Edit: tried it, it's crap.

May I ask why? Installation is a bit cumbersome but it seems to be an outstanding program to me. I have been looking very long for something like this.

I have not yet tried how it reacts to huge amounts of data. But best thing: NOT written in Java!

I installed it on Windows, so the installer was the best bit :)

Maybe it's better on Linux, but it didn't use system dialogues, the UI behaved a bit strangely and it wasn't particularly intuitive.

Maybe I'm just not the target - in a previous life I supported an HP TRIM ECM, which may have left a mark on me.

Maybe just put the scanned PDFs into a hierarchical folder system, then keep a text file at the root with comma- or tab-separated location, ISO date and keywords.

Then your documents are a grep away. Maybe awk to find documents from a date range?
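For anyone who would rather not reach for awk, the same lookup can be sketched in Python. This is only a rough sketch, assuming a tab-separated index with three columns in the order location, ISO date, comma-separated keywords; that column order, the sample paths and the function names load_index and search are all made up for illustration.

```python
import csv
import io
from datetime import date


def load_index(lines):
    """Parse tab-separated rows of: location, ISO date, comma-separated keywords."""
    entries = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        location, iso_date, keywords = row[0], row[1], row[2]
        entries.append({
            "location": location,
            "date": date.fromisoformat(iso_date),
            "keywords": {k.strip().lower() for k in keywords.split(",") if k.strip()},
        })
    return entries


def search(entries, keyword=None, start=None, end=None):
    """Return locations matching an optional keyword and an inclusive date range."""
    hits = []
    for entry in entries:
        if keyword is not None and keyword.lower() not in entry["keywords"]:
            continue
        if start is not None and entry["date"] < start:
            continue
        if end is not None and entry["date"] > end:
            continue
        hits.append(entry["location"])
    return hits


# Hypothetical index contents, standing in for the text file at the root.
SAMPLE = (
    "taxes/2016/return.pdf\t2016-04-12\ttax, irs\n"
    "home/lease.pdf\t2015-09-01\tlease, landlord\n"
    "taxes/2017/w2.pdf\t2017-01-31\ttax, w2\n"
)

entries = load_index(io.StringIO(SAMPLE))
print(search(entries, keyword="tax", start=date(2016, 1, 1), end=date(2016, 12, 31)))
```

ISO dates also sort correctly as plain strings, so a date range in awk could be a simple string comparison on the date column; parsing them here mainly catches malformed entries early.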

Maybe someone clever could automate this with the OCR output...
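One way that automation might look, as a rough sketch only: assuming the OCR text for a scan is already available as a string (from whatever OCR tool you use), pick the most frequent words as keywords and emit one tab-separated index line. The frequency heuristic, the tiny stopword list and the names keywords_from_text and index_line are my own assumptions, not something an existing tool provides.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stopword list; a real one would be longer.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "your", "are", "was"}


def keywords_from_text(text, limit=5):
    """Return up to `limit` of the most frequent non-stopword words in the text."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


def index_line(location, iso_date, ocr_text, limit=5):
    """Build one tab-separated index line: location, ISO date, keywords."""
    keywords = ",".join(keywords_from_text(ocr_text, limit))
    return "\t".join([location, iso_date, keywords])


# Hypothetical OCR output for a scanned bill.
text = "Invoice for the water bill. Water invoice due. Water usage."
print(index_line("bills/water.pdf", "2017-03-01", text, limit=3))
```

Raw word frequency is crude and OCR errors will leak into the keywords, so the generated line is probably best treated as a draft to edit by hand before appending it to the index.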

There are, I just can't think of them at the moment. I know because I set up a book scanner with a Linux box. If I remember right, the scan/OCR/archive tools are all separate, so you would have to script them together.

Well, if you have a home server, having a web app works quite well. But if you don't, then a desktop app would probably be better.