It's a little clunky, but the one I found best that just worked on Ubuntu is gscan2pdf: http://gscan2pdf.sourceforge.net/ . It can combine some of the best tools for OCR, cleanup, etc.
My main gripe is that I have a document feeder, and manually shift-selecting pages to combine them into a single document and then clicking "Save as" is far too much of a hassle. There needs to be a better flow for that.
It was initially meant to be used with insaned, but I later came up with a script to tie it all together (scan.sh), because that's faster than jamming the scan button and waiting for insaned to register. With the script, I can also queue commands, provided I'm fast enough to swap the physical pages in the flatbed scanner.
It also uses the excellent textcleaner ImageMagick script to clean up the scans and make them more OCR-friendly.
The readme isn't totally up to date: parallel isn't required anymore, and there is no mention of the scan.sh script. But when you run it, it prompts for commands. You might need to edit the scripts to set your own output directories and textcleaner location.
May I ask why? Installation is a bit cumbersome but it seems to be an outstanding program to me. I have been looking very long for something like this.
I have not tried yet how it reacts to huge amounts of data. But the best thing: NOT written in Java!
Maybe just put the scanned pdfs into a hierarchical folder system, then keep a text file at the root with comma or tab-separated location, ISO date and keywords.
Then your documents are a grep away. Maybe awk to find documents from a date range?
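For what it's worth, here is a rough sketch of that date-range lookup in Python rather than awk. It assumes a tab-separated index with three columns (location, ISO date, keywords); the function name `find_documents` and the sample paths are made up for illustration.

```python
import csv
import io
from datetime import date


def find_documents(index_text, start, end, keyword=None):
    """Return locations of indexed documents dated within [start, end].

    Assumes each index line is: location <TAB> ISO date <TAB> keywords.
    """
    matches = []
    for row in csv.reader(io.StringIO(index_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        location, iso_date, keywords = row[0], row[1], row[2]
        try:
            doc_date = date.fromisoformat(iso_date)
        except ValueError:
            continue  # skip lines whose date is not YYYY-MM-DD
        if not (start <= doc_date <= end):
            continue
        if keyword and keyword.lower() not in keywords.lower():
            continue
        matches.append(location)
    return matches


# Hypothetical index contents, only for the example.
SAMPLE_INDEX = (
    "taxes/2019/return.pdf\t2019-04-12\ttax return federal\n"
    "bills/2020/electric-jan.pdf\t2020-01-15\telectric bill utility\n"
    "bills/2020/water-mar.pdf\t2020-03-02\twater bill utility\n"
)

if __name__ == "__main__":
    for path in find_documents(SAMPLE_INDEX, date(2020, 1, 1), date(2020, 12, 31), "bill"):
        print(path)
```

Because ISO dates sort the same way as plain strings, a string comparison would also work; parsing them just rejects malformed entries early.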
Maybe someone clever could automate this with the OCR output...
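One naive way to do that automation, sketched under some assumptions: the OCR text is already available as a plain string, the first YYYY-MM-DD string in it is taken as the document date, and the most frequent longer words serve as keywords. The names `extract_keywords`, `find_iso_date`, and `index_line` are hypothetical, and the stopword list is deliberately tiny.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"this", "that", "with", "from", "your", "have", "will", "please"}


def extract_keywords(ocr_text, limit=5):
    """Return the most frequent words of four or more letters."""
    words = re.findall(r"[a-z]{4,}", ocr_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


def find_iso_date(ocr_text):
    """Return the first YYYY-MM-DD string in the text, or '' if none."""
    match = re.search(r"\b\d{4}-\d{2}-\d{2}\b", ocr_text)
    return match.group(0) if match else ""


def index_line(location, ocr_text):
    """Build one tab-separated index line: location, date, keywords."""
    keywords = " ".join(extract_keywords(ocr_text))
    return "\t".join([location, find_iso_date(ocr_text), keywords])


if __name__ == "__main__":
    text = "Invoice 2021-06-30 electric electric electric invoice payment"
    print(index_line("bills/2021/electric-jun.pdf", text))
```

Real scans rarely print dates in ISO form, so the date step would likely need more patterns or a manual fallback.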
There are, I just can't think of them at the moment. I know because I set up a book scanner with a Linux box. If I remember right, the scan/OCR/archive tools are all separate, so you would have to script them together.
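To give an idea of what scripting them together might look like, here is a sketch. The tool choice is an assumption, not what the setup above used: scanimage (SANE) for batch scanning, tesseract for OCR to searchable PDF, and pdfunite (poppler-utils) to merge the pages. `build_pipeline` and `run_pipeline` are hypothetical names, and device-specific scanner options (source, resolution) are left out on purpose.

```python
import subprocess


def build_pipeline(basename, pages):
    """Return the command lines for scan -> OCR -> merge, in order.

    Assumes scanimage numbers batch pages from 1 (its default).
    """
    if pages < 1:
        raise ValueError("pages must be at least 1")
    commands = [
        # Scan `pages` sheets into basename-1.tif, basename-2.tif, ...
        ["scanimage", "--format=tiff",
         f"--batch={basename}-%d.tif", f"--batch-count={pages}"],
    ]
    page_pdfs = []
    for n in range(1, pages + 1):
        # tesseract <image> <output base> pdf writes <output base>.pdf
        commands.append(["tesseract", f"{basename}-{n}.tif", f"{basename}-{n}", "pdf"])
        page_pdfs.append(f"{basename}-{n}.pdf")
    # Merge the per-page PDFs into one document.
    commands.append(["pdfunite", *page_pdfs, f"{basename}.pdf"])
    return commands


def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands; call run_pipeline() to actually execute them.
    for cmd in build_pipeline("letter", 2):
        print(" ".join(cmd))
```

Building the command list separately from running it makes it easy to check what would be executed before feeding paper into the scanner.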