by PureParadigm 1691 days ago
Using Pdfsandwich in college was like having a superpower. We would often be given PDFs with only image data. While my peers were still scrolling through and copying quotes by hand, I was there in seconds with Ctrl-F to find and copy/paste.

Once you have text in the PDF, you can use any sort of text analysis tools. You can use tools to convert it to plain text and grep through, or anything else you want.
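
To make the "convert to plain text and grep" step concrete, here is a minimal Python sketch. It assumes the text has already been extracted from the OCR'd PDF (for example with pdftotext) and that pages are separated by form feed characters, which is how pdftotext delimits them. The function name `grep_pages` and the sample text are made up for illustration.

```python
import re

def grep_pages(text, pattern, ignore_case=True):
    """Return (page_number, line) pairs for every line matching pattern.

    Assumes pages are separated by form feed characters, as in pdftotext output.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    hits = []
    # Split the extracted text into pages, numbering them from 1.
    for page_number, page in enumerate(text.split("\f"), start=1):
        for line in page.splitlines():
            if regex.search(line):
                hits.append((page_number, line.strip()))
    return hits

# Hypothetical two-page document standing in for real extracted text.
sample = "Intro to plants\nChlorophyll absorbs light\fPhotosynthesis happens in chloroplasts\nEnd"
for page_number, line in grep_pages(sample, "photosynthesis"):
    print(f"page {page_number}: {line}")
```

Returning the page number alongside the line is what makes this more useful than a plain grep when you need to cite where a quote came from.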

That being said, it's not perfect, but still pretty awesome. Sometimes the spacing was off or it would confuse symbols like 1, I, or l. But these are minor and usually only on poorly scanned PDFs.
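
One way to work around the 1/I/l confusion when searching is to build a regex that treats those characters as interchangeable. This is only a sketch, not something Pdfsandwich does for you; the helper name `ocr_tolerant_pattern` and the example strings are hypothetical.

```python
import re

# Groups of characters that OCR tends to mix up. Only the 1/I/l group
# mentioned above is included; extend the list for your own scans.
CONFUSABLE_GROUPS = ["1Il"]

def ocr_tolerant_pattern(query):
    """Compile a regex matching query even if confusable characters were swapped."""
    parts = []
    for ch in query:
        for group in CONFUSABLE_GROUPS:
            if ch in group:
                # Accept any character from the same confusable group.
                parts.append("[" + group + "]")
                break
        else:
            # Every other character must match literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

pattern = ocr_tolerant_pattern("Illinois")
print(bool(pattern.search("State of I11inois")))
```

This trades a few false positives for recall, which is usually the right trade when you only need to locate a passage and will read it yourself afterwards.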


Here's a tangentially related fun fact! Before you take the uniform bar examination in New York, you first have to take an at-home section called the New York Law Examination. There is a book[1] that covers all the New York specific law that could be on the examination. It used to be provided as a simple PDF, where you could potentially search it, but people seemed to feel it made the test too easy - since everything was in that book. So they made it an image PDF.

Then, they put a warning on their site saying that they explicitly consider it to be misconduct to search the book (e.g., by making the book searchable with OCR):

> The NYLC/NYLE Course Materials are locked in a non-searchable format in accordance with the Board’s misconduct rule prohibiting candidates from electronically searching the Course Materials when taking the NYLE. If a candidate, because of a disability, uses a screen reader to access written material, please contact the Board office by phone (518-453-5990), mail, or fax.

Kind of silly, sure! They don't want you searching the book during the exam, but they're fine with you going through it. And, after passing the bar exam, there is a character and fitness process, so students are fairly terrified of doing anything unethical, particularly things the bar explicitly says are unethical. So, it's basically an honor system, but with a big stick (although I haven't heard of any actual enforcement). If you OCR it, brag about it to your friends, and your friends really hate you, I guess they could report it.

[1]https://www.nybarexam.org/Content/NewYorkCourseMaterials.pdf

As mentioned in your quote, there’s a special version intended for disabled candidates, with larger text and searchable content so as to be compatible with screen readers. That version is freely available on their website (not publicly linked, but indexed by search engines).
On an even more macro level I've had a great experience with ripgrep-all[0], which uses Tesseract internally.

For example, I have a directory with all the weekly lecture slides for one lecture, and can directly find where (both file and page) we learned something related to photosynthesis via `rga photosynthesis`.

[0]: https://github.com/phiresky/ripgrep-all

rga is one of my favorite tools ever.
I've read books that were scanned with OCR. You learn to deal with typical errors that occur, but overall the texts were mostly accurate.