Hacker News
by Iwillgetby 2301 days ago
If you upload a PDF to Google Drive and download it 10 minutes later, it will magically have BY FAR the best OCR results embedded in the PDF. Note that my test PDFs were fairly clean, so your experience may not be the same.

I have used Google's fine OCR results to simulate a hacker.

- Download a YouTube video that shows how to attack a server on the website hackthebox.eu

- Run ffmpeg to convert the video to images.

- Run a JPEG-to-PDF tool.

- Upload the PDF to Google Drive.

- Download the PDF from Google Drive.

- Grep for the command-line prompt markers "$" and "#".

- Connect to the hackthebox.eu VPN.

- Attack the same machine in the video.
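
For anyone who wants to try it, here is a rough sketch of the ffmpeg and grep steps in Python. It is not the exact tooling used above: the function names, the one-frame-per-second rate, and the output file pattern are all assumptions for illustration, and the prompt matching is a loose heuristic that will produce false positives on noisy OCR text.

```python
import re

# Heuristic: a prompt marker ("$" or "#") followed by whitespace and a command.
# Anything before the marker (user@host:path) is ignored.
PROMPT_RE = re.compile(r"^.*?[$#]\s+(?P<command>\S.*)$")


def ffmpeg_frame_command(video_path, out_pattern="frame_%05d.jpg", fps=1):
    """Build (but do not run) an ffmpeg command that dumps frames as JPEGs.

    The fps value and file pattern are illustrative assumptions.
    """
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", out_pattern]


def extract_commands(ocr_text):
    """Return the command part of every line that looks like a shell prompt."""
    commands = []
    for line in ocr_text.splitlines():
        match = PROMPT_RE.match(line.strip())
        if match:
            commands.append(match.group("command").strip())
    return commands


if __name__ == "__main__":
    sample = "root@kali:~# nmap -sV 10.10.10.5\nStarting Nmap 7.80\n"
    print(ffmpeg_frame_command("walkthrough.mp4"))
    print(extract_commands(sample))
```

Pass the list from `ffmpeg_frame_command` to `subprocess.run` to actually extract frames. The text fed to `extract_commands` would come from whatever tool you use to pull the text layer out of the downloaded PDF.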

4 comments

Right? I love the OCR for Google Drive. It's such a useful, hidden feature.

By the way, why do you wait 10 minutes? Is there a signal that the PDF is done processing?

Or is there just some kind of voodoo magic that seems to happen that just takes 10 minutes to do?

2 minutes is probably long enough. I did notice that Google Drive doesn't seem to like it if you upload a lot of files. I have had files sit and never get OCR'd, but I forgot about them, so they may have OCR on them now.

Also, I am not aware of a signal when it is done.
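
Since there is no known "done" signal, one workaround is to poll with a timeout instead of waiting a fixed 10 minutes. This is only a minimal sketch: `wait_until` is a hypothetical helper, and the `check` callable is something you would have to write yourself (for example, re-download the file and test whether any text can be extracted from it). The timeout and interval defaults are guesses, not measured values.

```python
import time


def wait_until(check, timeout_s=600.0, interval_s=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout expires.

    Returns True if check() succeeded, False if the deadline passed first.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Usage would look like `wait_until(lambda: pdf_has_text("scan.pdf"))`, where `pdf_has_text` is a placeholder for your own check.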

You've got to love modern software. It may do it or not. It may do it within an unknowable timeframe. But if it does it, it’s wonderful.

Google Drive can directly OCR a JPEG or any other image. Just upload it and open it with Google Docs.

Now that I think about it, I don't know what you mean by "upload a pdf to google drive and download it 10 minutes later".

Uploading and downloading a file shouldn't change it at all, at the bit level.

>- Run a jpeg to pdf tool.

ImageMagick: convert *.jpg out.pdf
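
If you are scripting this step, a small sketch like the one below builds the same command with the frames in sorted order. The function name and directory layout are assumptions. On ImageMagick 7 the binary is `magick`, with `convert` kept as a legacy name, so adjust the first element to match your install.

```python
import glob
import os


def jpegs_to_pdf_command(directory, output="out.pdf", binary="convert"):
    """Build (but do not run) an ImageMagick command joining *.jpg into a PDF.

    Files are sorted by name, so zero-padded frame names stay in video order.
    """
    images = sorted(glob.glob(os.path.join(directory, "*.jpg")))
    if not images:
        raise FileNotFoundError(f"no .jpg files found in {directory!r}")
    return [binary, *images, output]
```

Pass the returned list to `subprocess.run(..., check=True)` to actually produce the PDF.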

This solution is absolutely beautiful