by lxgr 203 days ago
Worse: scrapers that care enough will probably just take a screenshot with a headless browser and OCR it.
2 comments

When building a mini corporate-filings digest generator, I very quickly switched to running Tesseract instead of reading the text (selection) layer in the PDF.

Unfortunately it is the most reliable way to get readable text out...

It also guards against prompt injection via white text, eh?
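
For anyone curious what the OCR route looks like, here is a rough sketch, not the poster's actual code. It assumes the third-party `pdf2image` and `pytesseract` packages (plus the poppler and tesseract binaries) are installed; the function name `ocr_pdf_pages` and the 300 dpi default are made up for illustration.

```python
from typing import List


def ocr_pdf_pages(pdf_path: str, dpi: int = 300) -> List[str]:
    """Render each PDF page to an image and OCR it, ignoring the embedded text layer."""
    # Imported lazily: both are third-party packages (an assumption of this
    # sketch) and they need the poppler and tesseract binaries on the system.
    from pdf2image import convert_from_path
    import pytesseract

    # Rasterize every page; the result is a list of PIL images.
    pages = convert_from_path(pdf_path, dpi=dpi)

    # OCR only sees rendered pixels, so text drawn white-on-white is
    # unlikely to come through.
    return [pytesseract.image_to_string(page) for page in pages]
```

Something like `"\n".join(ocr_pdf_pages("filing.pdf"))` would then give the whole document as one string (`filing.pdf` being a placeholder path).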

Or they'll just strip those Unicode characters out of the text. Automation is trivial.
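
A minimal sketch of how little work that stripping takes, assuming "those Unicode characters" means invisible format characters such as zero-width spaces (Unicode general category `Cf`). The helper name `strip_invisible` is hypothetical.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Drop every code point whose Unicode general category is Cf (format)."""
    # Cf covers zero-width space/joiners, the BOM, soft hyphen, tag characters, etc.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


cleaned = strip_invisible("pay\u200bload\ufeff")
```

One caveat: dropping all of `Cf` also removes zero-width joiners that emoji sequences and some scripts legitimately rely on, so a scraper that cares about fidelity might use a narrower list.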