Hacker News
by blueboo 1690 days ago
Even if your DOM is obfuscated, the rendered page remains vulnerable to OCR. Obfuscate the rendered pixels and you’ll annoy your humans and eventually find that the scrapers’ OCR is superhuman.

Still, maybe AI comes into it. Maybe the right move is to poison the data, but only for clients that ML-juiced anomaly detection flags as scrapers.

1 comment

PDFs and print newspapers are still a massive pain in the ass to OCR accurately.