| HN Mirror

	by tlrobinson 3077 days ago
	All of which is defeated by OCR.

2 comments

eastendguy 3077 days ago

Good point. OCR-powered web scraping is even available out of the box nowadays.

https://a9t9.com/kantu/docs/scraping#ocr

j_coder 3077 days ago

It is not the OCR that is costly. It is the JavaScript execution needed to render the page so you can run OCR on it. The site can even increase the JavaScript execution cost if a client looks suspicious.

You will also have to automate every page variation and the traditional challenges (login, captcha, user behavior fingerprinting, ...).

In the end, the development time and server costs will put you out of business if you depend too heavily on the information, or you will start to lose money every time you scrape.
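
One possible reading of "increase the JavaScript execution cost if suspicious" is a hashcash-style proof-of-work challenge whose difficulty grows with a suspicion score. The comment does not say which mechanism is meant, so this is only a sketch under that assumption. It is written in Python for brevity (in practice the solver would run as JavaScript in the visitor's browser), and the function names, the suspicion score, and the bit counts are all hypothetical.

```python
import hashlib
import itertools


def difficulty_bits(suspicion: float, base: int = 8, extra: int = 12) -> int:
    """Map a suspicion score in [0, 1] to a required number of leading zero bits.

    The score itself is an assumption: it would come from whatever
    fingerprinting or rate heuristics the site already has.
    """
    clamped = min(max(suspicion, 0.0), 1.0)
    return base + round(clamped * extra)


def has_leading_zero_bits(digest: bytes, bits: int) -> bool:
    """Return True if the digest starts with at least `bits` zero bits."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0


def solve(challenge: bytes, bits: int) -> int:
    """Client side: brute-force a nonce. Expected work is about 2**bits hashes."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if has_leading_zero_bits(digest, bits):
            return nonce


def verify(challenge: bytes, nonce: int, bits: int) -> bool:
    """Server side: checking a solution costs a single hash."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return has_leading_zero_bits(digest, bits)


if __name__ == "__main__":
    challenge = b"per-request-random-token"  # hypothetical; a real server would randomize it
    bits = difficulty_bits(0.25)  # a mildly suspicious client
    nonce = solve(challenge, bits)
    print(bits, nonce, verify(challenge, nonce, bits))
```

Each extra required bit doubles the expected work for the client, while verification stays at one hash for the server. That asymmetry is what would make every scraped page more expensive for the scraper than for the site.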

j_coder 3077 days ago

Yes. The idea here is to make you dependent on OCR (you also have to find where the information is as the page design changes) and to waste a lot of your server resources, making it very costly to scrape.
