| HN Mirror


by j_coder 3075 days ago

It is "easy" to block scraping. Make it very costly to scrape:

- Render your page using canvas and WebAssembly compiled from C, C++, or Rust. Create your own text rendering function.

- Have multiple page layouts

- Have multiple compiled versions of your code (change function names, introduce useless code, use different implementations of the same function) so it is very difficult to reverse engineer, fingerprint and patch.

- Try to prevent debugging by monitoring the time interval between function calls, and compare the interval measured locally with the interval measured on the server to detect sandboxes.

- Always encrypt data from the server, using a different encryption mechanism each time.

- Hide the decryption key in random locations of your code (generate multiple versions of the code that retrieves the key).

- Create huge objects in memory and consume a lot of CPU (you may mine some crypto coins) for a brief period of time (10s) on the user's first visit. Make it very expensive for the scrapers to run their servers. Save an encrypted cookie to avoid doing it again later. Monitor concurrent requests from the same cookie.
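To make the last point concrete, here is a minimal sketch of a first-visit CPU challenge, assuming a hashcash-style proof of work (my choice for illustration, not something the list prescribes). The names `solve`, `verify`, `issue_cookie` and `SECRET` are hypothetical, and the cookie is HMAC-signed rather than encrypted to keep the sketch within the standard library.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(32)  # hypothetical per-deployment server secret


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce (this is the expensive part)."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, so checking stays cheap."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def issue_cookie(visitor_id: str) -> str:
    """Sign the visitor id so the work is not repeated on later visits."""
    tag = hmac.new(SECRET, visitor_id.encode(), hashlib.sha256).hexdigest()
    return f"{visitor_id}.{tag}"


def cookie_is_valid(cookie: str) -> bool:
    """Recompute the signature and compare in constant time."""
    visitor_id, _, tag = cookie.rpartition(".")
    expected = hmac.new(SECRET, visitor_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

The asymmetry is the point: the visitor (or scraper) pays for many hashes, while the server pays for one. In a real page the solver would run in the browser, not in Python.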

The answer is that it is possible but it will cost you a lot.

1 comments

tlrobinson 3075 days ago

All of which is defeated by OCR.


eastendguy 3075 days ago

Good point. OCR-powered web scraping is even available out of the box nowadays.

https://a9t9.com/kantu/docs/scraping#ocr


j_coder 3075 days ago

It is not the OCR that is costly. It is the JavaScript execution needed to render the page before you can run the OCR. You can even increase the JavaScript execution cost when a client looks suspicious.
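One way to picture "increase the cost if suspicious" is a server-side function that raises the challenge difficulty as heuristics fire. This is only a sketch: the `Signals` fields, the thresholds and the bit counts are illustrative guesses, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    client_elapsed_ms: float  # interval measured by the page's own clock
    server_elapsed_ms: float  # same interval measured between requests on the server
    concurrent_requests: int  # in-flight requests sharing one cookie


def suspicion(s: Signals, max_drift: float = 0.5, max_concurrent: int = 4) -> int:
    """Count how many heuristics fire; thresholds are illustrative guesses."""
    score = 0
    # Large disagreement between the two clocks hints at a debugger or sandbox.
    drift = abs(s.client_elapsed_ms - s.server_elapsed_ms) / max(s.server_elapsed_ms, 1.0)
    if drift > max_drift:
        score += 1
    # Many parallel requests on one cookie hints at a shared, automated session.
    if s.concurrent_requests > max_concurrent:
        score += 1
    return score


def difficulty_bits(s: Signals, base: int = 16, step: int = 4, cap: int = 28) -> int:
    """Each extra bit roughly doubles the expected client-side work."""
    return min(base + step * suspicion(s), cap)
```

A normal visitor stays at the base difficulty, while a client that trips both heuristics gets a challenge that is about 256 times more expensive with these example numbers.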

You will also have to automate all page variations and the traditional challenges (login, captcha, user behavior fingerprinting, ...).

In the end, the development time, development cost and server cost will put you out of business if you are too dependent on the information, or you will start to lose money every time you scrape.


j_coder 3075 days ago

Yes. The idea here is to make you dependent on OCR (you also have to find where the information is as the page design changes) and to waste a lot of your server resources, making it very costly to scrape.
