Hacker News new | ask | show | jobs
by sangnoir 531 days ago
Speaking as a person who has played on both offense and defense: this is a heuristic that's not used frequently enough by defenders. Clients that load a single HTML/JSON endpoint without loading css or image resources associated with the endpoints are likely bots (or user agents with a fully loaded cache, but defenders control what gets cached by legit clients and how). Bot data thriftiness is a huge signal.
2 comments

Even legitimate users might want to disable CSS and pictures and whatever, and I often do when I just want to read the document.

Blind users also might have no use for the pictures, and another possibility is if the document is longer than the screen so the picture is out of view then the user might program the client software to use lazy loading, etc.

Indeed, that's why it's one heuristic/signal among many others
As a high load system engineer you'd want to offload asset serving to CDN which makes detection slightly more complicated. The easy way is to attach an image onload handler with client js, but that would give a high yield of false positives. I personally have never seen such approach and doubt its useful for many concerns.
Unless organization policy forces you to, you do not have to put all resources behind a CDN. As a matter of fact, getting this heuristic to work requires a non-optimal caching strategy of one or more real or decoy resources - CDN or not. "Easy" is not an option for the bot/anti-bot arms race, all the low hanging fruit is now gone when fighting a determined adversary on either end.

> I personally have never seen such approach and doubt its useful for many concerns.

It's an arms race and defenders are not keen on sharing their secret sauce, though I can't be the only one who thought of this rather basic bot characteristic, multiple abuse trams probably realized this decades ago. It works pretty well against the low-resource scrapers with fakes UA strings and all the right TLS handshakes. It won't work against the headless browsers that costs scrapers more in resources and bandwidth, and there are specific countermeasures for headless browsers [1], and counter-countermeasures. It's a cat and mouse game.

1. e.g. Mouse movement, as made famous as ine signal evaluated by Google's reCAPTCHA v2, monitor resolution & window size and position, and Canvas rendering, all if which have been gradually degraded by browser anti-fingerprinting efforts. The bot war is fought on the long tail.