| Do you know how/why it is such a hard problem? I assume you have: * Browser data (UAS, screen size, network speed, localStorage) * IP data (therefore some proxy of geographic data) * Third party data (eg. Google Analytics demographic data) * Odd click patterns (eg. from the same IP, bursts within a short window) * Finally you can see who is benefiting from the clicks (eg. certain publishers) and suspend their account I feel like all this data would generate a substantial click "footprint" that you could run through an ML model. At worst, these third-party advertising companies can suspend whoever is benefiting from the clicks if they gather enough suspicious evidence. |
Behavior tracking is difficult because it's hard to say that a legitimate user will never do a given thing. E.g. large ISP NATs thwart IP tracking by giving many customers the same IP, and Safari blocks third-party cookies.
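To make the NAT problem concrete, here is a minimal sketch of the "bursts from the same IP" heuristic. The function name, the 60-second window, the 5-click threshold, and the sample addresses are all made up for illustration; this is not how any real ad network does it.

```python
from collections import defaultdict


def flag_bursty_ips(clicks, window_s=60, max_clicks=5):
    """Return IPs with more than max_clicks clicks inside any window_s-second window.

    clicks: iterable of (timestamp_seconds, ip) tuples.
    Thresholds are hypothetical, chosen only for this example.
    """
    by_ip = defaultdict(list)
    for ts, ip in clicks:
        by_ip[ip].append(ts)

    flagged = set()
    for ip, times in by_ip.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Slide the left edge forward until the window spans at most window_s seconds.
            while t - times[start] > window_s:
                start += 1
            if end - start + 1 > max_clicks:
                flagged.add(ip)
                break
    return flagged


# One bot clicking ten times in ten seconds from its own address.
bot = [(i, "203.0.113.7") for i in range(10)]
# Ten different legitimate customers behind one ISP NAT address, one click each.
nat_users = [(i * 3, "198.51.100.1") for i in range(10)]
# One ordinary user on their own address.
solo = [(5, "192.0.2.44")]

flagged = flag_bursty_ips(bot + nat_users + solo)
print(sorted(flagged))
```

The bot and the NAT address are indistinguishable to this rule, so it flags both. Loosening the threshold to spare the NAT users also spares any bot that clicks at the same rate.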
Google has a fairly well-known bot countermeasure called BotGuard that does a decent job of proving you are probably running a full browser, but that only raises the cost of fraud marginally, to running one browser instance per bot. Raising the per-impression cost for fraudsters can put them out of business, but raising the per-impression cost of detecting fraudsters can put advertisers out of business.
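The cost tradeoff is easy to see with some arithmetic. The sketch below uses entirely invented dollar figures per thousand impressions, and the `PerThousand` class and its field names are hypothetical. It only illustrates the shape of the argument, not real market numbers.

```python
from dataclasses import dataclass


@dataclass
class PerThousand:
    """Hypothetical dollar amounts per 1000 impressions."""
    payout: float             # what the fraudulent publisher gets paid
    bot_cost: float           # what it costs the fraudster to generate the traffic
    advertiser_margin: float  # advertiser's margin before paying for detection
    detection_cost: float     # cost of screening the traffic

    def fraud_profit(self) -> float:
        return self.payout - self.bot_cost

    def advertiser_profit(self) -> float:
        return self.advertiser_margin - self.detection_cost


# Cheap scripted bots, light screening.
script_bot = PerThousand(payout=1.00, bot_cost=0.02,
                         advertiser_margin=0.30, detection_cost=0.10)
# A check that forces a full browser per bot: fraud costs 10x more but still pays.
browser_bot = PerThousand(payout=1.00, bot_cost=0.20,
                          advertiser_margin=0.30, detection_cost=0.10)
# Checks heavy enough to make fraud unprofitable, but too expensive to run.
heavy_checks = PerThousand(payout=1.00, bot_cost=1.50,
                           advertiser_margin=0.30, detection_cost=0.40)

for name, s in [("script bot", script_bot),
                ("browser bot", browser_bot),
                ("heavy checks", heavy_checks)]:
    print(f"{name}: fraud {s.fraud_profit():+.2f}, "
          f"advertiser {s.advertiser_profit():+.2f}")
```

With these made-up numbers, forcing a full browser leaves fraud comfortably profitable, while the scenario that finally makes fraud unprofitable also pushes the advertiser's side negative.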
Also, ad targeting is often a realtime problem. You have to decide what ad, if any, to show within milliseconds. Do you never show ads to unrecognized users? How much turnaround time will you need before you can precompute a profile and start showing ads to a legitimate user? How much turnaround time do you need for detecting and blocking fraud?
Unfortunately, specific countermeasures are rarely published, since one of the greatest costs of committing ad fraud is figuring out and then circumventing the countermeasures. E.g. you might have a hard time reverse engineering something faster than 20 people at Google are engineering it.