by alexpetralia 2625 days ago

Do you know how/why it is such a hard problem?

I assume you have:

* Browser data (user-agent string, screen size, network speed, localStorage)

* IP data (therefore some proxy of geographic data)

* Third party data (eg. Google Analytics demographic data)

* Odd click patterns (eg. from the same IP, bursts within a short window)

* Finally, you can see who is benefiting from the clicks (e.g. certain publishers) and suspend their accounts

I feel like all this data would generate a substantial click "footprint" that you could run through an ML model. At worst, these third-party advertising companies can suspend whoever is benefiting from the clicks if they gather enough suspicious evidence.
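
To make the "odd click patterns" signal concrete, here is a minimal sketch of one such feature: flagging IPs that produce a burst of clicks within a short window. The function name `flag_click_bursts`, the 10-second window, and the threshold of 5 clicks are all assumptions for illustration, not anything an ad network is known to use, and a real system would feed a feature like this into a model rather than act on it alone.

```python
from collections import defaultdict, deque

def flag_click_bursts(clicks, window_s=10.0, max_clicks=5):
    """Return the set of IPs with more than max_clicks clicks inside any
    window_s-second span.

    clicks: iterable of (timestamp_seconds, ip) tuples sorted by timestamp.
    window_s and max_clicks are illustrative thresholds (assumptions).
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in clicks:
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - window[0] > window_s:
            window.popleft()
        if len(window) > max_clicks:
            flagged.add(ip)
    return flagged

clicks = sorted(
    [(float(t), "203.0.113.7") for t in range(6)]       # 6 clicks in 5 seconds
    + [(0.0, "198.51.100.2"), (100.0, "198.51.100.2")]  # 2 clicks, far apart
)
print(flag_click_bursts(clicks))
```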

3 comments

singron 2625 days ago

There are multiple types of fraud. One is bots that generate fake impressions; another is fraudulent publishers that place ads improperly (e.g. overlapping or invisible ads). In the second type, the user is legitimate, so you can't rely entirely on something that identifies illegitimate users. I think this is one reason why ads aren't always sandboxed in iframes: you need a way to detect whether the ad is actually visible in the root frame.
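
As a rough illustration of the placement checks this implies, the sketch below computes how much of an ad falls inside the viewport and which ads are stacked on top of each other, using plain rectangle geometry. The `Rect` type, the function names, and the 50% overlap threshold are hypothetical; in a browser the rectangles would have to come from the root frame, which is exactly what a sandboxed iframe cannot see.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float

def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    w = min(a.left + a.width, b.left + b.width) - max(a.left, b.left)
    h = min(a.top + a.height, b.top + b.height) - max(a.top, b.top)
    return max(w, 0.0) * max(h, 0.0)

def in_view_fraction(ad, viewport):
    """Fraction of the ad's area that lies inside the viewport (0.0 to 1.0)."""
    area = ad.width * ad.height
    if area <= 0:
        return 0.0  # a zero-size ad is effectively invisible
    return overlap_area(ad, viewport) / area

def stacked_pairs(ads, min_overlap=0.5):
    """Index pairs of ads whose overlap covers at least min_overlap of the
    smaller ad. The 0.5 threshold is an assumption for illustration."""
    pairs = []
    for i in range(len(ads)):
        for j in range(i + 1, len(ads)):
            smaller = min(ads[i].width * ads[i].height,
                          ads[j].width * ads[j].height)
            if smaller > 0 and overlap_area(ads[i], ads[j]) / smaller >= min_overlap:
                pairs.append((i, j))
    return pairs

viewport = Rect(0, 0, 1000, 800)
ads = [Rect(900, 700, 200, 200), Rect(10, 10, 300, 250), Rect(10, 10, 300, 250)]
print([in_view_fraction(ad, viewport) for ad in ads])
print(stacked_pairs(ads))
```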

Behavior tracking is difficult because it's hard to say that a legitimate user will never do something. E.g. large ISP NATs thwart IP tracking by giving many customers the same IP, and Safari blocks third-party cookies.

Google has a somewhat well-known bot countermeasure called BotGuard that does a decent job of proving that you are probably running an entire browser, but that only marginally increases the cost of fraud, to that of running a browser instance per bot. Increasing per-impression cost for fraudsters can put them out of business, but increasing per-impression cost to detect fraudsters can put advertisers out of business.

Also, ad-targeting is often a realtime problem. You have to decide what ad, if any, to show within milliseconds. Do you never show ads to unrecognized users? How much turnaround time will you need before you can precompute a profile and start showing ads to a legitimate user? How much turnaround time do you need for detecting and blocking fraud?

Unfortunately, specific countermeasures are rarely published, since one of the greatest costs of ad fraud is figuring out and then circumventing countermeasures. E.g. you might have a hard time reverse engineering something faster than it's being engineered by 20 people at Google.

rightbyte 2625 days ago

"Also, ad-targeting is often a realtime problem."

Surely Google are caching a queue of ads for each user, and similarly for a "random unknown user"? Why would this have to be real time?

aggronn 2625 days ago

Programmatic advertising is 100% a real-time, per-request bidding process. There is no queue of ads. Virtually all banner advertising on the web is now done this way.
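
A toy sketch of what "per-request bidding" means in practice: each ad request collects bids, drops any that miss a response deadline or fall below a floor price, and picks the highest remaining bid. The `Bid` type, `run_auction`, the 100 ms deadline, and the rule that the highest bid simply wins are assumptions for illustration; real exchanges differ in pricing rules and many other details.

```python
from typing import NamedTuple, Optional, Sequence

class Bid(NamedTuple):
    bidder: str
    cpm: float         # price offered per thousand impressions
    latency_ms: float  # how long the bidder took to respond

def run_auction(bids: Sequence[Bid],
                deadline_ms: float = 100.0,
                floor_cpm: float = 0.0) -> Optional[Bid]:
    """Return the highest eligible bid for one ad request, or None.

    A bid is eligible only if it arrived before the deadline and meets the
    floor price. Both parameters are hypothetical defaults.
    """
    eligible = [b for b in bids
                if b.latency_ms <= deadline_ms and b.cpm >= floor_cpm]
    if not eligible:
        return None  # no ad is shown for this request
    return max(eligible, key=lambda b: b.cpm)

bids = [
    Bid("dsp_a", cpm=2.0, latency_ms=40),
    Bid("dsp_b", cpm=5.0, latency_ms=250),  # highest price, but too slow
    Bid("dsp_c", cpm=3.0, latency_ms=80),
]
print(run_auction(bids))
```

The slow bidder losing despite the best price is the point: any fraud check that runs inside this loop has to fit in the same time budget.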

Macha 2625 days ago

Just because it's Google's code on the publisher page doesn't mean it's a Google customer's ad that shows up on the page. It's entirely possible a third party is willing to pay more than any of Google's own customers, so the impression is auctioned off both to Google's customers and to Google's partners (who auction it among their own customers).

Also, advertisers often want to do dynamic stuff too. Or they may be willing to pay more for the same user in different contexts. Or be utterly unwilling to have their ad on sites with user-generated content (UGC). And you don't know where the user will show up next.

ggthrowaway12 2625 days ago

I won't go into details, but you seem to be assuming specific, relatively unsophisticated methods. Also, not everything you mention is available or useful, and it's not close to enough for more sophisticated fraud.[1][2] Keep in mind that most ads are paid on a per-impression basis; the main reason to simulate clicks is that at some point people will notice if a specific site consumes a bunch of impressions but doesn't contribute any clicks. Ad-tech companies tend to be competent in ML since it's necessary for optimization, but fraud remains a hard problem.

[1] https://www.buzzfeednews.com/article/craigsilverman/porn-run...

[2] https://clearcode.cc/blog/rtb-online-advertising-fraud/

anbop 2625 days ago

As soon as you have the ML model you have the method to train a fraud bot. Just keep iterating on it until it fools the model.
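
The iteration described here can be shown on a toy example. Below, `detector_score` is a made-up stand-in for a fraud model (it is not any real detector), and `evade` is a naive search that only queries the score and nudges two behaviour parameters until the score drops under the threshold. Every name, feature, and number is an assumption; the sketch only shows why exposing a model's output gives an adversary something to optimize against.

```python
def detector_score(rate_per_min, jitter):
    """Toy fraud score in [0, 1]; higher means more suspicious.

    Assumed features: click rate per minute and timing jitter (0 = perfectly
    regular intervals, 1 = very irregular).
    """
    rate_term = min(rate_per_min / 60.0, 1.0)  # fast clicking looks bot-like
    jitter_term = max(0.0, 1.0 - jitter)       # regular timing looks bot-like
    return 0.5 * rate_term + 0.5 * jitter_term

def evade(score_fn, rate=120.0, jitter=0.0, threshold=0.5, max_queries=1000):
    """Greedy black-box search: query the score, keep whichever small change
    scores no worse, and stop once the score is below the threshold.

    Returns (rate, jitter, queries_used), or None if the budget runs out.
    """
    queries = 0
    while queries < max_queries:
        queries += 1
        if score_fn(rate, jitter) < threshold:
            return rate, jitter, queries
        # Two candidate tweaks: click a bit slower, or add a bit more jitter.
        candidates = [(rate * 0.9, jitter), (rate, min(jitter + 0.1, 1.0))]
        queries += len(candidates)
        rate, jitter = min(candidates, key=lambda c: score_fn(*c))
    return None

result = evade(detector_score)
print(result)
```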
