by pradn 750 days ago
What's fascinating here is that AdFlush is a classical feature engineering approach: define a bunch of features on the data manually, and then use ML to figure out the most useful / impactful ones. This is not the "throw terabytes of data and see what happens" approach we see with LLMs. It's a bit funny to even point this out because I don't recall the last time a feature-engineered ML project made it to the HN front page.

Features can be brittle, but they are understandable. The paper's appendix [1] lists the 27 features used to decide whether a request/resource is likely "ad-related". These include interesting ones like JS AST depth, average JS identifier length, the "bracket to dot notation ratio in JS", and a number of graph measures for the graph of scripts.
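To give a feel for what features like these look like in practice, here is a rough sketch in Python. It is not the paper's extraction code: it uses regexes instead of a real JS parser, substitutes maximum bracket nesting for true AST depth, and the function name `crude_js_features` and the keyword list are my own assumptions.

```python
import re

# Small, incomplete keyword list (assumption) so keywords don't count as identifiers.
JS_KEYWORDS = {"var", "let", "const", "function", "return", "if", "else",
               "for", "while", "new", "this", "typeof", "null", "true", "false"}

_STRING = re.compile(r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'')
_IDENT = re.compile(r'(?<![\w$])[A-Za-z_$][\w$]*')
_BRACKET_ACCESS = re.compile(r'[\w$\])]\s*\[')   # e.g. obj[ ... or ][ ...
_DOT_ACCESS = re.compile(r'\.\s*[A-Za-z_$]')      # e.g. obj.prop


def crude_js_features(source: str) -> dict:
    """Approximate three script features without parsing the JS."""
    # Blank out string literals so their contents are not counted as code.
    code = _STRING.sub('""', source)

    idents = [t for t in _IDENT.findall(code) if t not in JS_KEYWORDS]
    avg_len = sum(len(t) for t in idents) / len(idents) if idents else 0.0

    brackets = len(_BRACKET_ACCESS.findall(code))
    dots = len(_DOT_ACCESS.findall(code))

    # Stand-in for AST depth: deepest nesting of (), [] and {}.
    depth = max_depth = 0
    for ch in code:
        if ch in "([{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch in ")]}":
            depth = max(depth - 1, 0)

    return {
        "avg_identifier_length": avg_len,
        "bracket_to_dot_ratio": brackets / max(dots, 1),
        "max_nesting_depth": max_depth,
    }


sample = 'var a = window["document"]["cookie"]; foo.bar(baz[0]);'
print(crude_js_features(sample))
```

Obfuscated ad scripts tend to lean on bracket access like `window["document"]`, which is presumably why a ratio like this carries signal. The regexes here will miscount things like array literals after `return`, so treat the numbers as illustrative only.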

And contrary to what comments in this thread are saying, they do compare against a blocklist-based adblocker: uBlock Origin. That's in section 5.5. They say they outperform uBlock Origin. But even they say they don't reduce overall page load time, because their algorithm is expensive.

[1]: https://dl.acm.org/doi/pdf/10.1145/3589334.3645698


More specifically, page load time was 2.7 seconds without an adblocker, decreased to 2.1 seconds with uBlock Origin, but increased to 6.6 seconds (about 2.4x the baseline) with AdFlush, or to 3.4 seconds with AdFlush retaining prior predictions.
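For anyone who wants to check the relative changes, here is a quick calculation using only the four load times quoted above. The helper name `percent_change` is just for illustration.

```python
def percent_change(baseline: float, value: float) -> float:
    """Relative change from baseline, in percent (negative means faster)."""
    return (value - baseline) / baseline * 100.0


BASELINE_S = 2.7  # page load time with no adblocker, in seconds

timings_s = {
    "uBlock Origin": 2.1,
    "AdFlush": 6.6,
    "AdFlush, retaining prior predictions": 3.4,
}

for name, seconds in timings_s.items():
    ratio = seconds / BASELINE_S
    change = percent_change(BASELINE_S, seconds)
    print(f"{name}: {ratio:.2f}x baseline ({change:+.0f}%)")
```

So 6.6 seconds is roughly 2.4x the no-adblocker time, and even with prior predictions retained the page is still about a quarter slower than with no blocker at all.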

The superior score was an F1 of 0.86 vs 0.84 for AdFlush vs uBlock Origin, and it's not clear to me that this is a statistically significant difference. They do not claim it is.

That seems to argue for a first pass with a blocklist to filter out the well-known ad providers, and then possibly a followup step with the ML to catch things that are trying harder to slip by? But the extensions would have to cooperate to make that possible.

Thanks for extracting the details. It doesn't seem like they'll be competitive with blocklist-based approaches like uBlock Origin, because their features are fundamentally expensive to compute - parsing JS and such, not just matching URLs against a list of regexes.

Seems like it could work in the background to build up new rules for uBlock Origin to deploy.
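
The blocklist-first, ML-second idea from this subthread could be sketched roughly like this. Everything here is hypothetical: the `TwoPassFilter` class, the URL-only classifier callback (AdFlush itself looks at script contents, not just URLs), and the example domains. It is not how uBlock Origin or AdFlush are actually wired together.

```python
import re
from typing import Callable, Dict, List, Set
from urllib.parse import urlsplit


class TwoPassFilter:
    """Cheap regex blocklist first; expensive classifier only on misses."""

    def __init__(self, blocklist: List[str], classifier: Callable[[str], bool]):
        self._patterns = [re.compile(p) for p in blocklist]
        self._classifier = classifier           # stand-in for the ML model
        self._cache: Dict[str, bool] = {}       # retained prior predictions
        self.candidate_hosts: Set[str] = set()  # possible new blocklist rules

    def should_block(self, url: str) -> bool:
        # Pass 1: well-known ad providers are caught without running the model.
        if any(p.search(url) for p in self._patterns):
            return True
        # Pass 2: run the classifier once per URL and remember the verdict.
        if url not in self._cache:
            verdict = self._classifier(url)
            self._cache[url] = verdict
            if verdict:
                host = urlsplit(url).hostname
                if host:
                    self.candidate_hosts.add(host)
        return self._cache[url]


# Toy classifier (assumption): flags any URL containing "tracker".
flt = TwoPassFilter([r"ads\.example\.com"], lambda url: "tracker" in url)
print(flt.should_block("https://ads.example.com/banner.js"))
print(flt.should_block("https://cdn.example.org/tracker.js"))
print(sorted(flt.candidate_hosts))
```

The `candidate_hosts` set is the "build up new rules in the background" part: hosts the model flags could be reviewed and promoted to the blocklist, so the expensive path runs less often over time.
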
I like the strategy of using flags to say "look into this suspicious part of the code" over a hardcoded block list. And also block shitty JS via "JS AST depth, average JS identifier length" etc even if it's not an ad but just bad code.

For Brave browser users, you can see what hardcoded lists you're using at brave://adblock .

As for the whole cat and mouse game, how to detect an "ad" if it's served with the content fully server-side? Now _that_ needs some serious ML to decipher.

> how to detect an "ad" if it's served with the content fully server-side? Now _that_ needs some serious ML to decipher.

This has been my red line on where I will allow ads vs blocking them. If a site is hosting their own ads, that's acceptable to me. If they are using an ad provider, that is not. The newspaper example is my go-to. If you wanted your ad in a paper, you called the paper and took out an ad. Today's equivalent would be, every time you opened the paper, a slight delay while it randomly chose the highest bids for the ad space while potentially also inserting something that would slowly eat your hands. That's a nope.

You are obviously in the camp that feels entitled to read anything at any time without allowing the website to earn money, since you want to block all ads regardless of their origin.

> You are obviously in the camp that feels entitled...

Not at all. I use Brave and "shield down" websites that I like and that generally keep their ad situation under control (incl. 3rd party). But your point about self-hosted vs 3rd-party ads is a good one, especially because one 3rd party often connects to another.

Likewise, I "block" annoying parts of websites like Yahoo Fantasy Football's enormous top nav that's not even an ad.