by hawski 2232 days ago
That's an idea I've had for a few years now. I made a start on it [0], but progress was slow, because of life. I wanted to begin by going through the Common Crawl [1] data, both for testing purposes and to calculate a rough percentage of sites that are uBlock Origin clean.

I think that such sites would be in the ballpark of a few ‰. With so few sites the index stays small, which would enable me to offer the contentless index for download. With delta updates and torrents for distribution it might not be that expensive, but that's something I could charge for.

My intention is to use AdBlock rules like EasyList to check whether or not the page is indeed clean.
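To illustrate the kind of check meant here, below is a minimal sketch, not the abracabra code and not the adblock-rust API. It assumes a toy subset of the filter syntax (only plain `||domain^` rules; options, exceptions and cosmetic filters are skipped), and the helper names `parse_domain_rules`, `host_of`, `is_blocked` and `page_is_clean` are made up for the example.

```rust
/// Parse only plain `||domain^` rules from a filter list.
/// Assumption: everything else (comments, `$options`, wildcards, cosmetic
/// filters) is ignored, so this covers just a small part of a real list.
fn parse_domain_rules(list: &str) -> Vec<String> {
    list.lines()
        .map(str::trim)
        .filter_map(|line| {
            let rest = line.strip_prefix("||")?;
            let domain = rest.strip_suffix('^')?;
            // Reject rules that carry wildcards, paths or other syntax.
            let plain = !domain.is_empty()
                && domain
                    .chars()
                    .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-');
            if plain {
                Some(domain.to_ascii_lowercase())
            } else {
                None
            }
        })
        .collect()
}

/// Hypothetical helper: host part of an absolute or protocol-relative URL.
/// Relative URLs have no host and yield `None`.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .or_else(|| url.strip_prefix("//"))?;
    let end = rest
        .find(|c: char| c == '/' || c == ':' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let host = &rest[..end];
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// True if the URL's host is a rule domain or one of its subdomains.
fn is_blocked(url: &str, rules: &[String]) -> bool {
    let host = match host_of(url) {
        Some(h) => h.to_ascii_lowercase(),
        None => return false,
    };
    rules
        .iter()
        .any(|d| host == *d || host.ends_with(format!(".{}", d).as_str()))
}

/// A page counts as clean when none of its resource URLs match a rule.
fn page_is_clean(resource_urls: &[&str], rules: &[String]) -> bool {
    !resource_urls.iter().any(|u| is_blocked(u, rules))
}
```

A real run would feed every script, image and iframe URL found on a page into something like `page_is_clean`, with a full rule engine doing the matching.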

My initial code in Go is fine, but I've lost enthusiasm for Go lately and career-wise it's not a good fit for me (I don't have much time to learn something that isn't as useful to me professionally). So I started to rewrite it in Rust while learning the language; you can laugh now (Rust Evangelism Strike Force el oh el). Rust has the advantage of a ready-to-use rules parser from Brave [2] and a presumably high-quality tokenizer from html5ever [3].

I want to use a tokenizer instead of a full parser so that I can do stream processing, which brings costs down.
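A rough sketch of why streaming helps: the scanner below only ever holds the tag it is currently reading, so memory is bounded by the longest tag instead of the whole document. It is a toy stand-in, not html5ever. It assumes no comments, no `>` inside attribute values, no special handling of script bodies, and only quoted `src` values. `TagScanner` and `script_src` are names invented for the example.

```rust
/// Toy streaming scanner. Text between tags is dropped immediately and only
/// the characters of the tag in progress are buffered.
struct TagScanner {
    in_tag: bool,
    current: String,
}

impl TagScanner {
    fn new() -> Self {
        TagScanner {
            in_tag: false,
            current: String::new(),
        }
    }

    /// Feed one chunk of the document. Chunks may split a tag anywhere.
    /// The `src` of every completed <script> start tag is pushed to `out`.
    fn feed(&mut self, chunk: &str, out: &mut Vec<String>) {
        for ch in chunk.chars() {
            match (self.in_tag, ch) {
                (false, '<') => {
                    self.in_tag = true;
                    self.current.clear();
                }
                (false, _) => {} // text content is not kept
                (true, '>') => {
                    self.in_tag = false;
                    if let Some(src) = script_src(&self.current) {
                        out.push(src);
                    }
                }
                (true, c) => self.current.push(c),
            }
        }
    }
}

/// Return the quoted `src` value if `tag` (the text between `<` and `>`)
/// is a script start tag. Unquoted values are not handled in this sketch.
fn script_src(tag: &str) -> Option<String> {
    // ASCII lowercasing keeps byte offsets identical to `tag`.
    let lower = tag.to_ascii_lowercase();
    if lower.split_whitespace().next()? != "script" {
        return None;
    }
    // Require whitespace before `src=` so that `data-src=` does not match.
    let pos = lower
        .match_indices("src=")
        .map(|(i, _)| i)
        .find(|&i| i > 0 && lower.as_bytes()[i - 1].is_ascii_whitespace())?;
    let rest = &tag[pos + 4..];
    let quote = rest.chars().next().filter(|c| *c == '"' || *c == '\'')?;
    let end = rest[1..].find(quote)?;
    Some(rest[1..1 + end].to_string())
}
```

Each collected `src` can then go straight to the rule check, and the page body never has to be in memory at once.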

Common Crawl data lives on S3, so the initial processing has to be done on EC2 to keep costs low.

[0] Current Go code: https://github.com/hadrianw/abracabra

[1] https://commoncrawl.org/

[2] https://github.com/brave/adblock-rust

[3] https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.h...

EDIT:

Also, for the search part I want to use something more standalone than Elasticsearch, so that I can offer desktop search over a downloaded index. When I started with Go I wanted to use Bleve [4]; now I'm not sure, but I think that Bleve is getting mature enough. I'll worry about that once I have some data to search through.

One of the challenges with this whole enterprise is that it needs a small amount of JavaScript parsing. There is a common pattern, used for example by Google Analytics, where a snippet of JavaScript inserts the actual script tag. But those snippets are very short, so I think they may not need a full JS VM; maybe even a tokenizer would be good enough. Browser ad blockers don't have this problem, because they rely on the page already executing its JavaScript.
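As a sketch of the "tokenizer might be enough" idea: pull the string literals out of an inline snippet and keep the ones that look like script URLs, without running any JavaScript. This is a toy lexer under stated assumptions: it handles single and double quotes and backslash escapes, but not template literals, comments, regex literals, or URLs assembled by string concatenation. The function names are invented for the example.

```rust
/// Collect the contents of single- and double-quoted string literals.
/// An escaped character is kept as-is, which is enough for `\/` and `\'`
/// but deliberately does not decode sequences such as `\n` or `\u0041`.
fn js_string_literals(src: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut chars = src.chars();
    while let Some(c) = chars.next() {
        if c != '"' && c != '\'' {
            continue;
        }
        let quote = c;
        let mut lit = String::new();
        let mut closed = false;
        while let Some(c) = chars.next() {
            match c {
                '\\' => {
                    if let Some(escaped) = chars.next() {
                        lit.push(escaped);
                    }
                }
                c if c == quote => {
                    closed = true;
                    break;
                }
                c => lit.push(c),
            }
        }
        // Unterminated literals are discarded.
        if closed {
            out.push(lit);
        }
    }
    out
}

/// Keep only literals that look like absolute or protocol-relative URLs.
fn script_url_candidates(src: &str) -> Vec<String> {
    js_string_literals(src)
        .into_iter()
        .filter(|s| s.starts_with("http://") || s.starts_with("https://") || s.starts_with("//"))
        .collect()
}
```

The candidates would then be checked against the same filter rules as ordinary `src` attributes. Snippets that build the URL from pieces would slip through, which is the trade-off of not using a JS VM.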

[4] https://blevesearch.com/