by jlund-molfese 1202 days ago
They have their own index[1]. It's not easy when a bunch of sites block any crawler that isn't Google or Bing. But this is the same strategy Brave seems to be pursuing, where they try to rely more and more on their own indices.

[1] http://teclis.com


> The crawler is hybrid, using async python requests and puppeteer with uBlock Origin. The way detection works is we count the number of uBO blocked requests on the page, and if too many (threshold is set to 5), we kick it out, leaving only "clean" pages in the index.

Fascinating; cnn.com reports 47 blocked requests on the front page, npr.org is at 16, and developer.hashicorp.com is at 9. I don't think that metric is doing what they think it is, or rather maybe they're trying to target only savannah.gnu.org-style sites or something.
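
To make that concrete, here is a rough sketch of the filter as the quote describes it, run against the three counts above. This is not their actual crawler code: the function name `is_clean` is made up for illustration, and I'm assuming "too many" means strictly more than the threshold of 5 (the quote doesn't say whether a page with exactly 5 survives).

```python
# Sketch of the "count uBO-blocked requests, drop the page if too many" rule.
# Assumption: a page is dropped only when the count is strictly above the threshold.
BLOCKED_THRESHOLD = 5  # threshold value taken from the quoted description


def is_clean(blocked_requests: int, threshold: int = BLOCKED_THRESHOLD) -> bool:
    """Return True if the page would stay in the index under the assumed rule."""
    return blocked_requests <= threshold


# Blocked-request counts reported in this comment.
observed = {
    "cnn.com": 47,
    "npr.org": 16,
    "developer.hashicorp.com": 9,
}

kept = [site for site, count in observed.items() if is_clean(count)]
dropped = [site for site, count in observed.items() if not is_clean(count)]

print("kept:", kept)
print("dropped:", dropped)
```

Under that reading all three sites land in `dropped`, including the documentation site, which is the point: a threshold of 5 excludes a lot more than ad-heavy news pages.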

Good to know they are working on this.

Is there a legal issue with spoofing the user agent to pose as the Google crawler? Spoofing is certainly enough to get past article paywalls on 99% of the sites I've encountered. At least last I heard, you can also work around the Cloudflare CAPTCHA by just routing requests through a worker on their service.