by fgblanch 521 days ago
I don't know if it would come with the deal, but ByteDance's web crawler is known to make the most requests per day of any AI crawler (src: https://blog.cloudflare.com/declaring-your-aindependence-blo... ). I guess one of Perplexity's challenges is building their own web index, and of course that starts with having a powerful crawler. A powerful crawler is also useful for capturing tokens to train models. If that technology comes with the deal, it makes perfect sense for Perplexity to acquire them.

Funnily enough, the Cloudflare blog identifies Perplexity as engaging in dodgy practices to get around robots.txt denylists:

> Sadly, we’ve observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent. We’ve monitored this activity over time, and we’re proud to say that our global machine learning model has always recognized this activity as a bot, even when operators lie about their user agent.

Clearly not working too well.

Lol, I had to report Facebook's crawler to them as a bot because they misclassified it, even though it was using the documented Facebook crawler UA and coming from Facebook's ASN. Don't expect too much from their global machine. I wonder if this case also included people manually reporting it...