by solid_fuel 8 days ago
I do appreciate you addressing the concerns about traffic hijacking, but at the same time I really don't like having my content run through a text mangler like an LLM. I get the use case, but at the end of the day it's my content and I'm a bit prickly.

That said, I'm not planning to block your crawlers right away; I intend to just add them to a list I maintain for personal reference. I'm mostly interested in correlating the crawling traffic I see with its sources. I have been gathering data about crawling activity and sources, which I display on an embedded map on my site. I have Caddy annotate traffic with a header indicating what the crawler is, and if a fleet behaves nicely, it doesn't get added to the blocklist.
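For readers curious what that annotation step might look like, here is a rough sketch in Python rather than Caddy config. It is not the commenter's actual setup: the pattern list, the `X-Crawler-Label` header name, and both function names are assumptions made up for illustration.

```python
import re

# Illustrative patterns only; a real list would be longer and site-specific.
CRAWLER_PATTERNS = [
    ("googlebot", re.compile(r"Googlebot", re.IGNORECASE)),
    ("bingbot", re.compile(r"bingbot", re.IGNORECASE)),
    ("gptbot", re.compile(r"GPTBot", re.IGNORECASE)),
]


def label_crawler(user_agent: str) -> str:
    """Return a short label for the first matching crawler pattern."""
    for label, pattern in CRAWLER_PATTERNS:
        if pattern.search(user_agent):
            return label
    return "unknown"


def annotate(headers: dict) -> dict:
    """Return a copy of the headers with a hypothetical label header added."""
    annotated = dict(headers)
    # "X-Crawler-Label" is an assumed header name, not one from the thread.
    annotated["X-Crawler-Label"] = label_crawler(headers.get("User-Agent", ""))
    return annotated
```

Downstream logging or a blocklist check could then key off the label instead of re-parsing the user agent each time.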

1 comment

Interesting. In terms of "crawling", the engine I built by default just polls a site's RSS feed on an adjusting cadence, like any other RSS feed reader. On some sites, the engine can do a follow-up scrape of the article link from the RSS feed if the feed doesn't provide the full content of the article. So it's not real crawling, more fetching/scraping if necessary.
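To make the "adjusting cadence" and "follow-up scrape" ideas concrete, here is a minimal sketch. It is not the engine's actual logic: the halve/double rule, the interval bounds, the 500-character threshold, and the function names are all assumptions chosen for illustration.

```python
def next_poll_interval(current_s: float, had_new_items: bool,
                       min_s: int = 900, max_s: int = 86400) -> int:
    """Poll sooner after a feed had new items, back off when it did not.

    The halve/double rule and the 15-minute to 24-hour bounds are assumed.
    """
    proposed = current_s / 2 if had_new_items else current_s * 2
    return int(min(max_s, max(min_s, proposed)))


def needs_followup_fetch(entry: dict, min_chars: int = 500) -> bool:
    """Guess whether a feed entry carries only a teaser, not the full article.

    `entry` is a plain dict with optional "content" and "summary" keys;
    the length threshold is an arbitrary assumption.
    """
    body = entry.get("content") or entry.get("summary") or ""
    return len(body) < min_chars
```

With these defaults, a feed polled hourly that just produced new items would be checked again in 30 minutes, while a quiet one would back off to 2 hours, never exceeding once a day.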

But I hear you.