HN Mirror

	by solid_fuel 16 days ago
	As a sysadmin hosting a few blogs, do you mind sharing which IP ranges you crawl from? Or which user agent your requests use? Thank you.

dchuk 16 days ago

I presume you’re politely asking in order to block? Which is fine, I get it. On my phone right now but can update later.

I do want to ask though (and I should make this clear in a FAQ or something): the way I check RSS feeds uses adaptive scheduling, so I intentionally don’t check any site's feed too often. The summarization is based on the full article content, but I never render that full content on the site (to avoid traffic hijacking concerns). Given that, what’s the concern?
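
For readers wondering what "adaptive scheduling" might look like: the comment doesn't describe the actual algorithm, so the following is only a minimal sketch of one common approach, not the author's implementation. It assumes the poller backs off when a feed is unchanged and speeds up when new items appear. The function name `next_interval` and the 30-minute and 1-day bounds are hypothetical.

```python
from datetime import timedelta

# Assumed bounds, chosen for illustration; the thread gives no real values.
MIN_INTERVAL = timedelta(minutes=30)
MAX_INTERVAL = timedelta(days=1)


def next_interval(current: timedelta, feed_changed: bool) -> timedelta:
    """Return the wait before the next poll of a single feed.

    Halve the wait after a poll that found new items, double it after a
    poll that found nothing new, and clamp the result to the bounds above.
    """
    proposed = current / 2 if feed_changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

Under these assumptions, a blog that posts once a month would drift toward one poll per day, while a busy feed would settle near the 30-minute floor.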

solid_fuel 15 days ago

I do appreciate you addressing the concerns about traffic hijacking, but at the same time I really don't like having my content run through a text mangler like an LLM. I get the use case, but at the end of the day it's my content and I'm a bit prickly.

That said, I'm not necessarily planning to block your crawlers right away; I intend to just add them to a list I maintain for personal reference. I'm mostly interested in correlating the crawling traffic I see with its sources. I have been gathering data about crawling activity and sources, which I display on an embedded map on my site. I have Caddy annotate traffic with a header indicating what the crawler is, and if a fleet behaves nicely, it doesn't get added to the blocklist.

dchuk 14 days ago

Interesting. In terms of "crawling", the engine I built by default just polls a site's RSS feed on an adjusting cadence, like any other RSS feed reader. On some sites, the engine can do a follow-up scrape of the article link from the RSS feed if the feed doesn't provide the full content of the article. So it's not real crawling, more fetching/scraping when necessary.

But I hear you.
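
The "follow-up scrape only if the feed lacks the full article" decision described above could be sketched roughly as below. This is an illustration under stated assumptions, not the engine's actual logic: it treats an item as incomplete when its `content:encoded` (or, failing that, `description`) text is shorter than a made-up threshold. `links_needing_fetch` and `MIN_FULL_TEXT_CHARS` are hypothetical names, and only RSS 2.0 `<item>` elements are handled.

```python
import xml.etree.ElementTree as ET

# Qualified tag for <content:encoded> from the RSS content module.
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"

# Assumed cutoff: bodies shorter than this are treated as summaries only.
MIN_FULL_TEXT_CHARS = 500


def links_needing_fetch(feed_xml: str) -> list[str]:
    """Return the article links whose feed entry lacks full content."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        # Prefer the full-content element, fall back to the description.
        body = item.findtext(CONTENT_ENCODED) or item.findtext("description") or ""
        link = item.findtext("link")
        if link and len(body.strip()) < MIN_FULL_TEXT_CHARS:
            links.append(link.strip())
    return links
```

Only the links returned here would get a second request; feeds that already ship full articles would see nothing beyond the feed poll itself.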
