Hacker News
by golergka 491 days ago
I'm building a service that needs to extract RSS feeds from pages (hntorss.com if you're interested). Nothing else. From any rational point of view, a website owner would actively want this parser to work as easily as possible: the whole point is for users to see the content you publish!

Alas, I still get rate-limited, 400-ed, and otherwise blocked because of my user agent and other bot-detection mechanisms.
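
A minimal sketch of the kind of extraction described above, assuming the page advertises its feed through standard `<link rel="alternate">` autodiscovery tags. The names `FeedLinkParser` and `find_feeds` and the example URL are hypothetical, purely for illustration, and not what the service actually runs:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used for feed autodiscovery (assumed set, not exhaustive).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs from <link rel="alternate" type="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. bare attributes), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    """Return absolute feed URLs advertised in the given HTML (hypothetical helper)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Fetching the page is left out on purpose: that is the step where the user agent and rate limits come into play, and this sketch only covers the parsing once the HTML is in hand.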

1 comments

> the whole point is for users to see the content you publish!

No, the whole point (for most sites) is to make money off the users visiting said site (currently via advertising).

A third-party service that slurps the data and redirects users to a different site to consume it means the original site loses the revenue but still pays the bandwidth cost.

So it's understandable that many sites want to block such agents.

Even if it is not for profit (or especially so), the point of any publication is not just to get people to know something; you at least want people to read what you wrote and appreciate you for it.

Using the Web normally, with search and all, is well-behaved in this regard, but using attribution-stripping technology isn’t.

If your readers don’t know you exist, and you don’t know who your readers are or whether they even exist, you basically become a ghostwriter, a content producer for LLMs (and in many cases some commercial LLM operator makes money off your work, too).

Then you wouldn't have RSS feeds in the first place. I'm talking about sites that decide to have them for one reason or another.