by cratermoon 429 days ago
"Step 3: robots.txt"

Will do nothing to mitigate the problem. As is well known, these bots don't respect it.
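To make the point concrete: robots.txt is purely advisory, so the check only happens if the crawler chooses to run it. Here is a minimal sketch of that voluntary check using Python's standard `urllib.robotparser`. The `allowed` helper, the `ExampleBot` user agent, the rules, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
RULES = """\
User-agent: ExampleBot
Disallow: /images/

User-agent: *
Allow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for target in ("https://example.com/images/a.png", "https://example.com/feed.xml"):
        print(target, allowed(RULES, "ExampleBot", target))
```

Nothing on the server enforces the result. A bot that never calls something like `allowed` simply fetches the URL anyway.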


Would you reckon OP's bot(s) respect it when borrowing content from the large variety (their words) of podcast sources they scrape?

Hi, I'm the author of the blog (though I didn't post it on HN).

I've addressed this topic in another comment above and will copy it here.

I'd encourage you to read up on how the podcast ecosystem works.

Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.

All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.

This is how podcasts have been distributed since they were introduced by Apple in the early 2000s. This is why podcasting still remains an open, decentralized ecosystem.

Replace "podcasts" with "search results" in your comment, and "RSS feed" with "LLM output" and you've got yourself the exact same argument for what's going on today. The company names are different, of course, but not by much because some of the players stayed the same.

Your lack of reply to "do you observe robots.txt when you download content such as images" is basically a "no".

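If they are well-coded, they don't constantly crawl. They save the ETag and Last-Modified headers from each response and send them back as If-None-Match and/or If-Modified-Since, so the server can answer a conditional request with 304 Not Modified instead of the whole feed.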
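For anyone who hasn't implemented this, a rough sketch of a conditional GET with Python's standard `urllib` follows. The function names and the idea of passing the saved validators in as arguments are my own assumptions, not taken from any particular podcast app.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request validators from values saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if the feed is unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; treat it as "no change".
        if err.code == 304:
            return None, etag, last_modified
        raise
```

The client stores the returned `etag` and `last_modified` next to the cached feed and passes them in on the next poll. An unchanged feed then costs the host a header-only 304 rather than a full download.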

Badly behaved RSS readers, on the other hand...

https://rachelbythebay.com/w/2024/05/27/feed/