by drakenot 3815 days ago
This has been my personal project for the past few months. There are around 240k podcasts in the iTunes index and it is fairly trivial to scrape their feed URLs.

Most of the podcast apps that have feed crawler backends (Pocket Casts, Overcast, etc.) poll all 240k podcast feeds fairly frequently. More popular podcasts are polled on the order of every 2-3 minutes, while less popular podcasts may only get polled every 10-15 minutes. This comes out to around 1.5 to 2 billion web requests per month.
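A rough sanity check on that figure, as a sketch: the 30-day month and the idea of a single average poll interval across all feeds are my assumptions here, not measured numbers.

```python
# Back-of-the-envelope request volume for polling every feed on a fixed interval.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def monthly_requests(feed_count, poll_interval_minutes):
    """Requests per month if every feed is polled once per interval."""
    return feed_count * MINUTES_PER_MONTH // poll_interval_minutes


FEEDS = 240_000  # approximate size of the iTunes index mentioned above

for interval in (5, 7):
    total = monthly_requests(FEEDS, interval)
    print(f"every {interval} min: {total / 1e9:.2f} billion requests/month")
```

So an average interval somewhere between 5 and 7 minutes across the whole index lands in the 1.5 to 2 billion range.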

It is important when you are making your feed requests that you send the conditional request headers If-Modified-Since and If-None-Match, filled in from the Last-Modified and ETag values the server returned on your previous poll. These will speed up your requests significantly by having the servers send you a Not Modified (304) response if nothing has changed since your last poll. Something like 60% of the feeds support this.
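A minimal sketch of a conditional GET using only the Python standard library. The function names and the return tuple are hypothetical, just for illustration, and this is not my actual crawler code.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous poll."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url, etag=None, last_modified=None, timeout=30):
    """Return (status, body, etag, last_modified); body is None on a 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.status,
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError, so handle it here.
        if err.code == 304:
            return (304, None, etag, last_modified)
        raise
```

Store the returned ETag and Last-Modified values alongside the feed and pass them back in on the next poll.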

You'll also want to keep a hash of the feed content. That way, when you get back a 200 response with the feed contents you can do a quick check to see if the feed content has changed since your last poll (for those servers that don't support etag). This will even further reduce the number of feeds you need to actually parse.
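Something like the following would do for the hash check. SHA-256 is just my pick for the sketch (any stable digest works), and the function names are made up.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """Digest of the raw response body, stored next to the feed record."""
    return hashlib.sha256(body).hexdigest()


def needs_parsing(body: bytes, previous_hash) -> bool:
    """True when the body differs from what was seen on the last poll."""
    return content_hash(body) != previous_hash
```

Hashing the raw bytes before any parsing keeps the check cheap, since an unchanged feed never reaches the XML parser.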

For those that returned a 200 and had a different hash, you now need to parse the feeds. There are a large number of podcasts which insert dynamic data into their feed. Some insert dynamic tracking query parameters into feed items. Others set some of the RSS feed dates to the current timestamp (which is incorrect). These feeds with dynamic data will have to be fully parsed every time, which is a bummer. I've considered a future enhancement to my crawler that detects the feeds that do this and flips a bozo bit on them so I poll them less frequently.
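One way the bozo bit could work, purely as a sketch of an idea I haven't built: count consecutive polls where the hash changed but parsing turned up no new items. The threshold of 5 and all the names are arbitrary assumptions.

```python
from dataclasses import dataclass


@dataclass
class FeedState:
    noisy_polls: int = 0  # consecutive polls with a new hash but no new items
    bozo: bool = False    # once set, the scheduler can poll this feed less often


def record_poll(state, hash_changed, new_item_count, threshold=5):
    """Update the state after one poll and return it."""
    if hash_changed and new_item_count == 0:
        state.noisy_polls += 1
    else:
        state.noisy_polls = 0
    if state.noisy_polls >= threshold:
        state.bozo = True  # sticky in this sketch: never cleared automatically
    return state
```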

The majority of podcast feeds are RSS 2.0. I'd have to check, but I think < 2% of podcasts in my database use Atom feeds. This was something that surprised me when I started the project. I spent a lot of time worrying about Atom feeds and older RSS feeds, but you could almost ignore them entirely and still capture most of the podcasts.

Parsing these feeds robustly is a whole topic unto itself. Many RSS/XML parsers are very strict. However, for this use case you don't want strictness. You want to extract the info out of the maximum number of feeds possible, even if some of them are malformed in some way. Perhaps the user didn't properly specify an XML namespace they are using. Or they are missing a closing tag for an element, etc.
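To give a flavor of what lenient means, here is a toy sketch that tries the strict standard-library parser first and falls back to a regex when the XML is malformed. The regex fallback is my stand-in for illustration only. A real crawler would more likely use a recovering parser, and this only pulls out enclosure URLs.

```python
import re
import xml.etree.ElementTree as ET

# Tolerant fallback: find url="..." inside any <enclosure ...> tag.
# Note: unlike a real parser, this does not unescape entities such as &amp;.
ENCLOSURE_URL = re.compile(
    r"""<enclosure\b[^>]*?\burl\s*=\s*["']([^"']+)["']""", re.IGNORECASE
)


def enclosure_urls(feed_text):
    """Return media URLs from a feed, even if the XML is not well formed."""
    try:
        root = ET.fromstring(feed_text)
    except ET.ParseError:
        # Strict parsing failed (e.g. a missing closing tag), so salvage
        # what we can instead of discarding the whole feed.
        return ENCLOSURE_URL.findall(feed_text)
    return [e.get("url") for e in root.iter("enclosure") if e.get("url")]
```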

Because the RSS spec doesn't require a GUID for items in the feed, you have to come up with your own algorithm for matching items with your new feed response. Many articles will tell you to use GUID if available, and if not, use Link. Or some combination of the above. However, for podcasts, you can almost always be assured that a podcast will have a url to the media file. So, I suggest using that as part of your matching algorithm in the absence of a GUID.
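A sketch of that matching order for RSS 2.0 items: GUID first, then the media file URL from the enclosure, then Link as a last resort. The tuple-shaped key and the function name are my own choices for the example.

```python
import xml.etree.ElementTree as ET


def item_key(item):
    """Return a (kind, value) key identifying an RSS <item>, or None."""
    guid = (item.findtext("guid") or "").strip()
    if guid:
        return ("guid", guid)
    enclosure = item.find("enclosure")
    if enclosure is not None:
        url = (enclosure.get("url") or "").strip()
        if url:
            return ("enclosure", url)
    link = (item.findtext("link") or "").strip()
    if link:
        return ("link", link)
    return None  # nothing usable; the caller has to decide what to do


item = ET.fromstring(
    '<item><title>Ep 1</title>'
    '<enclosure url="http://example.com/ep1.mp3" type="audio/mpeg"/></item>'
)
print(item_key(item))
```

Keeping the kind in the key avoids a GUID from one episode colliding with a URL from another.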

I plan on writing a more detailed article on this project as I get closer to finishing my crawler and submitting it to HN. As a further constraint, I'm attempting to get the monthly hosting costs for my distributed crawler down to around $100/mo while keeping it capable of updating every podcast feed every 5 minutes.

1 comment

Please please please donate your invaluable collection to archive.org!