|
(I want to preface this comment by saying I don't mean in any way to rebut the thrust of the article, and I worried it could be taken that way; rather, it's a commentary on the inconsistency of feed implementations, which I don't find discussed much.) As a somewhat orthogonal observation, I've found it's surprisingly hard to write a well-behaved crawler whose heuristics work consistently across a variety of feed providers. There's the usage of published vs pubdate vs updated and which is changed when (or all three just containing 1970 and/or the current time), reordered feeds, items published out of order _regardless_ of the scheme used, changing urls/ids, etc. Whatever set of heuristics one uses for some sites may not apply to others. Now, this raises the question of "why not parameterize and tune", and yes, this is largely what I've resorted to, but, to the core point, it's more of a moving target than one would expect from how ostensibly simple RSS is. At the end of the day, in many cases I just fall back to using a cache of recently-seen-urls and, when possible, short-circuit the enumeration when I cross one. (Similar disclaimer: I do love me some RSS, I've just never had a good opportunity to rant about this.) |
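For anyone curious, the recently-seen-urls fallback can be sketched roughly like this. It is a minimal illustration and not any particular crawler's code: `SeenCache`, `new_items`, and the `short_circuit` flag are hypothetical names, and it assumes the feed lists items newest first and that the item URLs have already been extracted by whatever parser is in use.

```python
from collections import OrderedDict
from typing import Iterable, List


class SeenCache:
    """Bounded cache of recently seen item URLs (hypothetical helper).

    The oldest entries are evicted first once max_size is exceeded.
    """

    def __init__(self, max_size: int = 1000) -> None:
        self.max_size = max_size
        self._urls: "OrderedDict[str, None]" = OrderedDict()

    def __contains__(self, url: str) -> bool:
        return url in self._urls

    def add(self, url: str) -> None:
        self._urls[url] = None
        self._urls.move_to_end(url)  # mark as most recently seen
        while len(self._urls) > self.max_size:
            self._urls.popitem(last=False)  # drop the oldest entry


def new_items(urls: Iterable[str], seen: SeenCache,
              short_circuit: bool = True) -> List[str]:
    """Return URLs not yet in the cache, in feed order.

    Assumes the feed is ordered newest first. With short_circuit=True,
    enumeration stops at the first cached URL; feeds known to reorder
    items can pass short_circuit=False to scan every entry instead.
    """
    fresh: List[str] = []
    batch = set()
    for url in urls:
        if url in batch:
            continue  # duplicate within this same fetch
        if url in seen:
            if short_circuit:
                break  # everything past this point is assumed old
            continue
        fresh.append(url)
        batch.add(url)
    # Insert oldest first so the newest items are evicted last.
    for url in reversed(fresh):
        seen.add(url)
    return fresh
```

The `short_circuit` flag is the kind of per-feed knob the "parameterize and tune" approach ends up with: well-ordered feeds can stop early, while feeds that reorder or backfill items need the full scan.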