Last Monday I wrote a post exploring the RSS Validator’s claim that RSS 2.0’s pubDate and Dublin Core’s dc:date elements were considered duplicates. I asserted that they overlap, but have different semantic meanings and precision.
The feedback from both gentlemen concerned this section:
Which leads us to the justification for removing dc:date, so as not to “confuse news aggregators”. As someone who maintains and builds aggregators, I don’t buy this. I wouldn’t think anything introduced with namespaces would take precedence over a mandatory element like pubDate.
Geoff (not the railfan Geoff from the last post!) chimed in:
I agree with your comments. Compatibility concerns between [feed formats] were always overblown (and lead to that delicious irony of Atom). It’s very easy to write a decision tree based on the presented version of RSS or Atom to weight which date to respect (pubDate in RSS 0.9x and 2.0, RDF like your Dublin Core example in RSS 1.0).
The “delicious irony of Atom” referred to an earlier conversation we had where a new incompatible format was introduced to address incompatibilities. I actually appreciate what Atom set out to do, but there’s a well-worn xkcd that shows the practical reality.
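The decision tree Geoff describes can be sketched roughly as follows. This is a minimal illustration in Python using the standard library's xml.etree.ElementTree, not Perl and not anyone's actual aggregator code. The names feed_kind, DATE_PREFERENCE and pick_date are hypothetical, and the Atom ordering (updated before published) is an assumption that goes beyond what Geoff's comment covers.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the lookup paths below.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "atom": "http://www.w3.org/2005/Atom",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss1": "http://purl.org/rss/1.0/",
}


def feed_kind(root):
    """Classify a parsed feed by its root element (hypothetical helper)."""
    if root.tag == "rss":
        return "rss2"  # RSS 0.9x and 2.0 share the <rss> root
    if root.tag == "{" + NS["rdf"] + "}RDF":
        return "rss1"
    if root.tag == "{" + NS["atom"] + "}feed":
        return "atom"
    return "unknown"


# For each kind of feed, the date elements to try, most trusted first.
# The Atom ordering is an assumption, not something from the comment above.
DATE_PREFERENCE = {
    "rss2": ["pubDate", "dc:date"],
    "rss1": ["dc:date"],
    "atom": ["atom:updated", "atom:published"],
}


def pick_date(kind, item):
    """Return the raw text of the first non-empty date element, or None."""
    for path in DATE_PREFERENCE.get(kind, []):
        text = item.findtext(path, namespaces=NS)
        if text and text.strip():
            return text.strip()
    return None
```

With this ordering, an RSS 2.0 item that carries both elements yields its pubDate, and dc:date is only consulted when pubDate is missing or empty. The dates are returned as raw strings; parsing the two different date formats is left out of the sketch.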
All of these choices [about updating an article when a feed changes] hinge on being able to identify that the new edited article as being “the same” or not as the old article. You need a unique ID attached to the article for this to work, otherwise it’s a guessing game.
I’ve omitted how I find the “date” in the first place, but it’s a similar list of “Look for thing 1, if that fails looks for thing 2, if that fails …”. In my case I fall back to using the “title” of the article before I use the “date”, but that’s not necessarily the best thing to do if you want to avoid duplicate articles (and so other feed readers/parsers probably won’t do this).
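The "look for thing 1, if that fails look for thing 2" approach to deciding whether two articles are the same might look something like the sketch below. It is written in Python for illustration and is not Hales's actual code. The field names id, title and date, and the functions identity_key and merge, are hypothetical, and entries are assumed to have already been pulled out of the feed into plain dictionaries.

```python
def identity_key(entry):
    """Return (field, value) for the first usable identifier, or None.

    The order follows the quote above: a real unique ID first, then the
    title, and only then the date.
    """
    for field in ("id", "title", "date"):
        value = (entry.get(field) or "").strip()
        if value:
            return (field, value)
    return None


def merge(store, entry):
    """Insert or update an entry in store, keyed by its identity."""
    key = identity_key(entry)
    if key is None:
        return "skipped"  # nothing to match on, so don't guess
    status = "updated" if key in store else "new"
    store[key] = entry
    return status
```

Keeping the field name in the key means an entry matched by title can never collide with one matched by ID. The weakness mentioned in the quote is visible here too: if an article without an ID has its title edited, the fallback key changes and the edit shows up as a new article.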
This ties in with my Perl post last month about moving to Perl’s XML::LibXML. I commented that I prefer using a general XML parser over specific packages for RSS, Atom, or OPML, but didn’t give more detail why. Aside from only needing to learn how one package works, it’s so that I can handle edge cases like the ones Hales describes. It turns out that even packages written in the same language handle updates, identifiers, dates, and other nomenclature subtly differently. I prefer pulling in data from whichever format is presented, and handling the dates and other data in the same data structure. I like to think Perl’s hashes and syntactic sugar are especially suited to this, but that’s for another post.
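To make the "same data structure" idea concrete, here is a small sketch of reading either format with a general XML parser and producing identical dictionaries. It uses Python's xml.etree.ElementTree as a stand-in for XML::LibXML, so it illustrates the approach rather than the Perl code itself. The entries function and the id, title and date keys are hypothetical, and only RSS 2.0 and Atom are handled.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def entries(xml_text):
    """Yield one dict per article, with the same keys for every format."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        for item in root.iter("item"):
            yield {
                "id": item.findtext("guid"),
                "title": item.findtext("title"),
                "date": item.findtext("pubDate"),
            }
    elif root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            yield {
                "id": entry.findtext(ATOM + "id"),
                "title": entry.findtext(ATOM + "title"),
                "date": entry.findtext(ATOM + "updated"),
            }
```

Missing elements come through as None rather than raising an error, so whatever consumes these dictionaries can apply its own fallbacks without caring which format the feed arrived in.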
Or as Hales summarised, “welcome to RSS!” It’s funny how I find these sorts of things challenging but ultimately fun and rewarding, as opposed to getting the blessing of a specific API written by a large social network that wants to own the Internet and control our social graphs. Walled gardens are pretty, as long as you follow their [ever changing] rules and terms. The open web is messy.