Recipe: Parsing RSS and Atom Feeds

Sometimes you need to ingest a remote RSS or Atom feed so that its content can be made available within a web application. Incorporating content from other sources is one of the easiest ways to expand the content offerings of a web site, and standards like RSS and Atom were designed precisely to support this kind of syndication.

The first thing that occurs to developers when this kind of requirement comes up is the dawning realization that they may have to write some really ugly XML-parsing code. It just sounds like one of those dreary, painful programming tasks that occasionally come down the pike.

The Problem

Ingest RSS or Atom feeds and parse the content so that it can be repurposed for the needs of a Rails web application.

The Solution

The HTTParty gem makes it almost trivial to parse both RSS and Atom feeds. Listing 1 shows the Ruby code for the Feed class.

Listing 1: The Feed Class

  require 'httparty'
  require 'uri'

  class Feed
    include HTTParty
    format :xml

    def initialize(feed_url)
      @feed_url = feed_url
    end
  
    def feed_url
      @feed_url
    end
  
    def url
      uri = URI.parse(@feed_url)
      strip_feed_extension(uri.scheme + '://' + uri.host + uri.path)
    end

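    # Fetches and parses the feed. Returns the parsed contents of the
    # top-level <feed> element, or "" if the XML cannot be parsed.
    # Note: params is accepted but not currently used.
    def latest(params={})
      response = {}
      begin
        response = Feed.get(@feed_url)
      rescue REXML::ParseException => e
        RAILS_DEFAULT_LOGGER.warn("forum feed parse error: " + e.message)
        response["feed"] = ""
      end
    
      response["feed"]
    end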
  
    private
  
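      # Remove a trailing .atom or .rss extension. The dot is escaped and
      # the pattern is anchored so that text elsewhere in the URL (for
      # example, a host name containing "atom") is left alone.
      def strip_feed_extension(uri)
        uri.sub(/\.(atom|rss)\z/, '')
      end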
  end

Save the class as feed.rb in the lib directory of your Rails application. Then run script/console to bring up a console.

> f = Feed.new('http://www.keenertech.com/articles.atom')
> feed = f.latest

That's all there is to it. The feed has been parsed already. So, let's view some summary information about the feed.

> feed['title']
KeenerTech.com
> feed['link']['href']
http://www.keenertech.com/articles.atom

Well, that's great, but what about the entries?

> entries = feed['entry']
[ {}, {}, …]
> e = entries[0]
> e['title']
Leveraging Rails to Build Facebook Apps
> e['author']['name']
David Keener
> e['link']['href']
http://www.keenertech.com/articles/2010/09/29/leveraging-rails-to-build-facebook-apps
> e['summary']
My presentation on "Leveraging Rails to Build Facebook Apps," which I just gave at SunnyConf, is now available online. This presentation is a distillation of some of the practical tactics that my development team at MetroStar Systems has used to create highly successful…
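The keys used above ('feed', 'entry', link as an element with an href attribute) are specific to Atom. An RSS 2.0 document nests its items under rss/channel/item and carries the link as plain text, so the same lookups would come back empty. Here is a rough sketch of one way to paper over the difference. It works on the full parsed hash rather than on the return value of latest, and the helper names (as_list, link_url, normalize_feed) and the output keys are my own inventions for illustration, not anything HTTParty provides. It also assumes titles and summaries come through as plain strings, which may not hold for every feed.

```ruby
# Sketch only: as_list, link_url and normalize_feed are hypothetical
# helpers, not part of HTTParty or the Feed class.

# XML-to-hash parsers typically return a single Hash when there is one
# child element and an Array when there are several; always return an Array.
def as_list(value)
  return [] if value.nil?
  value.is_a?(Array) ? value : [value]
end

# Atom links are elements with an href attribute (possibly several of
# them); RSS 2.0 links are plain text. Return a URL string either way.
def link_url(link)
  first = as_list(link).first
  first.is_a?(Hash) ? first['href'] : first
end

# Convert a parsed Atom or RSS 2.0 document into one common shape.
def normalize_feed(parsed)
  if parsed['feed']            # Atom: <feed><entry>...</entry></feed>
    feed = parsed['feed']
    items = as_list(feed['entry'])
    summary_key = 'summary'
  elsif parsed['rss']          # RSS 2.0: <rss><channel><item>...</item></channel></rss>
    feed = parsed['rss']['channel'] || {}
    items = as_list(feed['item'])
    summary_key = 'description'
  else
    return { 'title' => nil, 'link' => nil, 'entries' => [] }
  end

  entries = items.map do |item|
    {
      'title'   => item['title'],
      'link'    => link_url(item['link']),
      'summary' => item[summary_key]
    }
  end

  { 'title' => feed['title'], 'link' => link_url(feed['link']), 'entries' => entries }
end
```

With something like this in place, the view code can read entries from a single structure regardless of which flavor of feed was fetched. Other variations (RSS 1.0, Atom entries that only carry content) would need their own branches.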

Now, to quote Spider-Man, "with great power comes great responsibility." HTTParty is just using REXML to do the parsing. REXML isn't the speediest parser around, but it's more than good enough for most processing tasks.
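To get a feel for what the parsing step involves, here is a small sketch that uses REXML directly on an inline Atom document and builds a hash similar in shape to the one shown in the console session. The sample XML, the child helper and the parse_atom method are all made up for illustration. This is not how HTTParty does it internally, just the sort of hand-written code the gem saves you from.

```ruby
require 'rexml/document'

# A tiny, made-up Atom document used only for this illustration.
ATOM_SAMPLE = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Example Feed</title>
    <link href="http://example.com/articles.atom"/>
    <entry>
      <title>First Post</title>
      <link href="http://example.com/articles/first-post"/>
      <summary>A short summary.</summary>
    </entry>
  </feed>
XML

# Hypothetical helper: first direct child element with the given tag name.
# Matching on the local name sidesteps XPath namespace handling.
def child(element, name)
  element.elements.find { |el| el.name == name }
end

# Hypothetical helper: turn an Atom document into nested hashes.
def parse_atom(xml)
  root = REXML::Document.new(xml).root

  entries = root.elements.select { |el| el.name == 'entry' }.map do |entry|
    link = child(entry, 'link')
    {
      'title'   => child(entry, 'title')&.text,
      'link'    => { 'href' => link && link.attributes['href'] },
      'summary' => child(entry, 'summary')&.text
    }
  end

  feed_link = child(root, 'link')
  {
    'title' => child(root, 'title')&.text,
    'link'  => { 'href' => feed_link && feed_link.attributes['href'] },
    'entry' => entries
  }
end

feed = parse_atom(ATOM_SAMPLE)
puts feed['title']
puts feed['entry'][0]['link']['href']
```

Even for this trivial document, the hand-rolled version is noticeably longer than the Feed class, which is the main argument for letting HTTParty handle it.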

Still, for performance reasons, you wouldn't want to fetch and parse a remote XML feed every time a particular web page is requested. This is the type of task that demands some form of data caching, whether that means memcached or simply storing the feed data in the database for later use.
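As a minimal sketch of the idea, the class below keeps parsed feeds in memory and only re-runs the expensive fetch once an entry is older than a configurable number of seconds. FeedCache is a hypothetical name of my own, not a Rails or HTTParty class, and the injectable clock exists only to make the expiry logic easy to exercise. A real application would more likely lean on memcached or a database table, since an in-process hash is not shared between server processes and disappears on restart.

```ruby
# Sketch only: FeedCache is a hypothetical, in-process cache with expiry.
class FeedCache
  # ttl_seconds: how long a cached value stays fresh.
  # clock: any object responding to #call that returns the current time;
  #        it defaults to the system clock.
  def initialize(ttl_seconds, clock = -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock
    @store = {}
  end

  # Return the cached value for key if it is still fresh; otherwise run
  # the block, remember its result along with the time, and return it.
  def fetch(key)
    now = @clock.call
    entry = @store[key]
    return entry[:value] if entry && (now - entry[:stored_at]) < @ttl

    value = yield
    @store[key] = { value: value, stored_at: now }
    value
  end
end

# Possible usage with the Feed class from Listing 1 (not executed here,
# because it needs network access):
#
#   FEED_CACHE = FeedCache.new(15 * 60)   # refresh at most every 15 minutes
#   url = 'http://www.keenertech.com/articles.atom'
#   feed = FEED_CACHE.fetch(url) { Feed.new(url).latest }
```

The 15-minute figure in the usage comment is an arbitrary example; the right expiry depends on how often the remote feed changes and how stale your pages are allowed to be.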



Comments

By dkeener (David Keener) on Sunday, February 13, 2011 at 02:27 PM EST

Technically, the code parses the XML and stores it in memory as nested data structures. You have to pull out the elements that you need for RSS, RSS2 and Atom.

The code above works for one flavor of Atom, but not for some variations of RSS, RSS2 and Atom (where the tags are slightly different). I should have made that clearer in the article.


