Using Nokogiri to Parse XML Documents

August 30, 2014

The other day I needed to parse some XML. I used Crack, which is a great library, but I wanted to check out what other libraries I could use. Today, going through The Ruby Toolbox, I came across Nokogiri, which is a widely used library.

Straight from the Nokogiri site:

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.

Nokogiri seems much better suited to parsing HTML, since it has methods that make it easy to select elements using CSS3 selectors. For parsing XML, Nokogiri gives you access to the document via XPath. I have never been a big fan of XPath; it does simplify accessing elements in an XML file, but it’s still not my favorite.

Enough talk; let’s parse an XML document. We will use the same document that I used in the previous post on parsing XML. As a reminder, here is the document we are parsing:

Our requirement is simple: pull out the shipping address and display the name, street, city, state, zip and country.

Nokogiri needs a few supporting libraries, so you will most probably need to install libxml2 and libxslt. On a Mac, the easiest way to install both is to use Homebrew. There are numerous tutorials on installing Homebrew; use one of those if you don’t have it installed yet.

Once you have libxml2 and libxslt, you can install Nokogiri by adding it to your Gemfile or by installing the gem directly:

gem install nokogiri

Here is the code to parse the XML and pull out the details we need:

First we read the XML file into a variable and use Nokogiri to parse the XML document. To get our shipping address we use the XPath query:

/PurchaseOrder/Address[@Type='Shipping']

That says, “Look for the Address with a Type of Shipping under PurchaseOrder.” We then have the XML fragment necessary to get to the name, street, city, state, zip and country elements. We can get to these elements by running another XPath query for each element and calling ‘text()’ to get the values. By putting all the elements we care about into a ‘details’ variable, we can call the puts method to print the results to the screen.

As I said earlier, Nokogiri seems to be better suited to parsing HTML, given its CSS3 selectors. I am not a big fan of XPath, so if I had to choose which library to use, I would much rather use Crack than Nokogiri.



My name is Deon Heyns and I am a developer learning things and documenting them in realtime. Python, Ruby, Scala, .NET, and Groovy are all languages I have written code in. I appeared in the New York Post once. I host my code up at GitHub and Bitbucket so have a look at my code, fork it and send those pull requests.
