[Awesome Ruby Gem] Use nokogiri gem to work with XML and HTML in Ruby

Nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant because it relies on native parsers like libxml2 (C) and xerces (Java).

Features Overview

  • DOM Parser for XML, HTML4, and HTML5

  • SAX Parser for XML and HTML4

  • Push Parser for XML and HTML4

  • Document search via XPath 1.0

  • Document search via CSS3 selectors, with some jQuery-like extensions

  • XSD Schema validation

  • XSLT transformation

  • “Builder” DSL for XML and HTML documents

Installation

You can install it as a gem:

$ gem install nokogiri

or add it into a Gemfile (Bundler):

# Gemfile

# sparklemotion/nokogiri: Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
# https://github.com/sparklemotion/nokogiri
gem 'nokogiri', '1.12.2'

Then, run bundle install.

$ bundle install

Usages

Nokogiri is a large library, so it is hard to summarize briefly. The project provides longer, real-world examples in its tutorials at https://nokogiri.org/.

Parsing and Querying

Here is an example of parsing and querying a document:

#! /usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

# Fetch and parse HTML document
doc = Nokogiri::HTML(URI.open('https://nokogiri.org/tutorials/installing_nokogiri.html'))

# Search for nodes by css
doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end

# Search for nodes by xpath
doc.xpath('//nav//ul//li/a', '//article//h2').each do |link|
  puts link.content
end

# Or mix and match
doc.search('nav ul.menu li a', '//article//h2').each do |link|
  puts link.content
end

Encoding

Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return a string containing markup (like to_xml, to_html and inner_html) will return a string encoded like the source document.

WARNING

Some documents declare one encoding, but actually use a different one. In these cases, which encoding should the parser choose?

Data is just a stream of bytes. Humans add meaning to that stream. Any particular set of bytes could be valid characters in multiple encodings, so detecting encoding with 100% accuracy is not possible. libxml2 does its best, but it can’t be right all the time.

If you want Nokogiri to handle the document encoding properly, your best bet is to explicitly set the encoding. Here is an example of explicitly setting the encoding to EUC-JP on the parser:

doc = Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')

References

[1] sparklemotion/nokogiri: Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. - https://github.com/sparklemotion/nokogiri

[2] nokogiri | RubyGems.org | your community gem host - https://rubygems.org/gems/nokogiri/

[3] Nokogiri - https://nokogiri.org/