[Awesome Ruby Gem] Use the nokogiri gem to work with XML and HTML in Ruby
Nokogiri
Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant by relying on native parsers like libxml2 (C) and xerces (Java).
Features Overview
- DOM Parser for XML, HTML4, and HTML5
- SAX Parser for XML and HTML4
- Push Parser for XML and HTML4
- Document search via XPath 1.0
- Document search via CSS3 selectors, with some jQuery-like extensions
- XSD Schema validation
- XSLT transformation
- “Builder” DSL for XML and HTML documents
Installation
You can install it as a gem:
    gem install nokogiri
or add it to a Gemfile (Bundler):
    # Gemfile
    gem 'nokogiri'
Then, run bundle install.
    bundle install
Usage
Nokogiri is a large library, and so it’s challenging to briefly summarize it. We’ve tried to provide long, real-world examples at Tutorials.
Parsing and Querying
Here is example usage for parsing and querying a document:
Encoding
Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return a string containing markup (like to_xml, to_html and inner_html) will return a string encoded like the source document.
WARNING
Some documents declare one encoding, but actually use a different one. In these cases, which encoding should the parser choose?
Data is just a stream of bytes. Humans add meaning to that stream. Any particular set of bytes could be valid characters in multiple encodings, so detecting encoding with 100% accuracy is not possible. libxml2 does its best, but it can’t be right all the time.
If you want Nokogiri to handle the document encoding properly, your best bet is to explicitly set the encoding. Here is an example of explicitly setting the encoding to EUC-JP on the parser:
    doc = Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')