Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant by relying on native parsers like libxml2 (C) and xerces (Java).
DOM Parser for XML, HTML4, and HTML5
SAX Parser for XML and HTML4
Push Parser for XML and HTML4
Document search via XPath 1.0
Document search via CSS3 selectors, with some jQuery-like extensions
XSD Schema validation
“Builder” DSL for XML and HTML documents
You can install it as a gem:
gem install nokogiri
or add it into a Gemfile (Bundler):
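A minimal Gemfile entry might look like the sketch below. The `source` line and the absence of a version constraint are assumptions for illustration; pin a version if your project needs one.

```ruby
# Gemfile
# Assumes the default public RubyGems source.
source "https://rubygems.org"

gem "nokogiri"
```

Running `bundle install` afterwards installs the gem alongside the rest of the bundle.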
Nokogiri is a large library, and so it’s challenging to briefly summarize it. We’ve tried to provide long, real-world examples at Tutorials.
Parsing and Querying
Here is example usage for parsing and querying a document:
Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return a string containing markup (like inner_html) will return a string encoded like the source document.
Some documents declare one encoding, but actually use a different one. In these cases, which encoding should the parser choose?
Data is just a stream of bytes. Humans add meaning to that stream. Any particular set of bytes could be valid characters in multiple encodings, so detecting encoding with 100% accuracy is not possible. libxml2 does its best, but it can’t be right all the time.
If you want Nokogiri to handle the document encoding properly, your best bet is to explicitly set the encoding. Here is an example of explicitly setting the encoding to EUC-JP on the parser:
doc = Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')