Command-line tool for spidering sites and extracting XML/HTML content
Xidel is a robust tool for spidering, extracting and transforming XML/HTML content from the command line.
Think of it as curl with a CSS and XPath/XQuery engine (among other features) attached.
Xidel doesn't seem to be in the package management repositories I normally use, but you can download it from the Xidel site.
The following example will (1) download a web page, (2) extract a list of links (specified via a CSS selector) from it, (3) download the page corresponding to each of those links, and finally (4) extract specific pieces of content (specified by CSS selectors) from each page:
xidel [URL-OF-INDEX-PAGE] \
  --follow "css('[CSS-SELECTOR-FOR-LINKS]')" \
  --css "[CSS-SELECTOR-FOR-SOME-TEXT]" \
  --extract "inner-html(css('[CSS-SELECTOR-FOR-SOME-HTML]'))"
As a concrete example, the command:
$ xidel http://reddit.com -f "css('a')" --css title
will download every page linked from the reddit.com homepage and print the content of its <title> tag.
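If you use this follow-then-extract pattern often, it could be wrapped in a small shell function. The following is only a sketch: the name xidel_follow and its three positional arguments are hypothetical (not part of Xidel), it uses only the --follow and --css options shown above, and it assumes the link selector contains no single quotes.

```shell
# Hypothetical helper, not part of Xidel: follow every link matched by a
# CSS selector on an index page, then print the text matched by a second
# CSS selector on each linked page.
# Usage: xidel_follow INDEX_URL LINK_SELECTOR CONTENT_SELECTOR
xidel_follow() {
    if [ "$#" -ne 3 ]; then
        echo "usage: xidel_follow INDEX_URL LINK_SELECTOR CONTENT_SELECTOR" >&2
        return 2
    fi
    index_url=$1
    link_selector=$2
    content_selector=$3
    # Assumption: link_selector has no single quote in it, because it is
    # spliced directly into the css('...') expression below.
    xidel "$index_url" \
        --follow "css('$link_selector')" \
        --css "$content_selector"
}
```

With that defined, `xidel_follow http://reddit.com a title` runs the same command as the reddit example, with --follow written out in place of -f.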
There are several more examples on the Xidel site.