19 tool snippets

Command-line tool for spidering sites and extracting XML/HTML content

Xidel is a robust tool for spidering, extracting and transforming XML/HTML content from the command line.

It's like wget or curl with a CSS and XPath/XQuery engine (among other features) attached.

xidel doesn't seem to be in the package management repositories I normally use, but you can download it here.

The following example will (1) download a web page, (2) extract a list of links (specified via CSS selector) from it, (3) download the page corresponding to each of those links and finally (4) extract specific pieces of content (specified by CSS selectors) from each page:

xidel [URL-OF-INDEX-PAGE] \
  --follow "css('[CSS-SELECTOR-FOR-LINKS]')" \
  --css "[CSS-SELECTOR-FOR-SOME-TEXT]" \
  --extract "inner-html(css('[CSS-SELECTOR-FOR-SOME-HTML]'))"

As a concrete example, the command:

$ xidel http://reddit.com -f  "css('a')" --css title

will download every page linked from the reddit.com homepage and print the content of its title tag.

There are several more examples on the Xidel site.

Published 11 Feb 2014
Tagged linux, tool, xml, css, html, xpath, one-liner and ops.

 

Backup or mirror a website using wget

To create a local mirror or backup of a website with wget, run:

wget -r -l 5 -k -w 1 --random-wait <URL>

Where:

  • -r (or --recursive) will cause wget to recursively download files
  • -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5, use inf for infinite recursion)
  • -k (or --convert-links) will cause wget to convert links in the downloaded documents so that the files can be viewed locally
  • -w N (or --wait=N) will cause wget to wait N seconds between requests
  • --random-wait will cause wget to randomly vary the wait time to 0.5x to 1.5x the value specified by --wait

Some additional notes:

  • --mirror (or -m) can be used as a shortcut for -r -N -l inf --no-remove-listing which enables infinite recursion and preserves both the server timestamps and FTP directory listings.
  • -np (--no-parent) can be used to limit wget to files below a specific "directory" (path).
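
The notes above can be combined into a single invocation. The following is only a sketch: mirror_site is a hypothetical helper name, and it is wrapped in a function so nothing is downloaded until you call it with a URL of your own.

```shell
# mirror_site is a hypothetical helper; the URL is supplied by the caller.
# --mirror       shortcut for -r -N -l inf --no-remove-listing
# --no-parent    stay below the starting path
# --convert-links, --wait, --random-wait behave as described above
mirror_site() {
  wget --mirror --no-parent --convert-links --wait=1 --random-wait "$1"
}
```

For example, mirror_site <URL> would mirror everything at or below the path given in <URL>.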
Published 10 Feb 2014

 

Pre-generate pages or load a web cache using wget

Many web frameworks and template engines defer generating the HTML version of a document until the first time it is accessed. This can make the first hit on a given page significantly slower than subsequent hits.

You can use wget to pre-cache web pages using a command such as:

wget -r -l 3 -nd --delete-after <URL>

Where:

  • -r (or --recursive) will cause wget to recursively download files
  • -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5, use inf for infinite recursion)
  • -nd (or --no-directories) will prevent wget from creating local directories to match the server-side paths
  • --delete-after will cause wget to delete each file as soon as it is downloaded (so the command leaves no traces behind.)
Published 10 Feb 2014

 

Mapping port 80 to port 3000 using iptables

Port numbers less than 1024 are considered "privileged" ports, and you generally must be root to bind a listener to them.

Rather than running a network application as root, map the privileged port to a non-privileged one:

sudo iptables -A PREROUTING -t nat -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000

Now requests to port 80 will be forwarded on to port 3000.
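
To inspect or undo the mapping later, something like the following should work. This is a sketch that assumes the rule was added exactly as shown above (same interface and ports). The function names are hypothetical, and the commands are wrapped in functions so nothing touches the firewall until you call them.

```shell
# List the rules in the nat table's PREROUTING chain, with their positions
show_redirects() {
  sudo iptables -t nat -L PREROUTING -n --line-numbers
}

# Delete the rule by repeating its specification with -D instead of -A
remove_redirect() {
  sudo iptables -t nat -D PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000
}
```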

Published 8 Feb 2014

 

Making CAPS-LOCK into a control key in X

Using xmodmap:

$ cat ~/.xmodmap
remove Lock = Caps_Lock
keycode 0x42 = Control_L
add Control = Control_L

$ xmodmap ~/.xmodmap
Published 8 Feb 2014
Tagged linux, debian and tool.

 

Quickly render a 'dot' (Graphviz) graph

On Linux and OSX the command:

dot -Txlib mygraph.gv

will quickly launch a lightweight window containing a dot rendering of the graph in mygraph.gv.

The rendering should automatically refresh when mygraph.gv is updated. (I've occasionally run into small glitches with this that force me to re-launch the window, but they are rare and obvious.)

The same -Txlib parameter works for the other Graphviz rendering engines, including neato, twopi, fdp, sfdp, circo, and patchwork.

Published 1 Jan 2014

 

Generate a random list of words with shuf

shuf is (in my experience) a little-known GNU utility that shuffles its input and can select random lines from a file.

For instance, the command:

shuf -n 3 /usr/share/dict/words

selects three words at random from the words dictionary.
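
Two related invocations, as a small sketch assuming the GNU coreutils version of shuf: -e shuffles the command-line arguments instead of a file, and -i draws from an integer range. The sample values are arbitrary.

```shell
# Pick one of the arguments at random (-e treats each argument as an input line)
shuf -n 1 -e red green blue

# Pick 5 distinct numbers between 1 and 100 (-i LO-HI generates the range)
shuf -n 5 -i 1-100
```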

Tagged linux, one-liner and tool.

 

Backup an SD card on Linux using 'dd'

#!/bin/bash
# Pass the device name without the /dev/ prefix, e.g. "sdb"
if [ -b "/dev/$1" ]
then
  outfile="sdcard-backup-$(date +%s).dd"
  echo "cloning /dev/$1 to $outfile"
  dd if="/dev/$1" of="$outfile"
  echo "tgz-ing $outfile"
  tar zcvf "$outfile.tgz" "$outfile"
  echo "done."
else
  echo "Usage: $0 <device>  (name under /dev, e.g. sdb)"
fi
echo "to restore, unmount(?), then use:"
echo "tar Ozxf <file> | dd of=/dev/<device>"
Tagged linux, backup and tool.

 

Find large files on Linux.

UPDATE: Reader Luc Pionchon points out that sort often supports a -h parameter that sorts by "human" numbers, hence:

$ du -h * | sort -h | tail

is probably a better alternative than any of the following (for the systems that support it).

du -h * | grep "^[0-9.]*M" | sort -n

This finds files at least 1 MB in size and then sorts them by size. Change M to G for files at least 1 GB in size.

(Caveat: files 1 GB or larger will be missed by the MB version. You can use:

du -h * | egrep "^[0-9.]*(M|G)"

to get both, but then the sort -n doesn't work quite the way we'd like.)

Of course, you could use du without the -h to get file sizes by the default block size rather than the human-readable 12.4M or 16K, etc.
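
A minimal sketch of that last variant, for systems where sort lacks -h: -k makes du report sizes in 1 KiB units, so a plain numeric sort orders the entries correctly.

```shell
# Sizes in KiB (-k), sorted numerically; the largest entries come last
du -k * | sort -n | tail
```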

Tagged linux, one-liner and tool.

 

Set monitor resolution with xrandr

$ cvt -r -v 1920 1080
# 1920x1080 59.93 Hz (CVT 2.07M9-R) hsync: 66.59 kHz; pclk: 138.50 MHz
Modeline "1920x1080R"  138.50  1920 1968 2000 2080  1080 1083 1088 1111 +hsync -vsync

$ xrandr --newmode "1920x1080R"  138.50  1920 1968 2000 2080  1080 1083 1088 1111 +hsync -vsync

$ xrandr --addmode VGA1 "1920x1080R"

$ xrandr --output VGA1 --mode "1920x1080R"

Also handy:

$ xrandr --output LVDS1 --off --output VGA1 --auto
Tagged linux, debian and tool.

 

Strip characters from a field in 'awk'

E.g., the following command strips alpha characters from the second (tab delimited) field.

awk -F"\t" '{gsub(/[A-Za-z]/,"",$2); print $2 }'
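
A worked run on made-up sample input (the records are hypothetical, for illustration only). The second command shows one way to keep the other fields: since modifying $2 makes awk rebuild the record, OFS is set so the output stays tab-delimited.

```shell
# Two tab-separated records; the second field mixes letters and digits
printf 'id1\tA1B2\nid2\tC3D4\n' | awk -F"\t" '{ gsub(/[A-Za-z]/, "", $2); print $2 }'
# prints:
#   12
#   34

# Print the whole record, with the cleaned-up second field, still tab-delimited
printf 'id1\tA1B2\nid2\tC3D4\n' | awk -F"\t" -v OFS="\t" '{ gsub(/[A-Za-z]/, "", $2); print }'
```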
Tagged linux, awk, tool and one-liner.

 

Strip characters from a string or file with 'sed'

$ echo "A1B2C3" | sed 's/[A-Z]//g'
123
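
The same expression works on a file when the file name is passed as an argument. A small sketch, using a temporary file with made-up content:

```shell
# Create a small sample file (hypothetical content, for illustration only)
tmp=$(mktemp)
printf 'A1B2C3\nX9Y8\n' > "$tmp"

# Strip uppercase letters from every line; the file itself is left unchanged
sed 's/[A-Z]//g' "$tmp"

rm -f "$tmp"
```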
Tagged linux, sed, tool and one-liner.

 

Some 'awk' basics

Extract tab delimited fields from a file:

$ awk -F"\t" '{print "field one=" $1 "; field two=" $2 }' file
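
For instance, with two made-up records piped in place of a file:

```shell
printf 'alice\t30\nbob\t25\n' | awk -F"\t" '{print "field one=" $1 "; field two=" $2 }'
# prints:
#   field one=alice; field two=30
#   field one=bob; field two=25
```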
Tagged linux, awk, tool and cheatsheet.

 

Skip the first N lines in file

using tail

To skip the first line of a file (and start piping data at the second line):

tail -n +2 <FILENAME>

More generally:

tail -n +M <FILENAME>

where M is the number of the first line you want to see (i.e., the number of lines to skip plus one).

using sed

To skip the first line of a file (and start piping data at the second line):

sed 1d <FILENAME>

More generally:

sed A,Bd <FILENAME>

when you want to exclude lines A through B from the output.
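
A quick way to check both forms, using seq 1 5 as a stand-in for a five-line file:

```shell
seq 1 5 | tail -n +3   # skip the first 2 lines: prints 3, 4, 5
seq 1 5 | sed '1,2d'   # same result using sed
seq 1 5 | sed '2,4d'   # drop lines 2 through 4: prints 1, 5
```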

Tagged linux, sed and tool.

 

List Available Fonts

To view a list of available fonts, use fc-list.

Tagged linux, debian and tool.

 

Short list of language names recognized by pygments.

pygments language identifiers I use or have had to look up at one time or another.

  • Antlr-Ruby - antlr-ruby/antlr-rb
  • awk - awk/gawk/mawk/nawk
  • Bash - bash/sh/ksh for shell scripts, console for interactive session captures
  • Clojure - clj/clojure
  • CoffeeScript - coffee-script/coffeescript
  • CSS - css
  • diff output - diff/udiff
  • Haml/Sass/Scss - haml, sass, scss
  • HTML - html
  • HTTP transcripts - http
  • JavaScript - js/javascript
  • JSON - json
  • Lisp - cl/common-lisp
  • make - make/makefile/mf, cmake, basemake, bsdmake
  • nginx configuration files - nginx
  • Postscript - postscript
  • Ruby - ruby for .rb files, irb for interactive console captures
  • Scheme - scm/scheme
  • SQL - sql, mysql, psql, postgresql-console/postgres-console, sqlite3
  • TeX/LaTeX - tex, latex
  • Text - text (the no-op highlighter)
  • XML/XSLT/XQuery - xml, xslt, xquery
  • Yaml - yaml

Also see the list of languages supported by Pygments and the list of lexers included with Pygments.


 

Launch an HTTP server serving the current directory using Python

Python's SimpleHTTPServer module (http.server in Python 3) makes it easy to launch a simple web server using the current working directory as the "docroot".

With Python 2:

python -m SimpleHTTPServer

or with Python 3:

python3 -m http.server

By default, each will bind to port 8000, hence http://localhost:8000/ will serve the top level of the working directory tree. Hit Ctrl-c to stop.

Both accept an optional port number:

python -m SimpleHTTPServer 3001

or

python3 -m http.server 3001

if you want to bind to something other than port 8000.
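
With Python 3 (3.4 or later) you can also restrict the server to the local machine using --bind. The following is a sketch that starts the server in the background, fetches the directory listing once, and stops it again. It assumes port 3001 is free, and the one-second pause is an arbitrary guess at the startup time.

```shell
# Serve the current directory on port 3001, reachable only from this machine
python3 -m http.server 3001 --bind 127.0.0.1 &
server_pid=$!
sleep 1   # give the server a moment to start listening

# Request the top-level listing using only the standard library and print the HTTP status
status=$(python3 -c 'import urllib.request; print(urllib.request.urlopen("http://127.0.0.1:3001/").status)')
echo "HTTP status: $status"

# Stop the background server
kill "$server_pid"
```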

Published 20 Feb 2014
Tagged python, http, cli, one-liner, ops and tool.

 

Backup a git repository with 'git bundle'

Run:

cd REPOSITORY_WORKING_DIRECTORY
git bundle create PATH_TO_BUNDLE.git --all

to create a single-file backup of the entire repository.

Note that the bundle file is a functional Git repository:

git clone PATH_TO_BUNDLE.git MY_PROJECT
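
It may be worth checking a bundle before relying on it as a backup. The sketch below builds a throwaway repository in a temporary directory (the paths, commit, and identity are made up for the demonstration), bundles it, and then inspects the bundle with git bundle verify and git bundle list-heads.

```shell
work=$(mktemp -d)   # scratch directory for the demonstration
git init -q "$work/repo"
git -C "$work/repo" -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "initial commit"
git -C "$work/repo" bundle create "$work/backup.bundle" --all

# verify is run from inside a repository; it checks that the bundle file is
# valid and that any commits it depends on are present in that repository
git -C "$work/repo" bundle verify "$work/backup.bundle"

# list-heads shows the refs stored in the bundle
git -C "$work/repo" bundle list-heads "$work/backup.bundle"
```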
Tagged git, backup, one-liner, ops and tool.

 

Uploading a file with curl

To submit the file at foo to a web service as multi-part form data using curl:

curl -X POST -F "file=@\"foo\"" 'https://127.0.0.1/example'

The file part is the name of the corresponding form field.

Note that you can submit multiple files by repeating -F, once per form field:

curl -X POST -F "f1=@\"foo\"" -F "f2=@\"bar\"" 'https://127.0.0.1/example'

Or add additional form fields (each with its own -F) or query string parameters:

curl -X POST -F "f1=@\"foo\"" -F "x=y" 'https://127.0.0.1/example?a=b'
Published 31 Dec 2015
Tagged curl, web and tool.

 

This page was generated at 9:56 PM on 15 Jan 2016.
Copyright © 1999 - 2016 Rodney Waldhoff.