15 one-liner snippets
Command-line tool for spidering sites and extracting XML/HTML content
Xidel is a robust tool for spidering, extracting and transforming XML/HTML content from the command line.
It's like wget or curl with a CSS and XPath/XQuery engine (among other features) attached.
xidel doesn't seem to be in the package management repositories I normally use, but you can download it here.
The following example will (1) download a web page, (2) extract a list of links (specified via CSS selector) from it, (3) download the page corresponding to each of those links and finally (4) extract specific pieces of content (specified by CSS selectors) from each page:
xidel [URL-OF-INDEX-PAGE] \
--follow "css('[CSS-SELECTOR-FOR-LINKS]')" \
--css "[CSS-SELECTOR-FOR-SOME-TEXT]" \
--extract "inner-html(css('[CSS-SELECTOR-FOR-SOME-HTML]'))"
As a concrete example, the command:
$ xidel http://reddit.com -f "css('a')" --css title
will download every page linked from the reddit.com homepage and print the content of its title tag.
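To fill in the fuller template above, a hypothetical run (made-up URL and selectors) that follows article links and pulls out both a heading and a byline might look like:
xidel http://example.com/articles/ \
  --follow "css('a.article-link')" \
  --css "h1" \
  --extract "inner-html(css('div.byline'))"
For each followed page, this would print the text of its h1 followed by the inner HTML of its div.byline.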
There are several more examples on the Xidel site.
Backup or mirror a website using wget
To create a local mirror or backup of a website with wget, run:
wget -r -l 5 -k -w 1 --random-wait <URL>
Where:
- -r (or --recursive) will cause wget to recursively download files
- -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5; use inf for infinite recursion)
- -k (or --convert-links) will cause wget to convert links in the downloaded documents so that the files can be viewed locally
- -w N (or --wait=N) will cause wget to wait N seconds between requests
- --random-wait will cause wget to randomly vary the wait time between 0.5x and 1.5x the value specified by --wait
Some additional notes:
- --mirror (or -m) can be used as a shortcut for -r -N -l inf --no-remove-listing, which enables infinite recursion and preserves both the server timestamps and FTP directory listings.
- -np (or --no-parent) can be used to limit wget to files below a specific "directory" (path).
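Putting those pieces together, a mirror of one section of a (hypothetical) site might look like:
wget -m -k -np -w 1 --random-wait http://example.com/docs/
Here -m handles the recursion and timestamps, -k fixes up links for local viewing, -np keeps wget from wandering above /docs/, and the wait options keep the crawl polite.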
Pre-generate pages or load a web cache using wget
Many web frameworks and template engines will defer generating the HTML version of a document until the first time it is accessed. This can make the first hit on a given page significantly slower than subsequent hits.
You can use wget to pre-cache web pages using a command such as:
wget -r -l 3 -nd --delete-after <URL>
Where:
- -r (or --recursive) will cause wget to recursively download files
- -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5; use inf for infinite recursion)
- -nd (or --no-directories) will prevent wget from creating local directories to match the server-side paths
- --delete-after will cause wget to delete each file as soon as it is downloaded (so the command leaves no traces behind)
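If you run this from cron, adding -q (--quiet) keeps it from generating output; with a hypothetical URL:
wget -q -r -l 3 -nd --delete-after http://example.com/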
Mapping port 80 to port 3000 using iptables
Port numbers less than 1024 are considered "privileged" ports, and you generally must be root to bind a listener to them.
Rather than running a network application as root, map the privileged port to a non-privileged one:
sudo iptables -A PREROUTING -t nat -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000
Now requests to port 80 will be forwarded on to port 3000.
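If you later want to inspect or remove that redirect, the rule lives in the PREROUTING chain of the nat table; for example (the rule number 1 here is just illustrative):
sudo iptables -t nat -L PREROUTING --line-numbers
sudo iptables -t nat -D PREROUTING 1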
Quickly render a 'dot' (Graphviz) graph
On Linux and OSX the command:
dot -Txlib mygraph.gv
will quickly launch a lightweight window containing a dot rendering of the graph in mygraph.gv.
The rendering should automatically refresh when mygraph.gv is updated. (I've occasionally run into small glitches with this that force me to re-launch the window, but they are rare and obvious.)
The same -Txlib parameter works for the other Graphviz rendering engines, including neato, twopi, fdp, sfdp, circo, and patchwork.
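If you don't have a graph file handy, a minimal one is enough to try this out (the file name and contents below are just an example):
cat > mygraph.gv <<'EOF'
digraph G {
  a -> b;
  b -> c;
  a -> c;
}
EOF
dot -Txlib mygraph.gv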
Use 'less -S' for horizontal scrolling
The flag -S (or --chop-long-lines) will cause less to truncate lines at the screen (terminal) boundary, rather than wrapping as it does by default. You can then scroll horizontally (with the arrow keys, for example) to view the full lines when needed.
cat some_file_with_very_long_lines | less -S
Find duplicate files on Linux (or OSX).
Find files that have the same size and MD5 hash (and hence are likely to be exact duplicates):
find -not -empty -type f -printf "%s\n" | \ # line 1
sort -rn | \ # line 2
uniq -d | \ # line 3
xargs -I{} -n1 find -type f -size {}c -print0 | \ # line 4
xargs -0 md5sum | \ # line 5
sort | \ # line 6
uniq -w32 --all-repeated=separate | \ # line 7
cut -d" " -f3- # line 8
You probably want to redirect the output to a file, as this runs slowly.
- Line 1 enumerates the non-empty regular files, printing the size of each.
- Line 2 sorts the sizes numerically, in descending order.
- Line 3 strips out the lines (sizes) that only appear once.
- For each remaining size, line 4 finds all the files of that size.
- Line 5 computes the MD5 hash for all the files found in line 4, outputting the MD5 hash and file name. (This is repeated for each set of files of a given size.)
- Line 6 sorts that list for easy comparison.
- Line 7 compares the first 32 characters of each line (the MD5 hash) to find duplicates.
- Line 8 spits out the file name and path part of the matching lines.
Some alternative approaches can be found at the original source.
Find large files on Linux.
UPDATE: Reader Luc Pionchon points out that sort often supports a -h parameter that sorts by "human" numbers, hence:
$ du -h * | sort -h | tail
is probably a better alternative than any of the following (for the systems that support it).
du -h * | grep "^[0-9.]*M" | sort -n
This finds files at least 1 MB in size and then sorts them by size. Change M to G for files at least 1 GB in size.
(Caveat: files 1 GB or larger will be missed by the MB version. You can use:
du -h * | egrep "^[0-9.]*(M|G)"
to get both, but then the sort -n doesn't work quite the way we'd like.)
Of course, you could use du without the -h to get file sizes in the default block size rather than in human-readable form (12.4M, 16K, etc.).
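If you want to consider individual files anywhere under the current directory, rather than just the top-level entries, the same "human" sorting idea works with du -a (assuming GNU du and sort):
du -ah . | sort -rh | head -n 20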
Python one-liner for reading a CSV file into a JSON array of arrays
Reading a CSV file into a 2-d Python array (an array of arrays):
import csv
array = list(csv.reader(open("MYFILE.csv")))
Dumping that as JSON (via the command-line):
$ python -c "import json,csv;print json.dumps(list(csv.reader(open('CSV-FILENAME'))))"
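Note that the print statement there is Python 2 syntax; under Python 3 the same one-liner needs print as a function:
$ python3 -c "import json,csv;print(json.dumps(list(csv.reader(open('CSV-FILENAME')))))"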
Launch an HTTP server serving the current directory using Python
The Python SimpleHTTPServer module makes it easy to launch a simple web server using the current working directory as the "docroot".
With Python 2:
python -m SimpleHTTPServer
or with Python 3:
python3 -m http.server
By default, each will bind to port 8000, hence http://localhost:8000/ will serve the top level of the working directory tree. Hit Ctrl-c to stop.
Both accept an optional port number:
python -m SimpleHTTPServer 3001
or
python3 -m http.server 3001
if you want to bind to something other than port 8000.
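With Python 3 (3.4 and later, if memory serves) you can also restrict the server to a specific interface, say the loopback address so it's only reachable locally, using --bind:
python3 -m http.server 3001 --bind 127.0.0.1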