19 tool snippets
Command-line tool for spidering sites and extracting XML/HTML content
Xidel is a robust tool for spidering, extracting and transforming XML/HTML content from the command line.
It's like wget or curl with a CSS and XPath/XQuery engine (among other features) attached.
Xidel doesn't seem to be in the package management repositories I normally use, but you can download it here.
The following example will (1) download a web page, (2) extract a list of links (specified via CSS selector) from it, (3) download the page corresponding to each of those links and finally (4) extract specific pieces of content (specified by CSS selectors) from each page:
xidel [URL-OF-INDEX-PAGE] \
--follow "css('[CSS-SELECTOR-FOR-LINKS]')" \
--css "[CSS-SELECTOR-FOR-SOME-TEXT]" \
--extract "inner-html(css('[CSS-SELECTOR-FOR-SOME-HTML]'))"
As a concrete example, the command:
$ xidel http://reddit.com -f "css('a')" --css title
will download every page linked from the reddit.com homepage and print the content of its title tag.
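A sketch of the full three-step form above, with a hypothetical site and selectors (substitute ones that match whatever you are actually scraping):
xidel https://example.com/articles/ \
  --follow "css('a.article-link')" \
  --extract "inner-html(css('h1.title'))"
This fetches the index page, follows each link matching a.article-link, and prints the inner HTML of every linked page's h1.title element.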
There are several more examples on the Xidel site.
Backup or mirror a website using wget
To create a local mirror or backup of a website with wget, run:
wget -r -l 5 -k -w 1 --random-wait <URL>
Where:
- -r (or --recursive) will cause wget to recursively download files
- -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5, use inf for infinite recursion)
- -k (or --convert-links) will cause wget to convert links in the downloaded documents so that the files can be viewed locally
- -w N (or --wait=N) will cause wget to wait N seconds between requests
- --random-wait will cause wget to randomly vary the wait time from 0.5x to 1.5x the value specified by --wait
Some additional notes:
- --mirror (or -m) can be used as a shortcut for -r -N -l inf --no-remove-listing, which enables infinite recursion and preserves both the server timestamps and FTP directory listings
- -np (or --no-parent) can be used to limit wget to files below a specific "directory" (path)
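For instance, combining these with the options above (the URL is just a placeholder):
wget -m -np -k -w 1 --random-wait https://example.com/docs/
mirrors everything at or below /docs/, rewrites links for local viewing, and waits politely between requests.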
Pre-generate pages or load a web cache using wget
Many web frameworks and template engines defer generating the HTML version of a document until the first time it is accessed. This can make the first hit on a given page significantly slower than subsequent hits.
You can use wget to pre-cache web pages using a command such as:
wget -r -l 3 -nd --delete-after <URL>
Where:
- -r (or --recursive) will cause wget to recursively download files
- -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5, use inf for infinite recursion)
- -nd (or --no-directories) will prevent wget from creating local directories to match the server-side paths
- --delete-after will cause wget to delete each file as soon as it is downloaded (so the command leaves no traces behind)
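If the server is slow or easily overloaded, the wait options from the previous snippet combine naturally with this one (again, the URL is a placeholder):
wget -r -l 3 -nd --delete-after -w 1 --random-wait https://example.com/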
Mapping port 80 to port 3000 using iptables
Port numbers less than 1024 are considered "privileged" ports, and you generally must be root to bind a listener to them.
Rather than running a network application as root, map the privileged port to a non-privileged one:
sudo iptables -A PREROUTING -t nat -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000
Now requests to port 80 will be forwarded on to port 3000.
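To review or undo the redirect later, list the NAT rules with line numbers and delete by number:
sudo iptables -t nat -L PREROUTING --line-numbers -n
sudo iptables -t nat -D PREROUTING <rule-number>
Note that PREROUTING only sees traffic arriving on the interface, so connections made from the machine itself to localhost:80 are not affected by this rule.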
Quickly render a 'dot' (Graphviz) graph
On Linux and OSX the command:
dot -Txlib mygraph.gv
will quickly launch a lightweight window containing a dot rendering of the graph in mygraph.gv.
The rendering should automatically refresh when mygraph.gv is updated. (I've occasionally run into small glitches with this that force me to re-launch the window, but they are rare and obvious.)
The same -Txlib parameter works for the other Graphviz rendering engines, including neato, twopi, fdp, sfdp, circo, and patchwork.
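If you don't already have a graph to hand, a minimal mygraph.gv can be created straight from the shell (the graph itself is just a throwaway example):
cat > mygraph.gv <<'EOF'
digraph G {
  a -> b;
  b -> c;
  a -> c;
}
EOF
dot -Txlib mygraph.gv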
Backup an SD card on Linux using 'dd'
#!/bin/bash
# Clone an SD card (block device) to a timestamped image file, then compress it.
# Expects the bare device name (e.g. "sdb") and generally needs to run as root.
if [ -b "/dev/$1" ]
then
  outfile="sdcard-backup-`date +"%s"`.dd"
  echo "cloning /dev/$1 to $outfile"
  dd if="/dev/$1" of="$outfile"
  echo "tgz-ing $outfile"
  tar zcvf "$outfile.tgz" "$outfile"
  echo "done."
else
  echo "Usage: $0 <device>   (e.g. $0 sdb)"
fi
echo "to restore, unmount(?), then use:"
echo "tar Ozxf <file> | dd of=<device>"
Find large files on Linux
UPDATE: Reader Luc Pionchon points out that sort often supports a -h parameter that sorts by "human" numbers, hence:
$ du -h * | sort -h | tail
is probably a better alternative than any of the following (for the systems that support it).
du -h * | grep "^[0-9.]*M" | sort -n
This finds files at least 1 MB in size and then sorts them by size. Change M to G for files at least 1 GB in size.
(Caveat: files 1 GB or larger will be missed by the MB version. You can use:
du -h * | egrep "^[0-9.]*(M|G)"
to get both, but then the sort -n doesn't work quite the way we'd like.)
Of course, you could use du without the -h to get file sizes by the default block size rather than the human-readable 12.4M or 16K, etc.
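Another option is find, which filters strictly by file size (the 100M threshold here is just an example):
find . -type f -size +100M -exec ls -lh {} +
or, on systems whose sort supports -h, listing the 20 largest entries under the current directory:
du -ah . | sort -rh | head -n 20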
Set monitor resolution with xrandr
$ cvt -r -v 1920 1080
# 1920x1080 59.93 Hz (CVT 2.07M9-R) hsync: 66.59 kHz; pclk: 138.50 MHz
Modeline "1920x1080R" 138.50 1920 1968 2000 2080 1080 1083 1088 1111 +hsync -vsync
$ xrandr --newmode "1920x1080R" 138.50 1920 1968 2000 2080 1080 1083 1088 1111 +hsync -vsync
$ xrandr --addmode VGA1 "1920x1080R"
$ xrandr --output VGA1 --mode "1920x1080R"
Also handy:
$ xrandr --output LVDS1 --off --output VGA1 --auto
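If the custom mode turns out to be wrong, the corresponding clean-up commands remove it again:
$ xrandr --delmode VGA1 "1920x1080R"
$ xrandr --rmmode "1920x1080R"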
Skip the first N lines in a file
using tail
To skip the first line of a file (and start piping data at the second line):
tail -n +2 <FILENAME>
More generally:
tail -n +M <FILENAME>
where M is the number of the first line you want to see (i.e., the number of lines to skip plus one).
using sed
To skip the first line of a file (and start piping data at the second line):
sed 1d <FILENAME>
More generally:
sed A,Bd <FILENAME>
when you want to exclude lines A through B from the output.
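An awk equivalent, in case awk is already in your pipeline (M is again the number of the first line you want to keep):
awk 'NR >= M' <FILENAME>
or, to skip just the first line:
awk 'NR > 1' <FILENAME>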
Short list of language names recognized by Pygments
Pygments language identifiers I use or have had to look up at one time or another.
- Antlr-Ruby - antlr-ruby/antlr-rb
- awk - awk/gawk/mawk/nawk
- Bash - bash/sh/ksh for shell scripts, console for interactive session captures
- Clojure - clj/clojure
- CoffeeScript - coffee-script/coffeescript
- CSS - css
- diff output - diff/udiff
- Haml/Sass/Scss - haml, sass, scss
- HTML - html
- HTTP transcripts - http
- JavaScript - js/javascript
- JSON - json
- Lisp - cl/common-lisp
- make - make/makefile/mf, cmake, basemake, bsdmake
- nginx configuration files - nginx
- Postscript - postscript
- Ruby - ruby for .rb files, irb for interactive console captures
- Scheme - scm/scheme
- SQL - sql, mysql, psql, postgresql-console/postgres-console, sqlite3
- TeX/LaTeX - tex, latex
- Text - text (the no-op highlighter)
- XML/XSLT/XQuery - xml, xslt, xquery
- Yaml - yaml
Also see the list of languages supported by Pygments and the list of lexers included with Pygments.
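These identifiers are what you pass to pygmentize via -l. For example, to render a Ruby file as a standalone HTML page (foo.rb and foo.html are placeholder names):
pygmentize -l ruby -f html -O full -o foo.html foo.rb
Running pygmentize -L lexers prints the complete list of lexers and their aliases.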
Launch an HTTP server serving the current directory using Python
The Python SimpleHTTPServer module makes it easy to launch a simple web server using the current working directory as the "docroot".
With Python 2:
python -m SimpleHTTPServer
or with Python 3:
python3 -m http.server
By default, each will bind to port 8000, hence http://localhost:8000/ will serve the top level of the working directory tree. Hit Ctrl-C to stop.
Both accept an optional port number:
python -m SimpleHTTPServer 3001
or
python3 -m http.server 3001
if you want to bind to something other than port 8000.
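On Python 3.4 and later, http.server also accepts a --bind option, which is handy if you only want the server reachable from the local machine:
python3 -m http.server 3001 --bind 127.0.0.1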
Uploading a file with curl
To submit the file at path foo to a web service as multipart form data using curl:
curl -X POST -F "file=@foo" 'https://127.0.0.1/example'
Here file (the part before the =) is the name of the corresponding form field.
Note that you can submit multiple files by repeating -F, once per field:
curl -X POST -F "f1=@foo" -F "f2=@bar" 'https://127.0.0.1/example'
Or add additional body or query string parameters:
curl -X POST -F "f1=@foo" -F "x=y" 'https://127.0.0.1/example?a=b'
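curl can also override the MIME type and reported file name of an uploaded part, which some services require (the values here are just illustrative):
curl -X POST -F "file=@foo;type=application/json;filename=data.json" 'https://127.0.0.1/example'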
