19 tool snippets
Command-line tool for spidering sites and extracting XML/HTML content
Xidel is a robust tool for spidering, extracting and transforming XML/HTML content from the command line.
It's like wget or curl with a CSS and XPath/XQuery engine (among other features) attached.
xidel doesn't seem to be in the package management repositories I normally use, but you can download it here.
The following example will (1) download a web page, (2) extract a list of links (specified via CSS selector) from it, (3) download the page corresponding to each of those links and finally (4) extract specific pieces of content (specified by CSS selectors) from each page:
xidel [URL-OF-INDEX-PAGE] \
--follow "css('[CSS-SELECTOR-FOR-LINKS]')" \
--css "[CSS-SELECTOR-FOR-SOME-TEXT]" \
--extract "inner-html(css('[CSS-SELECTOR-FOR-SOME-HTML]'))"
As a concrete example, the command:
$ xidel http://reddit.com -f "css('a')" --css title
will download every page linked from the reddit.com homepage and print the content of its title tag.
There are several more examples on the Xidel site.
Backup or mirror a website using wget
To create a local mirror or backup of a website with wget, run:
wget -r -l 5 -k -w 1 --random-wait <URL>
Where:
- -r (or --recursive) will cause wget to recursively download files
- -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5, use inf for infinite recursion)
- -k (or --convert-links) will cause wget to convert links in the downloaded documents so that the files can be viewed locally
- -w N (or --wait=N) will cause wget to wait N seconds between requests
- --random-wait will cause wget to randomly vary the wait time from 0.5x to 1.5x the value specified by --wait
Some additional notes:
- --mirror (or -m) can be used as a shortcut for -r -N -l inf --no-remove-listing, which enables infinite recursion and preserves both the server timestamps and FTP directory listings.
- -np (or --no-parent) can be used to limit wget to files below a specific "directory" (path); both flags are combined in the example below.
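For example, a mirror that combines these options might look like the following (example.com/docs/ is just a stand-in for the section of the site you actually want):
wget -m -np -k -w 1 --random-wait http://example.com/docs/
This downloads everything at or below /docs/, rewrites links for local browsing, and throttles requests to be polite to the server.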
Pre-generate pages or load a web cache using wget
Many web frameworks and template engines defer generating the HTML version of a document until the first time it is accessed. This can make the first hit on a given page significantly slower than subsequent hits.
You can use wget to pre-cache web pages using a command such as:
wget -r -l 3 -nd --delete-after <URL>
Where:
- -r (or --recursive) will cause wget to recursively download files
- -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5, use inf for infinite recursion)
- -nd (or --no-directories) will prevent wget from creating local directories to match the server-side paths
- --delete-after will cause wget to delete each file as soon as it is downloaded (so the command leaves no traces behind)
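If you want to keep the cache warm automatically, one option (a sketch; the schedule and URL are placeholders, and -q just silences wget's output) is to run the same command from cron, e.g. hourly:
0 * * * * wget -r -l 3 -nd --delete-after -q http://example.com/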
Mapping port 80 to port 3000 using iptables
Port numbers less than 1024 are considered "privileged" ports, and you generally must be root to bind a listener to them.
Rather than running a network application as root, map the privileged port to a non-privileged one:
sudo iptables -A PREROUTING -t nat -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000
Now requests to port 80 will be forwarded on to port 3000.
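To double-check (or later remove) the redirect, something like the following should work:
sudo iptables -t nat -L PREROUTING --line-numbers -n
sudo iptables -t nat -D PREROUTING <RULE-NUMBER>
The first command lists the NAT PREROUTING rules with their positions; the second deletes the rule at the given position.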
Quickly render a 'dot' (Graphviz) graph
On Linux and OSX the command:
dot -Txlib mygraph.gv
will quickly launch a lightweight window containing a dot rendering of the graph in mygraph.gv.
The rendering should automatically refresh when mygraph.gv is updated. (I've occasionally run into small glitches with this that force me to re-launch the window, but they are rare and obvious.)
The same -Txlib parameter works for the other Graphviz rendering engines, including neato, twopi, fdp, sfdp, circo, and patchwork.
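If you don't have a graph handy to try this with, the following creates a trivial one (the file name just matches the example above; the graph itself is arbitrary):
cat > mygraph.gv <<'EOF'
digraph G {
  a -> b;
  b -> c;
  c -> a;
}
EOF
dot -Txlib mygraph.gv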
Backup an SD card on Linux using 'dd'
#!/bin/bash
# Clone an SD card (block device) to a timestamped image file, then compress it.
# Expects the bare device name (e.g., sdb), not the full /dev path.
if [ -b "/dev/$1" ]
then
  outfile="sdcard-backup-$(date +%s).dd"
  echo "cloning /dev/$1 to $outfile"
  dd if="/dev/$1" of="$outfile"
  echo "tgz-ing $outfile"
  tar zcvf "$outfile.tgz" "$outfile"
  echo "done."
else
  echo "Usage: $0 <device> (e.g., sdb)"
fi
echo "to restore, unmount(?), then use:"
echo "tar Ozxf <file> | dd of=<device>"
Find large files on Linux.
UPDATE: Reader Luc Pionchon points out that sort often supports a -h parameter that sorts by "human" numbers, hence:
$ du -h * | sort -h | tail
is probably a better alternative than any of the following (for the systems that support it).
du -h * | grep "^[0-9.]*M" | sort -n
This finds files at least 1 MB in size and then sorts them by size. Change M to G for files at least 1 GB in size.
(Caveat: files 1 GB or larger will be missed by the MB version. You can use:
du -h * | egrep "^[0-9.]*(M|G)"
to get both, but then the sort -n doesn't work quite the way we'd like.)
Of course, you could use du without the -h to get file sizes in units of the default block size rather than human-readable values like 12.4M or 16K.
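If your sort doesn't support -h, a workaround (just a sketch, not the only option) is to sort on raw kilobyte counts and skip the human-readable formatting entirely:
du -sk * | sort -n | tail
du -k reports sizes in kilobytes (and -s keeps the report to one line per argument), so a plain numeric sort does the right thing.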
Set monitor resolution with xrandr
$ cvt -r -v 1920 1080
# 1920x1080 59.93 Hz (CVT 2.07M9-R) hsync: 66.59 kHz; pclk: 138.50 MHz
Modeline "1920x1080R" 138.50 1920 1968 2000 2080 1080 1083 1088 1111 +hsync -vsync
$ xrandr --newmode "1920x1080R" 138.50 1920 1968 2000 2080 1080 1083 1088 1111 +hsync -vsync
$ xrandr --addmode VGA1 "1920x1080R"
$ xrandr --output VGA1 --mode "1920x1080R"
Also handy, for turning off the laptop panel (LVDS1) and letting the external output (VGA1) auto-configure:
$ xrandr --output LVDS1 --off --output VGA1 --auto
Skip the first N lines in a file
using tail
To skip the first line of a file (and start piping data at the second line):
tail -n +2 <FILENAME>
More generally:
tail -n +M <FILENAME>
where M is the number of the first line you want to see (i.e., the number of lines to skip plus one).
using sed
To skip the first line of a file (and start piping data at the second line):
sed 1d <FILENAME>
More generally:
sed A,Bd <FILENAME>
when you want to exclude lines A through B from the output.
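As a concrete case, to drop a 10-line header from a file (data.csv is just a stand-in name), either of these works:
tail -n +11 data.csv
sed 1,10d data.csv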
Short list of language names recognized by pygments.
pygments language identifiers I use or have had to look up at one time or another:
- Antlr-Ruby - antlr-ruby / antlr-rb
- awk - awk / gawk / mawk / nawk
- Bash - bash / sh / ksh for shell scripts, console for interactive session captures
- Clojure - clj / clojure
- CoffeeScript - coffee-script / coffeescript
- CSS - css
- diff output - diff / udiff
- Haml/Sass/Scss - haml, sass, scss
- HTML - html
- HTTP transcripts - http
- JavaScript - js / javascript
- JSON - json
- Lisp - cl / common-lisp
- make - make / makefile / mf, cmake, basemake, bsdmake
- nginx configuration files - nginx
- Postscript - postscript
- Ruby - ruby for .rb files, irb for interactive console captures
- Scheme - scm / scheme
- SQL - sql, mysql, psql, postgresql-console / postgres-console, sqlite3
- TeX/LaTeX - tex, latex
- Text - text (the no-op highlighter)
- XML/XSLT/XQuery - xml, xslt, xquery
- Yaml - yaml
Also see the list of languages supported by Pygments and the list of lexers included with Pygments.
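These identifiers are what you pass to pygmentize via -l when the lexer can't be guessed from the file extension. For example (session.txt is a made-up name for a captured shell session):
pygmentize -l console -f html -O full -o session.html session.txt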
Launch an HTTP server serving the current directory using Python
The Python SimpleHTTPServer module makes it easy to launch a simple web server using the current working directory as the "docroot".
With Python 2:
python -m SimpleHTTPServer
or with Python 3:
python3 -m http.server
By default, each will bind to port 8000, hence http://localhost:8000/ will serve the top level of the working directory tree. Hit Ctrl-c to stop.
Both accept an optional port number:
python -m SimpleHTTPServer 3001
or
python3 -m http.server 3001
if you want to bind to something other than port 8000.
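The Python 3 version also accepts a --bind (or -b) option (added in Python 3.4) if you want to restrict the listener to a particular interface, e.g. localhost only:
python3 -m http.server 3001 --bind 127.0.0.1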
Uploading a file with curl
To submit the file at foo to a web service as multi-part form data using curl:
curl -X POST -F "file=@\"foo\"" 'https://127.0.0.1/example'
The file part is the name of the corresponding form field.
Note that you can submit multiple files by repeating -F:
curl -X POST -F "f1=@foo" -F "f2=@bar" 'https://127.0.0.1/example'
Or add additional body or query string parameters (non-file form fields get their own -F; query string parameters stay in the URL):
curl -X POST -F "f1=@foo" -F "x=y" 'https://127.0.0.1/example?a=b'