15 one-liner snippets
Command-line tool for spidering sites and extracting XML/HTML content
Xidel is a robust tool for spidering, extracting and transforming XML/HTML content from the command line.
It's like wget or curl with a CSS and XPath/XQuery engine (among other features) attached.
xidel doesn't seem to be in the package management repositories I normally use, but you can download it here.
The following example will (1) download a web page, (2) extract a list of links (specified via CSS selector) from it, (3) download the page corresponding to each of those links and finally (4) extract specific pieces of content (specified by CSS selectors) from each page:
xidel [URL-OF-INDEX-PAGE] \
--follow "css('[CSS-SELECTOR-FOR-LINKS]')" \
--css "[CSS-SELECTOR-FOR-SOME-TEXT]" \
--extract "inner-html(css('[CSS-SELECTOR-FOR-SOME-HTML]'))"
As a concrete example, the command:
$ xidel http://reddit.com -f "css('a')" --css title
will download every page linked from the reddit.com homepage and print the content of its title tag.
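To fill in the fuller template above, a hypothetical run (made-up URL and selectors) that follows article links and pulls out both a heading and a byline might look like:
xidel http://example.com/articles/ \
  --follow "css('a.article-link')" \
  --css "h1" \
  --extract "inner-html(css('div.byline'))"
For each followed page, this would print the text of its h1 followed by the inner HTML of its div.byline.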
There are several more examples on the Xidel site.
Backup or mirror a website using wget
To create a local mirror or backup of a website with wget, run:
wget -r -l 5 -k -w 1 --random-wait <URL>
Where:
- -r (or --recursive) will cause wget to recursively download files
- -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5; use inf for infinite recursion)
- -k (or --convert-links) will cause wget to convert links in the downloaded documents so that the files can be viewed locally
- -w N (or --wait=N) will cause wget to wait N seconds between requests
- --random-wait will cause wget to randomly vary the wait time between 0.5x and 1.5x the value specified by --wait
Some additional notes:
- --mirror (or -m) can be used as a shortcut for -r -N -l inf --no-remove-listing, which enables infinite recursion and preserves both the server timestamps and FTP directory listings.
- -np (or --no-parent) can be used to limit wget to files below a specific "directory" (path).
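Putting those pieces together, a mirror of one section of a (hypothetical) site might look like:
wget -m -k -np -w 1 --random-wait http://example.com/docs/
Here -m handles the recursion and timestamps, -k fixes up links for local viewing, -np keeps wget from wandering above /docs/, and the wait options keep the crawl polite.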
Pre-generate pages or load a web cache using wget
Many web frameworks and template engines will defer generating the HTML version of a document until the first time it is accessed. This can make the first hit on a given page significantly slower than subsequent hits.
You can use wget to pre-cache web pages using a command such as:
wget -r -l 3 -nd --delete-after <URL>
Where:
- -r (or --recursive) will cause wget to recursively download files
- -l N (or --level=N) will limit recursion to at most N levels below the root document (defaults to 5; use inf for infinite recursion)
- -nd (or --no-directories) will prevent wget from creating local directories to match the server-side paths
- --delete-after will cause wget to delete each file as soon as it is downloaded (so the command leaves no traces behind)
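If you run this from cron, adding -q (--quiet) keeps it from generating output; with a hypothetical URL:
wget -q -r -l 3 -nd --delete-after http://example.com/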
Mapping port 80 to port 3000 using iptables
Port numbers less than 1024 are considered "privileged" ports, and you generally must be root to bind a listener to them.
Rather than running a network application as root, map the privileged port to a non-privileged one:
sudo iptables -A PREROUTING -t nat -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000
Now requests to port 80 will be forwarded on to port 3000.
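If you later want to inspect or remove that redirect, the rule lives in the PREROUTING chain of the nat table; for example (the rule number 1 here is just illustrative):
sudo iptables -t nat -L PREROUTING --line-numbers
sudo iptables -t nat -D PREROUTING 1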
Quickly render a 'dot' (Graphviz) graph
On Linux and OSX the command:
dot -Txlib mygraph.gv
will quickly launch a lightweight window containing a dot rendering of the graph in mygraph.gv.
The rendering should automatically refresh when mygraph.gv is updated. (I've occasionally run into small glitches with this that force me to re-launch the window, but they are rare and obvious.)
The same -Txlib parameter works for the other Graphviz rendering engines, including neato, twopi, fdp, sfdp, circo, and patchwork.
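If you don't have a graph file handy, a minimal one is enough to try this out (the file name and contents below are just an example):
cat > mygraph.gv <<'EOF'
digraph G {
  a -> b;
  b -> c;
  a -> c;
}
EOF
dot -Txlib mygraph.gv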
Use 'less -S' for horizontal scrolling
The flag -S (or --chop-long-lines) will cause less to truncate lines at the screen (terminal) boundary, rather than wrapping as it does by default. You can then scroll horizontally (with the arrow keys, for example) to view the full lines when needed.
cat some_file_with_very_long_lines | less -S
Find duplicate files on Linux (or OSX).
Find files that have the same size and MD5 hash (and hence are likely to be exact duplicates):
find -not -empty -type f -printf "%s\n" | \ # line 1
sort -rn | \ # line 2
uniq -d | \ # line 3
xargs -I{} -n1 find -type f -size {}c -print0 | \ # line 4
xargs -0 md5sum | \ # line 5
sort | \ # line 6
uniq -w32 --all-repeated=separate | \ # line 7
cut -d" " -f3- # line 8
You probably want to redirect the output to a file, as this runs slowly.
- Line 1 enumerates the non-empty regular files, printing the size of each.
- Line 2 sorts the sizes numerically, in descending order.
- Line 3 strips out the lines (sizes) that only appear once.
- For each remaining size, line 4 finds all the files of that size.
- Line 5 computes the MD5 hash for all the files found in line 4, outputting the MD5 hash and file name. (This is repeated for each set of files of a given size.)
- Line 6 sorts that list for easy comparison.
- Line 7 compares the first 32 characters of each line (the MD5 hash) to find duplicates.
- Line 8 spits out the file name and path part of the matching lines.
Some alternative approaches can be found at the original source.
Find large files on Linux.
UPDATE: Reader Luc Pionchon points out that sort often supports a -h parameter that sorts by "human" numbers, hence:
$ du -h * | sort -h | tail
is probably a better alternative than any of the following (for the systems that support it).
du -h * | grep "^[0-9.]*M" | sort -n
This finds files at least 1 MB in size and then sorts them by size. Change M to G for files at least 1 GB in size.
(Caveat: files 1 GB or larger will be missed by the MB version. You can use:
du -h * | egrep "^[0-9.]*(M|G)"
to get both, but then the sort -n doesn't work quite the way we'd like.)
Of course, you could use du without the -h to get file sizes in the default block size rather than in human-readable form (12.4M, 16K, etc.).
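If you want to consider individual files anywhere under the current directory, rather than just the top-level entries, the same "human" sorting idea works with du -a (assuming GNU du and sort):
du -ah . | sort -rh | head -n 20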
Python one-liner for reading a CSV file into a JSON array of arrays
Reading a CSV file into a 2-d Python array (an array of arrays):
import csv
array = list(csv.reader(open("MYFILE.csv")))
Dumping that as JSON (via the command-line):
$ python -c "import json,csv;print json.dumps(list(csv.reader(open('CSV-FILENAME'))))"
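Note that the print statement there is Python 2 syntax; under Python 3 the same one-liner needs print as a function:
$ python3 -c "import json,csv;print(json.dumps(list(csv.reader(open('CSV-FILENAME')))))"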
Launch an HTTP server serving the current directory using Python
The Python SimpleHTTPServer module makes it easy to launch a simple web server using the current working directory as the "docroot".
With Python 2:
python -m SimpleHTTPServer
or with Python 3:
python3 -m http.server
By default, each will bind to port 8000, hence http://localhost:8000/ will serve the top level of the working directory tree. Hit Ctrl-c to stop.
Both accept an optional port number:
python -m SimpleHTTPServer 3001
or
python3 -m http.server 3001
if you want to bind to something other than port 8000.
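With Python 3 (3.4 and later, if memory serves) you can also restrict the server to a specific interface, say the loopback address so it's only reachable locally, using --bind:
python3 -m http.server 3001 --bind 127.0.0.1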