Snippets are tiny notes I've collected for easy reference.
Find duplicate files on Linux (or OSX).
Find files that have the same size and MD5 hash (and hence are likely to be exact duplicates):
find -not -empty -type f -printf "%s\n" |  # line 1
sort -rn |  # line 2
uniq -d |  # line 3
xargs -I{} -n1 find -type f -size {}c -print0 |  # line 4
xargs -0 md5sum |  # line 5
sort |  # line 6
uniq -w32 --all-repeated=separate |  # line 7
cut -d" " -f3-  # line 8
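You probably want to redirect the output to a file, as it runs slowly. It relies on GNU extensions (such as find -printf and uniq -w), so on OSX it may need the GNU versions of these tools.
- Line 1 prints the size in bytes of every non-empty regular file.
- Line 2 sorts the sizes numerically, in descending order.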
- Line 3 strips out the lines (sizes) that only appear once.
- For each remaining size, line 4 finds all the files of that size.
- Line 5 computes the MD5 hash of every file found by line 4, outputting the hash followed by the file name. (Line 4 runs once per size; line 5 hashes the combined list.)
- Line 6 sorts that list for easy comparison.
- Line 7 compares the first 32 characters of each line (the MD5 hash) and keeps only lines whose hash is repeated, separating each group of duplicates with a blank line.
- Line 8 strips the hash, leaving the path and file name of each matching line.
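To see the stages working together, here is a minimal sketch that runs the same pipeline against a throwaway directory. The function name find_dupes, the scratch directory and the file names are all made up for illustration, and it assumes bash with GNU find, xargs, uniq and md5sum. Two small changes from the snippet above: the search root is passed explicitly, and -n1 is dropped because -I already runs one command per input line.

```shell
# Build a scratch tree (hypothetical names): two identical files,
# one same-size decoy, one file of a unique size, and one empty file.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
printf 'hello\n' > "$demo_dir/a.txt"
printf 'hello\n' > "$demo_dir/sub/b.txt"     # exact duplicate of a.txt
printf 'world\n' > "$demo_dir/c.txt"         # same size, different content
printf 'longer text\n' > "$demo_dir/d.txt"   # unique size, dropped by uniq -d
: > "$demo_dir/empty.txt"                    # empty, skipped by -not -empty

# Hypothetical wrapper around the pipeline; $1 is the directory to scan.
find_dupes() {
    find "$1" -not -empty -type f -printf "%s\n" |
    sort -rn |
    uniq -d |
    xargs -I{} find "$1" -type f -size {}c -print0 |
    xargs -0 md5sum |
    sort |
    uniq -w32 --all-repeated=separate |
    cut -d" " -f3-
}

dupes=$(find_dupes "$demo_dir")
printf '%s\n' "$dupes"
```

Only a.txt and sub/b.txt should be listed: c.txt survives the size filter but its hash differs, so line 7 drops it. To keep the results of a long run, redirect them, for example find_dupes "$HOME" > duplicates.txt (the output file name is arbitrary).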
Some alternative approaches can be found at the original source.