Find duplicate files on Linux (or OSX).

Find files that have the same size and MD5 hash (and hence are likely to be exact duplicates):

find . -not -empty -type f -printf "%s\n" |         # line 1
  sort -rn |                                        # line 2
  uniq -d |                                         # line 3
  xargs -I{} -n1 find . -type f -size {}c -print0 | # line 4
  xargs -0 md5sum |                                 # line 5
  sort |                                            # line 6
  uniq -w32 --all-repeated=separate |               # line 7
  cut -d" " -f3-                                    # line 8

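You will probably want to redirect the output to a file, as the pipeline runs slowly on large trees. It relies on GNU options (find -printf, uniq -w and --all-repeated), so on OSX it needs the GNU versions of those tools.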
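One way to capture the output, sketched here under the assumption that GNU find, xargs, md5sum and uniq are available, is to wrap the pipeline in a shell function and redirect it. The function name find_dupes, its directory argument and the output path are placeholders for illustration, not part of the original one-liner.

```shell
# Hypothetical wrapper: find_dupes DIR prints groups of duplicate files under DIR.
# Assumes GNU find, xargs, md5sum and uniq.
find_dupes() {
  local dir="${1:-.}"                                  # default to the current directory
  find "$dir" -not -empty -type f -printf "%s\n" |     # sizes of non-empty regular files
    sort -rn |                                         # numeric sort, largest first
    uniq -d |                                          # sizes that occur more than once
    xargs -I{} find "$dir" -type f -size {}c -print0 | # files of exactly each such size
    xargs -0 md5sum |                                  # hash every candidate
    sort |                                             # put equal hashes on adjacent lines
    uniq -w32 --all-repeated=separate |                # keep lines whose hash repeats
    cut -d" " -f3-                                     # drop the hash, keep the path
}

# Example invocation (commented out so that pasting the block only defines the function):
# find_dupes ~/Documents > /tmp/duplicates.txt
```

Writing the output file somewhere outside the scanned tree keeps it from showing up among the files being compared.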

  1. Line 1 prints the size in bytes of every non-empty regular file.
  2. Line 2 sorts the sizes numerically, largest first.
  3. Line 3 keeps one copy of each size that appears more than once and drops the sizes that appear only once.
  4. For each remaining size, line 4 finds all the files of exactly that size.
  5. Line 5 computes the MD5 hash of every file found by line 4, printing the hash followed by the file name. (Line 4 runs find once per size; the hashes for all sizes end up in a single list.)
  6. Line 6 sorts that list so that identical hashes end up on adjacent lines.
  7. Line 7 compares the first 32 characters of each line (the MD5 hash) and prints every line whose hash occurs more than once, separating the groups with a blank line.
  8. Line 8 strips the hash and prints just the path and file name of each matching line.
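The last two steps are the least obvious, so here is a small sketch of them run on canned input instead of a real directory. The hashes and paths below are made up for illustration; the lines only mimic the shape of sorted md5sum output (a 32-character hash, two spaces, then the path). GNU uniq is assumed.

```shell
# Made-up md5sum-style lines, already sorted: 32-character hash, two spaces, path.
sample='11111111111111111111111111111111  ./a.txt
11111111111111111111111111111111  ./copy of a.txt
22222222222222222222222222222222  ./unique.txt
33333333333333333333333333333333  ./b.txt
33333333333333333333333333333333  ./sub/b.txt'

groups=$(printf '%s\n' "$sample" |
  uniq -w32 --all-repeated=separate |  # keep lines whose first 32 characters repeat
  cut -d" " -f3-)                      # fields are: hash, empty, path (spaces kept)

printf '%s\n' "$groups"
# prints:
# ./a.txt
# ./copy of a.txt
#
# ./b.txt
# ./sub/b.txt
```

Because md5sum separates the hash and the name with two spaces, the second space-delimited field is empty, which is why the path starts at field 3.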

Some alternative approaches can be found at the original source.

Tagged linux, one-liner and ops.

 

This page was generated at 4:16 PM on 26 Feb 2018.
Copyright © 1999 - 2018 Rodney Waldhoff.