I have duplicate photos in my image library. We all do. I want to weed them out. The trouble with using a straight MD5 checksum for this, though, is that the EXIF data in JPEG files may be altered by my photo management tool. Two files can then contain the exact same photo data, while differences in the associated extra data (date/time, ICC profile, tags, etc.) cause a pure checksum comparison to fail.
Here’s a Perl script which walks a list of folders and computes an MD5 digest of every file it finds. However, any jp(e)g files are first run through jpegtran and saved to a temporary file, so they can be “normalized” (i.e. converted to optimized, progressive, and EXIF-stripped form) and the MD5 is computed on just the image data. This should find duplicates regardless of image program tampering.
#!/usr/bin/perl -w
use strict;

# path to jpegtran
my $JPEGTRAN_LOC = '/Users/grkenn/Pictures/jpegtran';

# Somewhat Advanced Photo Dupe Finder
# Greg Kennedy 2012

# Identifies duplicate photos by image data:
#  strips EXIF info and converts to optimize + progressive
#  before performing MD5 on image data

# Requires "jpegtran" application from libjpeg project
# Mac users: http://www.phpied.com/installing-jpegtran-mac-unix-linux/

use File::Find;
use Digest::MD5;

my %fingerprint;

my $ctx = Digest::MD5->new;

sub process
{
    my $filename = $_;

    # file is a directory
    if (-d $filename) { return; }
    # file is an OSX hidden resource fork
    if ($filename =~ m/^\._/) { return; }

    if ($filename =~ m/\.jpe?g$/i) {
        # attempt to use jpegtran to "normalize" jpg files
        # (list-form system() bypasses the shell, so quotes or $ in a filename are harmless)
        if (system($JPEGTRAN_LOC, '-copy', 'none', '-optimize', '-progressive',
                   '-outfile', '/tmp/find_dupe.jpg', $filename)) {
            print STDERR "\tError normalizing file " . $File::Find::name . "\n\n";
        } else {
            $filename = '/tmp/find_dupe.jpg';
        }
    }

    # open file (three-argument open, so odd characters in the name are not treated as a mode)
    open (my $fp, '<', $filename) or die "Couldn't open $filename (source " . $File::Find::name . "): $!\n";
    binmode($fp);

    # MD5 digest on file; digest() also resets $ctx for the next file
    $ctx->addfile($fp);
    push (@{$fingerprint{$ctx->digest}}, $File::Find::name);

    close($fp);
}

## Main script
if (scalar @ARGV == 0) {
    print "Usage: ./find_dupe.pl <folder> [<folder> ...]\n";
    print "\tjpegtran MUST be in the path,\n";
    print "\tor edit the script and set JPEGTRAN_LOC to an absolute location\n";
    exit;
}

find(\&process, @ARGV);

print "Duplicates report:\n";

foreach my $md5sum (keys %fingerprint) {
    if (scalar @{$fingerprint{$md5sum}} > 1) {
        print "--------------------\n";
        foreach my $fname (@{$fingerprint{$md5sum}}) {
            print $fname . "\n";
        }
    }
}
The output looks something like this:
macmini:Pictures grkenn$ ./find_dupe.pl test_lib/
Duplicates report:
--------------------
test_lib/ufo or moon.jpg
test_lib/subdirectory/dupe_7.jpg
--------------------
test_lib/too cool jenny.jpg
test_lib/subdirectory/dupe1.jpg
test_lib/subdirectory/dupe2.jpg
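
One possible variation, offered only as a rough sketch and not as part of the script above: the fixed temporary file /tmp/find_dupe.jpg could be skipped by reading jpegtran's output through a pipe and feeding it straight to Digest::MD5. This assumes jpegtran is on the PATH and writes the transformed image to standard output when no -outfile switch is given. The helper names md5_of_command_output and md5_of_normalized_jpeg are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper (not in the original script): run a command, read its
# standard output through a pipe, and return the MD5 hex digest of those
# bytes. Returns undef if the command cannot be started or exits non-zero.
sub md5_of_command_output {
    my @cmd = @_;

    # List-form pipe open bypasses the shell, so spaces or quotes in a
    # filename need no escaping.
    open(my $pipe, '-|', @cmd) or return;
    binmode($pipe);

    my $digest = Digest::MD5->new->addfile($pipe)->hexdigest;

    # close() on a pipe waits for the child process and returns false
    # when it exited with a non-zero status.
    close($pipe) or return;
    return $digest;
}

# Assumption: "jpegtran" is on the PATH and, with no -outfile switch,
# writes the normalized JPEG to standard output.
sub md5_of_normalized_jpeg {
    my ($filename) = @_;
    return md5_of_command_output(
        'jpegtran', '-copy', 'none', '-optimize', '-progressive', $filename
    );
}

# Demo: print a fingerprint for each file named on the command line.
foreach my $file (@ARGV) {
    my $digest = md5_of_normalized_jpeg($file);
    print defined $digest ? "$digest  $file\n" : "FAILED  $file\n";
}
```

Run with a few JPEG paths as arguments, this prints one hex digest per file. Files whose digests match are duplicate candidates in the same sense as in the report above. The hex digest is used here only because it is printable; the grouping logic in the script would work the same way with either form.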