I have duplicate photos in my image library. We all do. I want to weed them out. The trouble with using a straight MD5 checksum for this, though, is that my photo management tool may alter the EXIF data in JPEG files. Two files can then contain exactly the same photo data, yet the associated extra data (date/time, ICC profile, tags, etc.) causes a pure checksum comparison to fail.
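To make the failure concrete, here is a minimal sketch using toy byte strings rather than real JPEG files. The `META[...]` header and the `strip_meta` helper are made up for illustration; they only stand in for metadata that a photo tool might rewrite.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Toy stand-ins, NOT real JPEG data: a metadata header followed by a payload.
my $payload = 'PIXELS' x 8;
my $copy_a  = "META[date=2012-01-01]" . $payload;
my $copy_b  = "META[date=2012-06-15;tag=vacation]" . $payload;

# Hypothetical helper: drop the toy metadata header, keep the payload.
sub strip_meta {
    my ($bytes) = @_;
    $bytes =~ s/^META\[[^\]]*\]//;
    return $bytes;
}

# Compare digests of the whole "files", then of the payloads alone.
my $raw_equal      = md5_hex($copy_a) eq md5_hex($copy_b) ? 1 : 0;
my $stripped_equal = md5_hex(strip_meta($copy_a)) eq md5_hex(strip_meta($copy_b)) ? 1 : 0;

print "whole-file digests match: ", ($raw_equal ? "yes" : "no"), "\n";
print "payload digests match:    ", ($stripped_equal ? "yes" : "no"), "\n";
```

The whole-file digests differ because the headers differ, while the payload digests agree. Real JPEG metadata is not laid out this simply, which is why an external tool is needed to strip it.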
Here’s a Perl script which walks through a list of folders and computes an MD5 digest for every file it finds. However, any jp(e)g files are first run through jpegtran and saved as a temporary file, so they can be “normalized” (i.e. converted to optimized, progressive, and EXIF-stripped), and the MD5 is then computed on just the image data. This should find duplicates regardless of image program tampering.
#!/usr/bin/perl -w
use strict;
# path to jpegtran
my $JPEGTRAN_LOC = '/Users/grkenn/Pictures/jpegtran';
# Somewhat Advanced Photo Dupe Finder
# Greg Kennedy 2012
# Identifies duplicate photos by image data:
# strips EXIF info and converts to optimize + progressive
# before performing MD5 on image data
# Requires "jpegtran" application from libjpeg project
# Mac users: http://www.phpied.com/installing-jpegtran-mac-unix-linux/
use File::Find;
use Digest::MD5;
my %fingerprint;
my $ctx = Digest::MD5->new;
sub process
{
    my $filename = $_;
    # file is a directory
    if (-d $filename) { return; }
    # file is an OSX hidden resource fork
    if ($filename =~ m/^\._/) { return; }
    if ($filename =~ m/\.jpe?g$/i) {
        # attempt to use jpegtran to "normalize" jpg files
        # (list form of system() bypasses the shell, so quotes or $ signs
        # in a file name cannot break the command)
        if (system($JPEGTRAN_LOC, '-copy', 'none', '-optimize', '-progressive', '-outfile', '/tmp/find_dupe.jpg', $filename)) {
            print STDERR "\tError normalizing file " . $File::Find::name . "\n\n";
        } else {
            $filename = '/tmp/find_dupe.jpg';
        }
    }
    # open file (three-argument open, so odd file names are not misparsed)
    open (my $fp, '<', $filename) or die "Couldn't open $filename (source " . $File::Find::name . "): $!\n";
    binmode($fp);
    # MD5 digest on file; digest() also resets $ctx for the next file
    $ctx->addfile($fp);
    push (@{$fingerprint{$ctx->digest}}, $File::Find::name);
    close($fp);
}
## Main script
if (scalar @ARGV == 0)
{
    print "Usage: ./find_dupe.pl <folder> [<folder> ...]\n";
    print "\tjpegtran MUST be in the path,\n";
    print "\tor edit the script and set JPEGTRAN_LOC to an absolute location\n";
    exit;
}
find(\&process, @ARGV);
print "Duplicates report:\n";
# each fingerprint maps to the list of files sharing that digest
foreach my $md5sum (keys %fingerprint)
{
    if (scalar @{$fingerprint{$md5sum}} > 1)
    {
        print "--------------------\n";
        foreach my $fname (@{$fingerprint{$md5sum}})
        {
            print $fname . "\n";
        }
    }
}
The output looks something like this:
macmini:Pictures grkenn$ ./find_dupe.pl test_lib/
Duplicates report:
--------------------
test_lib/ufo or moon.jpg
test_lib/subdirectory/dupe_7.jpg
--------------------
test_lib/too cool jenny.jpg
test_lib/subdirectory/dupe1.jpg
test_lib/subdirectory/dupe2.jpg
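The script writes every normalized copy to the same /tmp/find_dupe.jpg, which works as long as only one copy of it runs at a time. As a possible refinement, here is a rough sketch of the same normalize-then-hash step using a unique temporary file per call. The `normalized_md5` helper and its arguments are hypothetical and not part of the script above; the jpegtran options are the ones the script already uses.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Hypothetical helper: return a hex MD5 for $path, hashing a
# jpegtran-normalized copy when $path looks like a JPEG.
# $jpegtran is the location of the jpegtran binary.
sub normalized_md5 {
    my ($path, $jpegtran) = @_;
    my $to_hash = $path;
    my $tmp_name;

    if ($path =~ m/\.jpe?g$/i) {
        # Unique temp file name; UNLINK removes it at exit if we miss it
        (my $tmp_fh, $tmp_name) = tempfile(SUFFIX => '.jpg', UNLINK => 1);
        close($tmp_fh);
        # List form of system() avoids the shell entirely
        my $status = system($jpegtran, '-copy', 'none', '-optimize',
                            '-progressive', '-outfile', $tmp_name, $path);
        if ($status == 0) {
            $to_hash = $tmp_name;
        } else {
            warn "Error normalizing $path; hashing it unmodified\n";
        }
    }

    open(my $fh, '<', $to_hash) or die "Couldn't open $to_hash: $!\n";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    unlink($tmp_name) if defined $tmp_name;
    return $digest;
}
```

Calling `normalized_md5($File::Find::name, $JPEGTRAN_LOC)` from a File::Find callback would need either `no_chdir => 1` or `$_` as the first argument, since File::Find changes into each directory by default.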
