Duplicate Image Finder (in Perl)

I have duplicate photos in my image library.  We all do.  I want to weed them out.  The trouble with using a straight MD5 checksum for this is that my photo management tool may alter the EXIF data in JPEG files: two files then contain exactly the same photo data, but the associated extra data (date/time, ICC profile, tags, etc.) makes a pure checksum comparison fail.
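
To make that failure concrete, here is a minimal sketch in which two byte strings stand in for two copies of one photo.  The `META:` prefix and the payload bytes are made-up placeholders, not real JPEG or EXIF structures; only `Digest::MD5` and its `md5_hex` function are the real thing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-ins for two copies of one photo: the same "pixel"
# payload wrapped in different metadata.  These are NOT real JPEG bytes.
my $pixels = "\x01\x02\x03\x04" x 8;
my $copy_a = "META:2012-01-01;" . $pixels;
my $copy_b = "META:2012-06-15;tag=moon;" . $pixels;

# Hashing the whole "file": the metadata difference changes the digest.
my $whole_files_match = md5_hex($copy_a) eq md5_hex($copy_b) ? 1 : 0;

# Hashing only the payload (the last length($pixels) bytes): digests agree.
my $payload_a     = substr($copy_a, -length($pixels));
my $payload_b     = substr($copy_b, -length($pixels));
my $payload_match = md5_hex($payload_a) eq md5_hex($payload_b) ? 1 : 0;

print "whole files match: $whole_files_match\n";
print "payloads match:    $payload_match\n";
```

A real JPEG is not laid out this simply, which is why a dedicated tool is needed to strip the metadata before hashing.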

Here’s a Perl script which walks a list of folders and computes an MD5 digest of every file it finds.  Any jp(e)g files, however, are first run through jpegtran and saved as a temporary file, so they are “normalized” (i.e. converted to optimized, progressive, and EXIF-stripped form) and the MD5 covers just the image data.  This should find duplicates regardless of image program tampering, as long as the program only touched the metadata and did not re-encode the image itself.

#!/usr/bin/perl -w
use strict;

# path to jpegtran
my $JPEGTRAN_LOC = '/Users/grkenn/Pictures/jpegtran';

# Somewhat Advanced Photo Dupe Finder
# Greg Kennedy 2012

# Identifies duplicate photos by image data:
# strips EXIF info and converts to optimize + progressive
# before performing MD5 on image data

# Requires "jpegtran" application from libjpeg project
# Mac users: http://www.phpied.com/installing-jpegtran-mac-unix-linux/

use File::Find;
use Digest::MD5;

my %fingerprint;

my $ctx = Digest::MD5->new;

sub process
{
  my $filename = $_;

  # file is a directory
  if (-d $filename) { return; }
  # file is an OSX hidden resource fork
  if ($filename =~ m/^\._/) { return; }

  if ($filename =~ m/\.jpe?g$/i) {
    # attempt to use jpegtran to "normalize" jpg files
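    # list-form system() bypasses the shell, so quotes, "$" or backticks in a file name are passed through untouched
    if (system($JPEGTRAN_LOC, '-copy', 'none', '-optimize', '-progressive', '-outfile', '/tmp/find_dupe.jpg', $filename)) {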
      print STDERR "\tError normalizing file " . $File::Find::name . "\n\n";
    } else {
      $filename = '/tmp/find_dupe.jpg';
    }
  }

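  # open file (three-argument open, so odd characters in the name are never read as a mode)
  open (my $fp, '<', $filename) or die "Couldn't open $filename (source " . $File::Find::name . "): $!\n";
  binmode($fp);
  # MD5 digest on file
  $ctx->addfile($fp);
  # digest() returns the raw 16-byte MD5 and resets $ctx for the next file
  push (@{$fingerprint{$ctx->digest}}, $File::Find::name);
  close($fp);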
}

## Main script
if (scalar @ARGV == 0)
{
  print "Usage: ./find_dupe.pl <folder> [<folder> ...]\n";
  print "\tjpegtran MUST be in the path,\n";
  print "\tor edit the script and set JPEGTRAN_LOC to an absolute location\n";
  exit;
}

find(\&process, @ARGV);

print "Duplicates report:\n";

foreach my $md5sum (keys %fingerprint)
{
  if (scalar @{$fingerprint{$md5sum}} > 1)
  {
    print "--------------------\n";
    foreach my $fname (@{$fingerprint{$md5sum}})
    {
      print $fname . "\n";
    }
  }
}

The output looks something like this:

macmini:Pictures grkenn$ ./find_dupe.pl test_lib/
Duplicates report:
--------------------
test_lib/ufo or moon.jpg
test_lib/subdirectory/dupe_7.jpg
--------------------
test_lib/too cool jenny.jpg
test_lib/subdirectory/dupe1.jpg
test_lib/subdirectory/dupe2.jpg
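
The grouping behind this report can be tried without any image files.  The sketch below is a simplified stand-in, not the script itself: `duplicate_groups` and the sample file names and contents are hypothetical, and it keys on `md5_hex` rather than the raw digest so the keys stay printable.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Group names by the MD5 of their content and return only the groups
# with two or more members.  $content_of is a hash reference mapping a
# (hypothetical) file name to its bytes.
sub duplicate_groups {
    my ($content_of) = @_;
    my %names_by_digest;
    for my $name (sort keys %$content_of) {
        push @{ $names_by_digest{ md5_hex($content_of->{$name}) } }, $name;
    }
    my @groups = grep { @$_ > 1 } values %names_by_digest;
    # Order groups by their first name so the report is stable between runs.
    my @sorted = sort { $a->[0] cmp $b->[0] } @groups;
    return @sorted;
}

# Made-up sample data: short strings stand in for normalized image bytes.
my %sample = (
    'a.jpg'          => 'AAAA',
    'b.jpg'          => 'BBBB',
    'sub/a_copy.jpg' => 'AAAA',
    'c.jpg'          => 'CCCC',
    'c2.jpg'         => 'CCCC',
    'sub/c_copy.jpg' => 'CCCC',
);

my @groups = duplicate_groups(\%sample);
for my $group (@groups) {
    print "--------------------\n";
    print "$_\n" for @$group;
}
```

Sorting the names and the groups is a choice made for this sketch; the script above prints groups in whatever order `keys %fingerprint` returns them, which can differ from run to run.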
