{"id":148,"date":"2012-03-20T20:00:26","date_gmt":"2012-03-20T20:00:26","guid":{"rendered":"http:\/\/greg-kennedy.com\/?p=148"},"modified":"2012-03-22T01:12:08","modified_gmt":"2012-03-22T01:12:08","slug":"duplicate-image-finder-in-perl","status":"publish","type":"post","link":"https:\/\/greg-kennedy.com\/wordpress\/2012\/03\/20\/duplicate-image-finder-in-perl\/","title":{"rendered":"Duplicate Image Finder (in Perl)"},"content":{"rendered":"<p>I have duplicate photos in my image library. \u00a0We all do. \u00a0I want to weed them out. \u00a0The trouble with using a straight md5 checksum for this, though, is that the EXIF data on JPEG files may be altered by my photo management tool &#8211; two files can contain the exact same photo data, yet the associated extra data (date\/time, ICC profile, tags, etc.) causes a pure checksum comparison to fail.<\/p>\n<p>Here&#8217;s a Perl script that iterates through a folder list and sends every file off for an md5 digest. \u00a0However, any jp(e)g files are first run through jpegtran and saved as a temporary file, so they can be &#8220;normalized&#8221; (i.e. converted to an optimized, progressive, EXIF-stripped form) and the md5 is performed on just the image data. 
\u00a0This should find duplicates regardless of image program tampering.<\/p>\n<pre>#!\/usr\/bin\/perl -w\r\nuse strict;\r\n\r\n# path to jpegtran\r\nmy $JPEGTRAN_LOC = '\/Users\/grkenn\/Pictures\/jpegtran';\r\n\r\n# Somewhat Advanced Photo Dupe Finder\r\n# Greg Kennedy 2012\r\n\r\n# Identifies duplicate photos by image data:\r\n# strips EXIF info and converts to optimized + progressive\r\n# before performing MD5 on image data\r\n\r\n# Requires \"jpegtran\" application from libjpeg project\r\n# Mac users: http:\/\/www.phpied.com\/installing-jpegtran-mac-unix-linux\/\r\n\r\nuse File::Find;\r\nuse Digest::MD5;\r\n\r\nmy %fingerprint;\r\n\r\nmy $ctx = Digest::MD5-&gt;new;\r\n\r\nsub process\r\n{\r\n  my $filename = $_;\r\n\r\n  # file is a directory\r\n  if (-d $filename) { return; }\r\n  # file is an OSX hidden resource fork\r\n  if ($filename =~ m\/^\\._\/) { return; }\r\n\r\n  if ($filename =~ m\/\\.jpe?g$\/i) {\r\n    # attempt to use jpegtran to \"normalize\" jpg files\r\n    # (list-form system avoids shell quoting problems with odd filenames)\r\n    if (system($JPEGTRAN_LOC, '-copy', 'none', '-optimize', '-progressive', '-outfile', '\/tmp\/find_dupe.jpg', $filename)) {\r\n      print STDERR \"\\tError normalizing file \" . $File::Find::name . \"\\n\\n\";\r\n    } else {\r\n      $filename = '\/tmp\/find_dupe.jpg';\r\n    }\r\n  }\r\n\r\n  # open file\r\n  open (FP, '&lt;', $filename) or die \"Couldn't open $filename (source \" . $File::Find::name . 
\"): $!\\n\";\r\n  binmode(FP);\r\n  # MD5 digest on file\r\n  $ctx-&gt;addfile(*FP);\r\n  push (@{$fingerprint{$ctx-&gt;digest}}, $File::Find::name);\r\n  close(FP);\r\n}\r\n\r\n## Main script\r\nif (scalar @ARGV == 0)\r\n{\r\n  print \"Usage: .\/find_dupe.pl &lt;path&gt; [&lt;path&gt; ...]\\n\";\r\n  print \"\\tjpegtran MUST be in the path,\\n\";\r\n  print \"\\tor edit the script and set JPEGTRAN_LOC to an absolute location\\n\";\r\n  exit;\r\n}\r\n\r\nfind(\\&amp;process, @ARGV);\r\n\r\nprint \"Duplicates report:\\n\";\r\n\r\nforeach my $md5sum (keys %fingerprint)\r\n{\r\n  if (scalar @{$fingerprint{$md5sum}} &gt; 1)\r\n  {\r\n    print \"--------------------\\n\";\r\n    foreach my $fname (@{$fingerprint{$md5sum}})\r\n    {\r\n      print $fname . \"\\n\";\r\n    }\r\n  }\r\n}<\/pre>\n<p>The output looks something like this:<\/p>\n<pre>macmini:Pictures grkenn$ .\/find_dupe.pl test_lib\/\r\nDuplicates report:\r\n--------------------\r\ntest_lib\/ufo or moon.jpg\r\ntest_lib\/subdirectory\/dupe_7.jpg\r\n--------------------\r\ntest_lib\/too cool jenny.jpg\r\ntest_lib\/subdirectory\/dupe1.jpg\r\ntest_lib\/subdirectory\/dupe2.jpg<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I have duplicate photos in my image library. \u00a0We all do. \u00a0I want to weed them out. 
\u00a0The trouble with using straight md5 for this, though, is that EXIF data on JPEG files may be altered by my photo management tool &#8211; thus they contain the exact same photo data, but the associated extra data [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[],"class_list":["post-148","post","type-post","status-publish","format-standard","hentry","category-software"],"_links":{"self":[{"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/posts\/148","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/comments?post=148"}],"version-history":[{"count":8,"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/posts\/148\/revisions"}],"predecessor-version":[{"id":156,"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/posts\/148\/revisions\/156"}],"wp:attachment":[{"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/media?parent=148"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/categories?post=148"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/greg-kennedy.com\/wordpress\/wp-json\/wp\/v2\/tags?post=148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}