Graphical detector and cleanup tool for duplicate images and video
findimagedupes.py Add support for hashing videos and displaying them side by side. 2024-05-15 11:25:51 +02:00
hash_cache.py Add a script to dump all entries with duplicate files as a CSV file. 2024-04-18 01:35:00 +02:00
hash_cache_to_csv.py Don't add an extra newline when writing a CSV file. 2024-04-25 17:50:59 +02:00
LICENSE.txt Add license terms to make the licensing more obvious. 2024-04-24 09:09:17 +02:00
merge_gui.py Add support for hashing videos and displaying them side by side. 2024-05-15 11:25:51 +02:00
README.md Add a readme file explaining the tools and how to use them. 2024-04-25 05:28:37 +02:00
requirements.txt Add OpenCV to the requirements. 2024-05-15 20:01:49 +02:00
video_hash.py Add support for hashing videos and displaying them side by side. 2024-05-15 11:25:51 +02:00

Image De-duplicator

findimagedupes is a set of tools for de-duplicating images. It uses image hashing algorithms to find images that show the same content. The toolchain consists of three tools, meant to be run in sequence:

  • findimagedupes recursively searches a list of directories and stores a list of all images found in those directories in a database in the current directory. The database is keyed by image hash.
  • hash_cache_to_csv reads the database, selects the records that contain duplicates, and writes a CSV file with one line per image hash listing all duplicates for that hash.
  • merge_gui reads that CSV file and opens a comparison dialog for each entry, with options for resolving the duplicate.

The image de-duplicator can find duplicate images even if they

  • have been scaled to different sizes
  • are stored in different image formats (e.g. JPG and HEIC)
  • contain different EXIF tags
  • reside in different directories

It achieves this by comparing perceptual hashes of the image contents, rather than any properties of the image files.
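
To illustrate why a content-based hash survives rescaling, here is a toy sketch that uses a simple average hash on plain Python lists. It is not the algorithm findimagedupes uses (that comes from the ImageHash library); the function names and the 4x4 hash size are made up for this example.

```python
def downscale(pixels, size):
    """Shrink a square grayscale grid to size x size by averaging blocks."""
    block = len(pixels) // size
    return [
        [
            sum(pixels[y * block + dy][x * block + dx]
                for dy in range(block) for dx in range(block)) / (block * block)
            for x in range(size)
        ]
        for y in range(size)
    ]


def average_hash(pixels, size=4):
    """Toy average hash: one bit per cell, set if the cell is brighter than the mean."""
    small = downscale(pixels, size)
    flat = [value for row in small for value in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, value in enumerate(flat) if value > mean)


def upscale(pixels, factor):
    """Enlarge a grid by repeating every pixel `factor` times in both directions."""
    return [
        [value for value in row for _ in range(factor)]
        for row in pixels for _ in range(factor)
    ]


# An 8x8 diagonal gradient stands in for a photo.
original = [[(x + y) * 16 for x in range(8)] for y in range(8)]
enlarged = upscale(original, 2)                           # same picture at 16x16
inverted = [[255 - value for value in row] for row in original]  # different picture

# The hash depends on the picture, not on its pixel dimensions.
assert average_hash(original) == average_hash(enlarged)
assert average_hash(original) != average_hash(inverted)
```

Because both versions are reduced to the same small grid before any bits are derived, the file size, format, and metadata never enter the comparison.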

Installing required packages

findimagedupes comes with a requirements.txt file which lists all dependencies required to run any of the scripts contained herein. You can install them using the following command:

$ pip3 install -r requirements.txt

To avoid cluttering your Python installation with extra packages, it is recommended to install the dependencies into a virtual environment (virtualenv). Create and activate the virtual environment first, then run the pip3 command above to install the dependencies into it. As long as the virtual environment is active, the findimagedupes scripts should be able to find and use all required dependencies.
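
If you are unsure whether a virtual environment is currently active, a small check like the following can help. This is only a sketch and not part of the findimagedupes scripts; the helper name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if not in_virtualenv():
    print("Warning: no virtual environment is active; "
          "pip3 would install into the base Python installation.")
```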

Example use case

  1. Analyze a list of directories and generate a hashes.db file.

    $ python3 findimagedupes.py /home/user/Nextcloud /home/user/Pictures
    

    This will create a file hashes.db (or, depending on your operating system, multiple files with the common prefix hashes.db) in the current working directory.

    You can use relative directory paths if you want to. In that case they are stored as relative paths in the database (and copied as such to the CSV file in the next step).

  2. Dump all duplicates from the hashes.db database into CSV format.

    $ python3 hash_cache_to_csv.py hashes.db hashes.csv
    

    This will create the hashes.csv file you will need for the next step.

  3. Run the merge GUI to step through the list of duplicate images and decide, for each one, whether the images really are duplicates, which one you want to keep, and whether you want to overwrite one with the other.

    $ python3 merge_gui.py
    

    This will open up a file dialog which will allow you to pick your hashes.csv file (or whatever you named it earlier).
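
The three steps above could also be chained from a small driver script. The following is a sketch only: the helpers build_pipeline and run_pipeline are hypothetical, and the sketch assumes it is started from the directory that contains the three scripts, so that the hashes.db written by the first stage is the one read by the second.

```python
import subprocess
import sys


def build_pipeline(directories, db_path="hashes.db", csv_path="hashes.csv"):
    """Return the three commands from the example, in the order they run."""
    return [
        # Stage 1 writes hashes.db into the current working directory.
        [sys.executable, "findimagedupes.py", *directories],
        # Stage 2 converts the database into a CSV file of duplicates.
        [sys.executable, "hash_cache_to_csv.py", db_path, csv_path],
        # Stage 3 takes no arguments; the CSV file is picked in its file dialog.
        [sys.executable, "merge_gui.py"],
    ]


def run_pipeline(directories):
    """Run each stage in turn and stop at the first one that fails."""
    for command in build_pipeline(directories):
        subprocess.run(command, check=True)
```

Calling run_pipeline(["/home/user/Nextcloud", "/home/user/Pictures"]) would then run the same commands as in the example, one after the other.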

findimagedupes.py

The findimagedupes script does most of the work. It is invoked with a list of directories, searches each of them recursively, and computes an image hash for every image file it finds.

The image hash is computed on a version of the input image that has been downscaled to 64x64 pixels. Two hashes are computed in sequence for each image:

  • a perceptual hash, and
  • a wavelet hash.

Combining the two hashes has been found to eliminate most false duplicates while still catching most of the images that actually are duplicates. See the ImageHash documentation for more details on these hashing methods.
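
The benefit of using two hashes can be sketched with toy stand-ins. The example below uses a simple average hash and a gradient hash on tiny lists of grayscale values; these are not the perceptual and wavelet hashes from ImageHash, and combining both results into one tuple key is an assumption made for illustration, not necessarily how findimagedupes stores them.

```python
from collections import defaultdict


def average_bits(values):
    """Toy hash 1: one bit per value, set if it is brighter than the mean."""
    mean = sum(values) / len(values)
    return tuple(v > mean for v in values)


def gradient_bits(values):
    """Toy hash 2: one bit per neighbouring pair, set if brightness increases."""
    return tuple(a < b for a, b in zip(values, values[1:]))


def combined_key(values):
    """Two images count as duplicates only if both hashes agree."""
    return average_bits(values), gradient_bits(values)


# Hypothetical, already downscaled images.
images = {
    "a.jpg": [10, 20, 200, 210],
    "a_copy.heic": [12, 22, 198, 212],   # re-encoded copy of a.jpg
    "b.jpg": [20, 10, 210, 200],         # different picture
}

groups = defaultdict(list)
for name, values in images.items():
    groups[combined_key(values)].append(name)

duplicates = [names for names in groups.values() if len(names) > 1]

# The first hash alone cannot tell a.jpg and b.jpg apart ...
assert average_bits(images["a.jpg"]) == average_bits(images["b.jpg"])
# ... but the combined key reports only the real duplicate pair.
assert duplicates == [["a.jpg", "a_copy.heic"]]
```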

By default, findimagedupes attempts to make efficient use of all CPU cores in your machine. There is currently no configuration setting to change this; to change the number of worker threads, edit the line in the script that calls multiprocessing.cpu_count() and replace the call with a different value. You can also restrict the script to specific cores, for example with sched_setaffinity on Linux or Process Lasso on Windows, to keep your machine responsive.

Note that the script will temporarily load an image into memory in each worker thread, so using too many threads can exhaust your system memory.
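
Since the text mentions multiprocessing.cpu_count(), the trade-off between cores and memory can be sketched with a multiprocessing pool. This is an assumption about the general approach, not the script's actual code; pick_worker_count, hash_all, and the limit parameter are hypothetical.

```python
import multiprocessing


def pick_worker_count(limit=None):
    """Return how many workers to start: all cores, or at most `limit`."""
    available = multiprocessing.cpu_count()
    if limit is None:
        return available
    # Never start fewer than one worker, never more than there are cores.
    return max(1, min(available, limit))


def hash_all(paths, hash_one, limit=None):
    """Hash every path in a pool of workers.

    `hash_one` must be a module-level function so it can be sent to the
    workers. Each worker holds one decoded image at a time, so peak memory
    grows with the number of workers.
    """
    with multiprocessing.Pool(processes=pick_worker_count(limit)) as pool:
        return pool.map(hash_one, paths)
```

With this shape, lowering the limit reduces both CPU load and the number of images held in memory at the same time.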

Usage:

python3 findimagedupes.py path/to/directory1 [path/to/directory2 [...]]

hash_cache_to_csv.py

This script is really simple and should finish very quickly. It reads the previously generated hashes.db database and writes a CSV file to the specified output path.

Usage:

python3 hash_cache_to_csv.py path/to/hashes.db path/to/output.csv
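
The conversion can be sketched as follows. Loud assumptions: the database is treated as a shelve file that maps each image hash to a list of paths, and each CSV row contains only the paths; the actual on-disk format and column layout used by hash_cache_to_csv.py may differ. The function name dump_duplicates and the sample hashes are made up.

```python
import csv
import os
import shelve
import tempfile


def dump_duplicates(db_path, csv_path):
    """Write one CSV row per hash that maps to more than one file."""
    rows = 0
    # newline="" lets the csv module control line endings, which avoids
    # extra blank lines between rows on some platforms.
    with shelve.open(db_path, flag="r") as db, \
            open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for image_hash in sorted(db):
            paths = db[image_hash]
            if len(paths) > 1:
                writer.writerow(paths)
                rows += 1
    return rows


# Build a tiny sample database and convert it.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "hashes.db")
    csv_path = os.path.join(tmp, "hashes.csv")
    with shelve.open(db_path) as db:
        db["ffee0011"] = ["a.jpg", "copy/a.heic"]   # two files, same hash
        db["12345678"] = ["unique.png"]             # no duplicate
    written = dump_duplicates(db_path, csv_path)
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))

assert written == 1
assert rows == [["a.jpg", "copy/a.heic"]]
```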

merge_gui.py

This script takes no parameters; it is controlled entirely through its GUI, which is based on PyQt.

  1. Run the merge GUI.

    $ python3 merge_gui.py
    
  2. Find and open the previously generated CSV file containing all duplicates.

  3. For each image displayed,

    • Check that the left and right image are actually the same, otherwise click on Skip.
    • Compare the EXIF metadata differences displayed below the image. If unsure, you may want to decide to keep the file with more metadata.
    • To delete one of the two images, click the button for the side you want to keep: Keep Right deletes the file on the left, and Keep Left deletes the file on the right.
    • If you want to keep the file on one side, but in the location specified on the other, use the Move left to right or Move right to left button. Each deletes the picture on the target side and moves the one from the other side into its place.

All deleted files are moved to your system's Recycle Bin, so if you make a mistake, it should be easy to restore a file you deleted by accident.
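
The four actions can be sketched with plain file operations. This is not the code of merge_gui.py: the function resolve and the action names are hypothetical, an ordinary directory stands in for the system Recycle Bin, and a moved file is assumed to keep its own name inside the target directory.

```python
import shutil
import tempfile
from pathlib import Path


def resolve(left, right, action, trash):
    """Apply one merge action to a pair of duplicate files."""
    left, right, trash = Path(left), Path(right), Path(trash)
    trash.mkdir(parents=True, exist_ok=True)

    def discard(path):
        # Stand-in for "move to Recycle Bin": nothing is deleted outright.
        shutil.move(str(path), str(trash / path.name))

    if action == "keep_left":
        discard(right)
    elif action == "keep_right":
        discard(left)
    elif action == "move_left_to_right":
        discard(right)
        shutil.move(str(left), str(right.parent / left.name))
    elif action == "move_right_to_left":
        discard(left)
        shutil.move(str(right), str(left.parent / right.name))
    elif action != "skip":
        raise ValueError(f"unknown action: {action}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "phone").mkdir()
    (root / "album").mkdir()
    left = root / "phone" / "img.jpg"
    right = root / "album" / "holiday.jpg"
    left.write_text("left")
    right.write_text("right")

    resolve(left, right, "move_left_to_right", root / "trash")
    remaining = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )

# The right-hand file went to the trash; the left-hand file took its location.
assert remaining == ["album/img.jpg", "trash/holiday.jpg"]
```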

If you exit the program before reaching the end of the list, don't fret: when you open the same CSV file again, the merge GUI skips all lines for which no more files can be found.
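
Resuming can be pictured as a filter over the CSV rows. The sketch below assumes that a row still needs a decision only while at least two of its files exist; the exact rule in merge_gui.py may differ, and pending_rows is a hypothetical helper.

```python
import os
import tempfile


def pending_rows(rows):
    """Yield only the rows that still need a decision."""
    for row in rows:
        existing = [path for path in row if os.path.exists(path)]
        # A row is still a duplicate set only if two or more files remain.
        if len(existing) > 1:
            yield existing


with tempfile.TemporaryDirectory() as tmp:
    kept = os.path.join(tmp, "kept.jpg")
    copy = os.path.join(tmp, "copy.jpg")
    gone = os.path.join(tmp, "deleted.jpg")   # resolved in an earlier session
    for path in (kept, copy):
        open(path, "w").close()

    rows = [[kept, copy], [kept, gone]]
    todo = list(pending_rows(rows))

# Only the unresolved pair is shown again.
assert todo == [[kept, copy]]
```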

Usage:

$ python3 merge_gui.py

License

This collection of scripts is licensed under the 3-clause BSD license; see the LICENSE.txt file for details.