Image De-duplicator
findimagedupes is a set of tools for de-duplicating images. It uses image hashing algorithms to find images that show the same content. The toolchain consists of three tools, meant to be run in sequence as stages:
- findimagedupes recursively searches a list of directories and stores a list of all images found in those directories in a database in the current directory. The database is keyed by image hash.
- hash_cache_to_csv reads the records from the hash database and writes a CSV file that lists, on one line per image hash, all duplicates found for that hash.
- merge_gui can read the CSV file mentioned above and will open a comparison dialog for each individual file, with options on how to resolve the duplicate.
The image de-duplicator can find duplicate images even if they
- have been scaled to different sizes
- are stored in different image formats (e.g. JPG and HEIC)
- contain different EXIF tags
- reside in different directories
It achieves this by comparing perceptual hashes of the image contents rather than any properties of the image files.
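To give a feel for why a content-based hash survives rescaling, here is a minimal sketch in plain Python. It uses a simplified average hash on synthetic grayscale grids as a stand-in; it is not the hashing algorithm the scripts use, and the helper names `shrink`, `average_hash` and `upscale` are made up for this illustration.

```python
# Simplified average hash on grayscale pixel grids (lists of rows).
# Stand-in for illustration only, not the hash used by findimagedupes.

def shrink(pixels, size):
    """Downscale a square grid to size x size by averaging equal blocks."""
    block = len(pixels) // size
    return [
        [
            sum(pixels[y * block + dy][x * block + dx]
                for dy in range(block) for dx in range(block)) / (block * block)
            for x in range(size)
        ]
        for y in range(size)
    ]

def average_hash(pixels, size=4):
    """Return a bit string: 1 where a cell is brighter than the mean."""
    small = shrink(pixels, size)
    flat = [value for row in small for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)

def upscale(pixels, factor):
    """Enlarge a grid by repeating every pixel `factor` times in both directions."""
    return [
        [value for value in row for _ in range(factor)]
        for row in pixels
        for _ in range(factor)
    ]

# An 8x8 "image" with a bright top-left quadrant, and a 16x16 copy of it.
original = [[200 if x < 4 and y < 4 else 30 for x in range(8)] for y in range(8)]
scaled = upscale(original, 2)

print(average_hash(original) == average_hash(scaled))  # prints True
```

Both grids reduce to the same 4x4 brightness pattern, so the hash is identical even though the pixel dimensions differ.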
Installing required packages
findimagedupes comes with a requirements.txt file which lists all
dependencies required to run any of the scripts contained herein. You can
install them using the following command:
$ pip3 install -r requirements.txt
To avoid dumping random packages into your Python installation, it is
recommended to use VirtualEnv to
contain all the packages required. You need to create and activate your
virtualenv, then run the above pip3 command to install the missing
dependencies into your virtual environment. As long as the virtual environment
is active, findimagedupes scripts should be able to find and use all
required dependencies.
Example use case
- Analyze a list of directories and generate a hashes.db file.
  $ python3 findimagedupes.py /home/user/Nextcloud /home/user/Pictures
  This will create a file hashes.db (or, depending on your operating system, multiple files with the common prefix hashes.db) in the current working directory.
  You can use relative directory paths if you want to; in that case they will be stored as relative paths in the database (and copied as such to the CSV file in the next step).
- Dump all duplicates from the hashes.db database into CSV format.
  $ python3 hash_cache_to_csv.py hashes.db hashes.csv
  This will create the hashes.csv file you will need for the next step.
- Run the merge GUI to go over the list of duplicate images, and determine whether they are really duplicates, and which one you want to keep, or if you want to overwrite one with the other.
  $ python3 merge_gui.py
  This will open up a file dialog which will allow you to pick your hashes.csv file (or whatever you named it earlier).
findimagedupes.py
The findimagedupes script performs the main work of the process. It is
invoked with a list of directories, finds all image files inside them,
and computes an image hash for each one.
The image hash is computed on a version of the input image downscaled to 64x64 pixels. Two hashes are computed in sequence for each image:
- a perceptual hash, and
- a wavelet hash.
This combination has been shown to eliminate most false duplicates while still catching most images that are actual duplicates. See the ImageHash documentation for more details on these hashing methods.
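The effect of requiring two hashes to agree can be sketched as follows. This is an illustration only: the function `group_duplicates`, the record layout, and the hash values are made up, and how the script actually combines and stores the two hashes is not described here.

```python
from collections import defaultdict

def group_duplicates(records):
    """Group file paths by the pair (first hash, second hash).

    `records` is an iterable of (path, first_hash, second_hash) tuples.
    Only groups with more than one file are returned.
    """
    groups = defaultdict(list)
    for path, first_hash, second_hash in records:
        groups[(first_hash, second_hash)].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}

# Made-up hash values for illustration only.
records = [
    ("a/beach.jpg", "f0e1", "9c3a"),
    ("b/beach_small.heic", "f0e1", "9c3a"),  # both hashes match: duplicate
    ("c/dunes.jpg", "f0e1", "17b2"),         # first hash collides, second differs
]

duplicates = group_duplicates(records)
print(duplicates)
```

The third file shares its first hash with the other two, but because its second hash differs it is not reported as a duplicate.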
findimagedupes will attempt to efficiently use all CPU cores in your PC by
default. There is currently no configuration setting to change this; you can
change the number of worker threads directly in the script by changing the
line with the invocation of multiprocessing.cpu_count() to something else.
You can also use tools like taskset (built on sched_setaffinity) on Linux or Process Lasso on
Windows to restrict the script to specific cores to keep your machine
responsive.
Note that the script will temporarily load an image into memory in each worker thread, so using too many threads can exhaust your system memory.
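If you do edit the worker count, one cautious option is to leave a core free and cap the total to bound memory use. The sketch below is an assumption about how that could look, not the code in findimagedupes.py: `pick_worker_count`, its parameters and `fake_hash` are hypothetical, and a thread pool is used purely for the demonstration.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool

def pick_worker_count(reserved_cores=1, cap=None):
    """Return the number of workers, leaving `reserved_cores` cores free.

    `cap` optionally limits the result, for example to bound memory use.
    """
    workers = max(1, multiprocessing.cpu_count() - reserved_cores)
    if cap is not None:
        workers = max(1, min(workers, cap))
    return workers

def fake_hash(path):
    """Placeholder for the real per-image work (load, downscale, hash)."""
    return path, len(path)

if __name__ == "__main__":
    paths = ["a.jpg", "bb.png", "ccc.heic"]
    with ThreadPool(processes=pick_worker_count(cap=4)) as pool:
        for path, value in pool.imap_unordered(fake_hash, paths):
            print(path, value)
```

Since each worker holds one decoded image at a time, the cap is a rough upper bound on how many images are in memory simultaneously.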
Usage:
python3 findimagedupes.py path/to/directory1 [path/to/directory2 [...]]
hash_cache_to_csv.py
This script is really simple and should finish very quickly. It reads the
previously generated hashes.db database and writes a CSV file to the
specified output path.
Usage:
python3 hash_cache_to_csv.py path/to/hashes.db path/to/output.csv
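Conceptually, this step boils down to something like the sketch below. It assumes the database behaves like a mapping from image hash to a list of file paths, and that each CSV row starts with the hash followed by the paths; both are assumptions, and `write_duplicates_csv` is a hypothetical name rather than a function from the script.

```python
import csv
import io

def write_duplicates_csv(hash_to_paths, out_file):
    """Write one CSV row per hash that has more than one file.

    Each row holds the hash followed by every path recorded for it.
    Returns the number of rows written.
    """
    writer = csv.writer(out_file)
    rows = 0
    for image_hash, paths in sorted(hash_to_paths.items()):
        if len(paths) > 1:
            writer.writerow([image_hash, *paths])
            rows += 1
    return rows

# A plain dict stands in for the hashes.db database here.
database = {
    "f0e19c3a": ["Pictures/beach.jpg", "Nextcloud/beach.heic"],
    "17b2aa04": ["Pictures/dunes.jpg"],
}

buffer = io.StringIO()
write_duplicates_csv(database, buffer)
print(buffer.getvalue())
```

Hashes with only a single file are left out, since they have no duplicate to resolve.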
merge_gui.py
This script doesn't take any parameters; it is controlled entirely through its GUI. It is based on PyQt.
- Run the merge GUI.
  $ python3 merge_gui.py
- Find and open the previously generated CSV file containing all duplicates.
- For each image displayed,
  - Check that the left and right images are actually the same; otherwise click on Skip.
  - Compare the EXIF metadata differences displayed below the image. If unsure, you may want to keep the file with more metadata.
  - To delete the left or right image, click on the corresponding Keep Right or Keep Left button. Keep Right will delete the file on the left, and Keep Left will delete the file on the right.
  - If you want to keep the file on one side, but in the location specified on the other, you can use the Move left to right or Move right to left buttons. They will delete the picture on the target side, and move the one from the other side into its place.
All deleted files are moved to your system's Recycle Bin, so if you make a mistake, it should be easy to restore a file you deleted by accident.
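The "move" actions can be pictured roughly as in the sketch below. It is an illustration under assumptions, not the code in merge_gui.py: `move_over` and `make_folder_trash` are hypothetical helpers, and a plain folder stands in for the Recycle Bin so that the example has no dependencies.

```python
import shutil
from pathlib import Path

def move_over(source, target, trash):
    """Replace `target` with `source`, keeping the target's location.

    `trash` is a callable that disposes of a file in a recoverable way.
    """
    source, target = Path(source), Path(target)
    trash(target)                          # the picture on the target side goes away
    shutil.move(str(source), str(target))  # the other picture takes its place
    return target

def make_folder_trash(trash_dir):
    """Build a stand-in trash function that parks files in `trash_dir`."""
    trash_dir = Path(trash_dir)
    trash_dir.mkdir(parents=True, exist_ok=True)

    def trash(path):
        shutil.move(str(path), str(trash_dir / Path(path).name))

    return trash
```

Passing the disposal step in as a callable keeps the file-moving logic separate from whatever mechanism actually sends files to the Recycle Bin.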
If you exit the program before the end of the list, don't fret -- when opening the same CSV file again, the merge GUI will skip all lines for which no more files can be found.
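A minimal sketch of this resume behaviour, assuming a row is only worth showing while at least two of its files still exist (the exact rule used by the merge GUI is not spelled out here, and `pending_rows` is a hypothetical name):

```python
import csv
import os

def pending_rows(csv_path, min_files=2):
    """Yield lists of still-existing paths from a duplicates CSV.

    Rows with fewer than `min_files` existing files are skipped.
    """
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            # Fields that are not existing files (for example a hash column
            # or an already deleted image) are filtered out here.
            existing = [path for path in row if os.path.isfile(path)]
            if len(existing) >= min_files:
                yield existing
```

Because the check runs against the file system each time, no separate progress file is needed: resolving a duplicate removes or moves a file, which in turn makes its row drop out on the next run.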
Usage:
$ python3 merge_gui.py
License
This collection of scripts is licensed under the 3-clause BSD license; see the LICENSE.txt file for details.