Bytes and Beyond
Getting rid of duplicate files: strategies and tools
It’s never a bad idea to keep multiple copies of important files. However, this can lead to having far more copies than you really need, clogging up your storage space or simply being in the way.
Getting rid of excessive duplicates isn’t quite as easy as it may seem. Before deleting files, you want to make sure they really are redundant and not just sharing the same name. To delete duplicates safely, you need two things: a good duplicate finder and a strategy to get the best use out of it.
Not all duplicates are bad
Let me briefly point out that, generally speaking, duplicates are not merely good: they are essential. Windows itself keeps multiple copies of some important files, which you should not touch. Even though a desktop.ini file in every photo folder may seem redundant, each one actually serves a purpose.
It’s generally considered a good idea to keep three copies of every file that is important to you. You may have heard of the 3-2-1 backup rule – it’s fairly simple:
3 - Keep at least three copies of your data,
2 - store two backup copies on different storage media, and
1 - store one backup off-site, i.e. away from other backups.
As basic as this sounds, following through can become tough.
Let’s assume you’re a bit like me and that you have tried to follow the 3-2-1 backup rule, then let things slip for a while and now you are trying to get back on track. This probably means that you have far more copies than you actually need, and that they’re all over the place.
I, for one, suddenly found myself with more potential duplicate folders than I could sift through by hand. They were spread across multiple drives, some of them internal, others external. Some dupes were leftovers from when I emptied thumb drives before lending them to friends, others were results of a desperate attempt to rescue data from a failing drive.
Then there was an ancient backup folder of photos which followed a categorization I have since ditched. I also unearthed three rather unwieldy music collections rescued from various portable audio players before their retirement.
So, what should you do...?
Set your goals
Your specific situation will set your specific goals in getting rid of duplicates.
If you are running short on storage space, you will want to concentrate on large files only: backup containers, videos, music and photos – in descending order. Office documents are usually too small to matter much here; even photos and audio files may not be worth your time when your external drives are clogged up with old system backups.
If you have a more specific goal, such as whipping your photo folders or your sprawling music collection into shape, your priorities will be different: You’re not just looking for exact duplicates, but also for reduced-resolution copies of photos as well as older, inferior rips of your favorite albums which you have since re-encoded in a more modern format.
In any case, make sure to get your priorities straight before you start. This could take a while, so make your time count. Few things are more frustrating than abandoning your duplicate killing spree because it has become too tedious, only to return to the task a couple of months later with only a sketchy memory of what you originally set out to do.
I actually ended up making a list with my specific needs and my goals: To clear out my internal magnetic hard drive, I need to make space on my external drives, beginning with ... As I went along, I kept updating the list and checking off items I had already accomplished. This gave me a feeling of making progress, which is essential to stay motivated. Ah yes, and I listened to a lot of my favorite music to sweeten the drudgery.
A few words of wisdom
Before you start deleting anything, make sure that you actually have three copies of everything. No, really. If necessary, you should get a fresh external drive to back up the data which you are planning to analyze before going any further.
The following advice is based on bitter personal experiences.
Make sure you only copy your data: Never move files from one drive to another. Always copy first, verify that the copied files are indeed identical, then delete the originals. Why am I stressing this point? Well, I’ve had a brand-new external drive die on me right after I had moved some irreplaceable audio recordings onto it ... never doing that again.
Windows tools such as FastCopy and TeraCopy can compare the checksums of the originals and copies after transferring them. Alternatively, you can verify copied files using an external tool such as Beyond Compare or WinMerge.
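If you would rather script the verification step yourself, here is a minimal Python sketch of the copy-then-verify idea. It is not how FastCopy or TeraCopy work internally; the function names `file_digest` and `verify_copy` are made up for this example, and SHA-256 is simply my assumption for a checksum.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_dir, copy_dir):
    """Return (relative path, problem) pairs for files missing or different in the copy."""
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        copied_file = copy_dir / relative
        if not copied_file.is_file():
            problems.append((relative.as_posix(), "missing"))
        elif file_digest(source_file) != file_digest(copied_file):
            problems.append((relative.as_posix(), "differs"))
    return problems
```

Only when `verify_copy` returns an empty list would I consider deleting the originals.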
Choose your tools
Duplicate finders are available for Linux, macOS and Windows. Many of them are free, some are outrageously overpriced. My recommendation is to try the free options first and see whether they meet your needs.
dupeGuru (Linux, macOS, Windows) is a veteran among the free dupe finders: It dates all the way back to 2004 – and it looks the part. It has three operating modes: standard, music and pictures. "Standard mode" finds binary duplicates, i.e. files which match each other's size and content. "Music mode" compares audio file tags, thereby also finding duplicate songs encoded in different formats or at different bitrates.
Even though "picture mode" includes a fuzzy search algorithm, it lacks an integrated image viewer to enable immediate comparisons between potential duplicates. Windows users might want to pick SimilarImages or VisiPics instead. Both tools also are free.
If you are looking for duplicates in different folders or on different drives, be sure to mark one of your paths as "reference." This will speed up the removal process because the app will prevent you from deleting files from the reference path. If you’re looking for duplicates within the same directory, however, you should keep all paths as "normal".
AllDup is only available for Windows. It also includes fuzzy search methods for music and pictures and the interface is a bit more modern. The internal image preview is a bit hidden: You need to choose "File preview" from the Search Result menu to open it.
Similarity specializes in picture and audio comparisons and is available for macOS and Windows. Basic functionality is free, but most of the time-saving features are reserved for paying customers – including OpenCL acceleration and automatic duplicate selection. The premium version costs $20 for the first year; renewals are $10.
Online comparisons of free duplicate finders often mention the Windows-only Auslogics Duplicate File Finder. The tool’s interface looks friendly enough, but its functionality is severely limited: The Auslogics tool will only find exact binary duplicates. In addition, the installer tries to coax users into sharing "anonymous info," set the app to launch with every Windows start and install two additional apps. Overall, the whole thing is mostly a billboard.
dupeGuru and AllDup are decent for smaller files, but their comparison algorithms and memory management can stumble over large files, i.e. anything over 1 GB. I eventually ended up going with a commercial alternative. Duplicate Cleaner costs a one-time fee of US$39, offers a straightforward interface and will reliably identify binary duplicates, close matches and similar audio and image files. It has also proven to be very robust in handling large files. The only downside is that its German localization is bad – it is better to change the user interface to English.
Easily sift out binary duplicates
Finding binary duplicates is relatively straightforward. Instead of comparing every file bit-by-bit against every other file, the application calculates a checksum of each file's contents using a hashing algorithm and then compares the checksums. Calculating these hashes will take a while – the bigger the file, the longer the while.
Most duplicate finders use MD5 or SHA-1 hashes: Even though both algorithms are considered "broken" for cryptographic purposes, they are fast and still good enough for file comparisons. Unless you have reason to worry that somebody would deliberately manipulate the files on your hard drive to create fake duplicates, MD5 should be good enough.
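To make the idea concrete, here is a minimal Python sketch of hash-based duplicate detection. It does not reproduce any particular tool's implementation; the names `file_md5` and `find_binary_duplicates` are hypothetical, and grouping files by size before hashing is a shortcut I am assuming for this example so that files with a unique size are never read at all.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_binary_duplicates(root):
    """Return groups (lists of paths) of files below root with identical size and MD5."""
    # Pass 1: group by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share their size with another file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[(size, file_md5(path))].append(path)

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

The sketch only reports groups of candidates. Deciding which copy to keep is deliberately left to you.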
Before you let a duplicate finder analyze your files, you might want to check how much data you are feeding it. For a quick check, Windows users can select the folder to be analyzed in Windows Explorer, press Alt+Return and check the "size" entry in the properties dialog.
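On other platforms, or for several folders at once, a few lines of Python can do the same check. This is only a sketch: `folder_size_bytes` and `human_readable` are names I made up for this example, and the units are binary (1 KB = 1024 bytes), so the figures may differ slightly from what your file manager shows.

```python
import os


def folder_size_bytes(root):
    """Sum the sizes of all regular files below root, skipping symbolic links."""
    total = 0
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def human_readable(size_bytes):
    """Format a byte count using binary units, e.g. 1536 -> '1.5 KB'."""
    size = float(size_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

Calling `human_readable(folder_size_bytes(path))` on the folder you plan to scan gives a rough idea of whether this is a coffee-break job or an overnight one.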
Should your duplicate finder be about to process 500 GB of data or more, there is no point in sitting there, staring at the progress bar: Get yourself a coffee, come back and extrapolate how many more coffee breaks the software will spend calculating its hashes. You might even decide to let the duplicate finder do its thing overnight and check on the results in the morning.
Identify duplicate music and pictures
If you think that binary comparison takes its sweet time, wait until you start to compare pictures and music. Comparing images requires far more computing resources than simple checksum calculations. This is why you should probably start with a binary comparison (fastest), then try an image comparison excluding EXIF metadata (still fairly fast) and finally go for similarities (coffee break time).
Comparing audio files can be done in a similar manner: In Duplicate Cleaner Pro, I first go for "Match exact audio data (ignore tags)", then I proceed with "Similar audio - Compare full file". Even though these modes take time, they provide the most reliable results. "Match audio tags only" can also work (set "Similar artist", "Same title" and "Similar album"), but results depend entirely on how well your music libraries are tagged.
Digital housecleaning: eliminate empty folders
Most duplicate cleaners try to clean up after themselves: If deleting the duplicates produces empty subfolders, they will offer to delete these too. However, nested folders often result in leftovers.
The solution is either a simple batch file or a special tool. I have grown fond of the Windows freeware "Remove Empty Directories" which works quickly and provides the option to whitelist folders you may want to keep.
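If you prefer a script, the following Python sketch shows one way to handle nested leftovers. It is not how "Remove Empty Directories" works; the function name `remove_empty_dirs` and its `keep` and `dry_run` parameters are my own invention for this example. The trick is to walk the tree bottom-up, so that a folder containing nothing but empty subfolders is itself recognized as empty.

```python
import os


def remove_empty_dirs(root, keep=(), dry_run=True):
    """Find (and optionally delete) empty folders below root, deepest first.

    Folders listed in keep are never removed, and root itself is left alone.
    With dry_run=True nothing is deleted; the candidates are only returned.
    """
    root = os.path.abspath(root)
    keep = {os.path.abspath(path) for path in keep}
    removed = []
    removed_set = set()
    for folder, _subfolders, _filenames in os.walk(root, topdown=False):
        if folder == root or folder in keep:
            continue
        # Ignore children already marked for removal (relevant in dry-run mode).
        remaining = [
            entry for entry in os.listdir(folder)
            if os.path.join(folder, entry) not in removed_set
        ]
        if not remaining:
            removed.append(folder)
            removed_set.add(folder)
            if not dry_run:
                os.rmdir(folder)
    return removed
```

I would always run it with the default `dry_run=True` first and read through the returned list before deleting anything. Note that a folder holding nothing but a desktop.ini file does not count as empty here.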
What to do to prevent repeats
Thoroughly getting rid of your duplicates can take days. With large binary duplicates, most of the time is consumed by the comparisons themselves. When looking for redundant images, a lot of the time is consumed by double-checking whether matches that score below 90% similarity were detected correctly. With audio files, the comparisons can take a long time, but the process of elimination is fairly straightforward.
My personal de-duplication odyssey took way more time than I had expected. I ended up with two empty 4 TB drives. Over the course of my digital housecleaning, I also discovered that three external drives were starting to fail. If I hadn’t discovered this just in time, I would probably have lost some significant data.
Be sure to let me know in the comments how you handle your duplicates. Do you have differing strategies which have worked for you? What tools do you use?