Cautions (or why it’s hard to write a dupefinder)

This section covers good practice for safe duplicate removal. It is not intended to be specifically related to rmlint. It includes general discussion on duplicate detection and shows, by example, some of the traps that duplicate finders can fall into. This section might not only be useful for developers of dupe finders, but also educational for users that strive for best practices regarding deduplication.

Good Practice when Deleting Duplicates

Backup your data

There is a wise adage, “if it’s not backed up, it’s not important”. It’s just good practice to keep your important data backed up. In particular, any time you are contemplating doing major file reorganisations or deletions, that’s a good time to make sure that your backups are up to date.

What about when you want to clean up your backups by deleting duplicate files from your backup drive? Well as long as your duplicate file finder is reliable, you shouldn’t have any problems. Consider replacing the duplicate with a link (hardlink, symlink or reflink) to the original data. This still frees up the space, but makes it easier to find the file if and when it comes time to restore files from backup.

Measure twice, cut once

This is a popular saying amongst builders; the same goes for your files. Do at least some sort of sanity check on which files are going to be deleted. All duplicate file finders (including rmlint) are capable of identifying false positives or more serious bugs.

Beware of unusual filename characters

Even a space in a filename is capable of causing grief. Make sure your file deletion command (or the one used by your duplicate finder) has the filename properly escaped.

Consider safe removal options

Rather than deleting duplicates, consider moving them to a holding area or trash folder. The trash-cli utility is one option for this. Alternatively if using the rmlint shell script you can replace the remove_cmd section as follows to move the files to /tmp:

remove_cmd() {
    echo 'Deleting:' "$1"
    if original_check "$1" "$2"; then
        if [ -z "$DO_DRY_RUN" ]; then
            # was: rm -rf "$1"
            mv -p "$1" "/tmp$1"
        fi
    fi
}

Another safe alternative, if your files are on a copy-on-write filesystem such as btrfs, and you have linux kernel 4.2 or higher, is to use a deduplication utility such as duperemove or rmlint --dedupe:

$ duperemove -dh original duplicate
$ rmlint --dedupe original duplicate

Both of the above first verify (via the kernel) that original and duplicate are identical, then modifies duplicate to reference original‘s data extents. Note they do not change the mtime or other metadata of the duplicate (unlike hardlinks).

If you pass -c sh:link to rmlint, it will even check for you if your filesystem is capable of reflinks and emit the correct command conveniently.

You might think hardlinking as a safe alternative to deletion, but in fact hardlinking first deletes the duplicate and then creates a hardlink to the original in its place. If your duplicate finder has found a false positive, it is possible that you may lose your data.

Attributes of a Robust Duplicate Finder

(also known as “Traps for young dupe finders”)

Traversal Robustness

Path Doubles

One simple trap for a dupe finder is to not realise that it has reached the same file via two different paths. This can happen due to user inputting overlapping paths to traverse, or due to symlinks or other filesystem loops such as bind mounts. Here’s a simple “path double” example trying to trick a duplicate file finder by “accidentally” feeding it the same path twice. We’ll use fdupes for this example:

$ mkdir dir
$ echo "important" > dir/file
$ fdupes -r --delete --noprompt dir dir
[no output]
$ ls dir
file

So far so good, fdupes didn’t fall for the trick. It has a check built-in which looks at the files’ device and inode numbers, which automatically filters out path doubles.

Let’s try again using the -H option to find hardlinked duplicates:

$ fdupes -r -H --delete --noprompt dir dir
   [+] dir/file
   [-] dir/file
$ ls -l dir/
total 0

Oh dear, our file is gone! The problem is that hardlinks share the same device and inode numbers, so the inode check is turned off for this option.

Dupe finders rdfind and dupd can also be tricked with the right combination of settings:

$ rdfind -removeidentinode false -deleteduplicates true a a
[snip]
Now deleting duplicates:
Deleted 1 files.
$ ls -l dir/
total 0

$ dupd scan --path /home/foo/a --path /home/foo/a
Files scanned: 2
Total duplicates: 2
Run 'dupd report' to list duplicates.
$ dupd report
Duplicate report from database /home/foo/.dupd_sqlite:
20 total bytes used by duplicates:
  /home/foo/a/data
  /home/foo/a/data

Solution:

For a duplicate finder to be able to find hardlinked duplicates, without also inadvertently identifying a file as a duplicate or itself, a more sophisticated test is required. Path doubles will always have:

  • matching device and inode.
  • matching basename.
  • parent directories also have matching device and inode.

That seems pretty fool-proof (see rmlint example below) but please file an issue on our Issue Tracker if you find an exception.

$ echo "data" > dir/file
$ # rmlint with default settings:
$  rmlint dir dir
==> In total 2 files, whereof 0 are duplicates in 0 groups.
==> This equals 0 B of duplicates which could be removed.
$
$ # rmlint with hardlink duplicate detection enabled:
$  rmlint --hardlinked dir dir
==> In total 2 files, whereof 0 are duplicates in 0 groups.
==> This equals 0 B of duplicates which could be removed.
$ ls dir
file

Symlinks:

“Ah but I’m not silly enough to enter the same path twice” you say. Well maybe so, but there are other ways that folder traversal can reach the same path twice, for example via symbolic links:

$ mkdir dir
$ echo "important" > dir/file
$ ln -s dir link
$ fdupes -r --delete --noprompt .
$ ls -l dir/
total 0

Symlinks can make a real mess out of filesystem traversal:

$ mkdir dir
$ cd dir
$ ln -s . link
$ cd ..
$ echo "data" > dir/file
$ fdupes -rHs dir
dir/file
dir/link/file
dir/link/link/file
[snip]
dir/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/file

Set 1 of 1, preserve files [1 - 41, all]:

Solution:

During traversal, the duplicate finder should keep track of all folders visited (by device and inode number). Don’t re-traverse folders that were already traversed.

Hardlinks:

Also as noted above, replacing duplicates with hardlinks can still end badly if there are false positives. For example, using rdfind‘s the -makehardlinks option:

$ echo "data" > dir/file
$ rdfind -removeidentinode false -makehardlinks true dir dir
[snip]
It seems like you have 2 files that are not unique
Totally, 5 b can be reduced.
Now making results file results.txt
Now making hard links.
failed to make hardlink dir/file to dir/file
$ ls -l dir
total 0

Solution:

Don’t find false positives. Check files are on same filesystem before trying to create hardlink. Temporarily rename the duplicate before creating the hardlink and then deleting the renamed file.

Collision Robustness

Duplicate detection by file hash

If a duplicate finder uses file hashes to identify duplicates, there is a very small risk that two different files have the same hash value. This is called a hash collision and can result in the two files being falsely flagged as duplicates.

Several duplicate finders use the popular MD5 Hash, which is 128 bits long. With a 128-bit hash, if you have a million sets of same-size files, each set containing a million different files, the chance of a hash collision is about 0.000 000 000 000 000 000 147%. To get a 0.1% chance of a hash collision you would need nine hundred thousand million (\(9\times10^{11}\)) groups of (\(9\times10^{11}\)) files each, or one group of eight hundred thousand million million (\(8\times10^{17}\)) files.

If someone had access to your files, and wanted to create a malicious duplicate, they could potentially do something like this (based on http://web.archive.org/web/20071226014140/http://www.cits.rub.de/MD5Collisions/):

$ mkdir test && cd test
$ # get two different files with same md5 hash:
$ wget http://web.archive.org/web/20071226014140/http://www.cits.rub.de/imperia/md/content/magnus/order.ps
$ wget http://web.archive.org/web/20071226014140/http://www.cits.rub.de/imperia/md/content/magnus/letter_of_rec.ps
$ md5sum *  # verify that they have the same md5sum
a25f7f0b29ee0b3968c860738533a4b9  letter_of_rec.ps
a25f7f0b29ee0b3968c860738533a4b9  order.ps
$ sha1sum * # verify that they are not actually the same
07835fdd04c9afd283046bd30a362a6516b7e216  letter_of_rec.ps
3548db4d0af8fd2f1dbe02288575e8f9f539bfa6  order.ps
$ rmlint -a md5 . -o pretty  # run rmlint using md5 hash for duplicate file detection
# Duplicate(s):
    ls '/home/foo/test/order.ps'
    rm '/home/foo/test/letter_of_rec.ps'
$ rmlint test -a sha1 -o summary   # run using sha1 hash
==> In total 2 files, whereof 0 are duplicates in 0 groups.

If your intention was to free up space by hardlinking the duplicate to the original, you would end up with two hardlinked files, one called order.ps and the other called letter_of_rec.ps, both containing the contents of order.ps.

Solution:

fdupes detects duplicates using MD5 Hashes, but eliminates the collision risk by doing a byte-wise comparison of the duplicates detected. This means each file is read twice, which can tend to slow things down.

dupd uses direct file comparison, unless there are more than 3 files in a set of duplicates, in which case it uses MD5 only.

If you use rmlint‘s sha1 hash features, which features 160 bit output, you need at least \(5.4\times10^{22}\) files before you get a \(0.1\%\) probability of collision. rmlint‘s -p option uses SHA512 (\(5.2\times10^{75}\) files for \(0.1\%\) risk), while rmlint‘s -pp option uses direct file comparison to eliminate the risk altogether. Refer to the Benchmarks chapter for speed and memory overhead implications.

Unusual Characters Robustness

Spaces, commas, nonprinting characters etc can all potentially trip up a duplicate finder or the subsequent file deletion command. For example:

$ mkdir test
$ echo "data" > 'test/\t\r\"\b\f\\,.'
$ cp test/\\t\\r\\\"\\b\\f\\\\\,. test/copy  # even just copying filenames like this is ugly!
$ ls -1 test/
copy
\t\r\"\b\f\\,.
$ md5sum test/*  # md5's output gets a little bit mangled by the odd characters
6137cde4893c59f76f005a8123d8e8e6  test/copy
\6137cde4893c59f76f005a8123d8e8e6  test/\\t\\r\\"\\b\\f\\\\,.
$ dupd scan --path /home/foo/test
SKIP (comma) [/home/foo/test/\t\r\"\b\f\\,.]
Files scanned: 1
Total duplicates: 0

Solution: Be careful!

“Seek Thrash” Robustness

Duplicate finders use a range of strategies to find duplicates. It is common to reading and compare small increments of potential duplicates. This avoids the need to read the whole file if the files differ in the first few megabytes, so this can give a major speedup in some cases. However, in the case of hard disk drives, constantly reading small increments from several files at the same time causes the hard drive head to have to jump around (“seek thrash”).

Here are some speed test results showing relative speed for scanning my /usr folder (on SSD) and an HDD copy of same. The speed ratio gives an indication of how effectively the search algorithm manages disk seek overheads:

Program /usr (SSD) /mnt/usr (HDD) Ratio
dupd 48s 1769s 36.9
fdupes 65s 486s 7.5
rmlint 38s 106s 2.8
rmlint -pp 40s 139s 3.5

Note

Before each run, disk caches were cleared:

$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Solution:

Achieving good speeds on HDD’s requires a balance between small file increments early on, then switching to bigger file increments. Fiemap information (physical location of files on the disk) can be used to sort the files into an order that reduces disk seek times.

Memory Usage Robustness

When scanning very large filesystems, duplicate finders may have to hold a large amount of information in memory at the same time. Once this information exceeds the computers’ RAM, performance will suffer significantly. dupd handles this quite nicely by storing a lot of the data in a sqlite database file, although this may have a slight performance penalty due to disk read/write time to the database file. rmlint uses a path tree structure to reduce the memory required to store all traversed paths.