This section covers good practice for safe duplicate removal. It is not intended to
be specifically related to rmlint
. It includes general discussion on duplicate
detection and shows, by example, some of the traps that duplicate finders can fall into.
This section might not only be useful for developers of dupe finders, but also
educational for users that strive for best practices regarding deduplication.
There is a wise adage, “if it’s not backed up, it’s not important”. It’s just good practice to keep your important data backed up. In particular, any time you are contemplating doing major file reorganisations or deletions, that’s a good time to make sure that your backups are up to date.
What about when you want to clean up your backups by deleting duplicate files from your backup drive? Well as long as your duplicate file finder is reliable, you shouldn’t have any problems. Consider replacing the duplicate with a link (hardlink, symlink or reflink) to the original data. This still frees up the space, but makes it easier to find the file if and when it comes time to restore files from backup.
This is a popular saying amongst builders; the same goes for your files. Do at least some
sort of sanity check on which files are going to be deleted. All duplicate file finders
(including rmlint
) are capable of identifying false positives or more serious bugs.
Even a space in a filename is capable of causing grief. Make sure your file deletion command (or the one used by your duplicate finder) has the filename properly escaped.
Rather than deleting duplicates, consider moving them to a holding area or trash
folder. The trash-cli utility is one option for this. Alternatively if
using the rmlint
shell script you can replace the remove_cmd
section as
follows to move the files to /tmp:
remove_cmd() {
echo 'Deleting:' "$1"
if original_check "$1" "$2"; then
if [ -z "$DO_DRY_RUN" ]; then
# was: rm -rf "$1"
mv -p "$1" "/tmp$1"
fi
fi
}
Another safe alternative, if your files are on a btrfs
filesystem and you have linux
kernel 4.2 or higher, is to reflink the duplicate to the original. You can do this via
cp --reflink
or using rmlint --btrfs-clone
:
$ cp --reflink=always original duplicate # deletes duplicate and replaces it with reflink copy of original
$ rmlint --btrfs-clone original duplicate # does and in-place clone
If you pass -c sh:link
to rmlint
, it will even check for you if your
filesystem is capable of reflinks and emit the correct command conveniently.
The second option is actually safer because it verifies (via the kernel) that the files are identical before creating the reflink. Also it does not change the mtime or other metadata of the duplicate.
You might think hardlinking as a safe alternative to deletion, but in fact hardlinking first deletes the duplicate and then creates a hardlink to the original in its place. If your duplicate finder has found a false positive, it is possible that you may lose your data.
(also known as “Traps for young dupe finders”)
Path Doubles
One simple trap for a dupe finder is to not realise that it has reached the same file via two different paths. This can happen due to user inputting overlapping paths to traverse, or due to symlinks or other filesystem loops such as bind mounts. Here’s a simple “path double” example trying to trick a duplicate file finder by “accidentally” feeding it the same path twice. We’ll use fdupes for this example:
$ mkdir dir
$ echo "important" > dir/file
$ fdupes -r --delete --noprompt dir dir
[no output]
$ ls dir
file
So far so good, fdupes
didn’t fall for the trick. It has a check built-in which looks at
the files’ device and inode numbers, which automatically filters out path doubles.
Let’s try again using the -H option to find hardlinked duplicates:
$ fdupes -r -H --delete --noprompt dir dir
[+] dir/file
[-] dir/file
$ ls -l dir/
total 0
Oh dear, our file is gone! The problem is that hardlinks share the same device and inode numbers, so the inode check is turned off for this option.
Dupe finders rdfind
and dupd
can also be tricked with the right combination of settings:
$ rdfind -removeidentinode false -deleteduplicates true a a
[snip]
Now deleting duplicates:
Deleted 1 files.
$ ls -l dir/
total 0
$ dupd scan --path /home/foo/a --path /home/foo/a
Files scanned: 2
Total duplicates: 2
Run 'dupd report' to list duplicates.
$ dupd report
Duplicate report from database /home/foo/.dupd_sqlite:
20 total bytes used by duplicates:
/home/foo/a/data
/home/foo/a/data
Solution:
For a duplicate finder to be able to find hardlinked duplicates, without also inadvertently identifying a file as a duplicate or itself, a more sophisticated test is required. Path doubles will always have:
That seems pretty fool-proof (see rmlint
example below) but please file an issue
on our Issue Tracker if you find an exception.
$ echo "data" > dir/file
$ # rmlint with default settings:
$ rmlint dir dir
==> In total 2 files, whereof 0 are duplicates in 0 groups.
==> This equals 0 B of duplicates which could be removed.
$
$ # rmlint with hardlink duplicate detection enabled:
$ rmlint --hardlinked dir dir
==> In total 2 files, whereof 0 are duplicates in 0 groups.
==> This equals 0 B of duplicates which could be removed.
$ ls dir
file
Symlinks:
“Ah but I’m not silly enough to enter the same path twice” you say. Well maybe so, but there are other ways that folder traversal can reach the same path twice, for example via symbolic links:
$ mkdir dir
$ echo "important" > dir/file
$ ln -s dir link
$ fdupes -r --delete --noprompt .
$ ls -l dir/
total 0
Symlinks can make a real mess out of filesystem traversal:
$ mkdir dir
$ cd dir
$ ln -s . link
$ cd ..
$ echo "data" > dir/file
$ fdupes -rHs dir
dir/file
dir/link/file
dir/link/link/file
[snip]
dir/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/file
Set 1 of 1, preserve files [1 - 41, all]:
Solution:
During traversal, the duplicate finder should keep track of all folders visited (by device and inode number). Don’t re-traverse folders that were already traversed.
Hardlinks:
Also as noted above, replacing duplicates with hardlinks can still end badly if there are
false positives. For example, using rdfind
’s the -makehardlinks
option:
$ echo "data" > dir/file
$ rdfind -removeidentinode false -makehardlinks true dir dir
[snip]
It seems like you have 2 files that are not unique
Totally, 5 b can be reduced.
Now making results file results.txt
Now making hard links.
failed to make hardlink dir/file to dir/file
$ ls -l dir
total 0
Solution:
Don’t find false positives. Check files are on same filesystem before trying to create hardlink. Temporarily rename the duplicate before creating the hardlink and then deleting the renamed file.
Duplicate detection by file hash
If a duplicate finder uses file hashes to identify duplicates, there is a very small risk that two different files have the same hash value. This is called a hash collision and can result in the two files being falsely flagged as duplicates.
Several duplicate finders use the popular MD5 Hash, which is 128 bits
long. With a 128-bit hash, if you have a million sets of same-size files, each set containing
a million different files, the chance of a hash collision is about
0.000 000 000 000 000 000 147%
. To get a 0.1%
chance of a hash collision you would
need nine hundred thousand million (\(9\times10^{11}\)) groups of (\(9\times10^{11}\)) files each, or one group
of eight hundred thousand million million (\(8\times10^{17}\)) files.
If someone had access to your files, and wanted to create a malicious duplicate, they could potentially do something like this (based on http://web.archive.org/web/20071226014140/http://www.cits.rub.de/MD5Collisions/):
$ mkdir test && cd test
$ # get two different files with same md5 hash:
$ wget http://web.archive.org/web/20071226014140/http://www.cits.rub.de/imperia/md/content/magnus/order.ps
$ wget http://web.archive.org/web/20071226014140/http://www.cits.rub.de/imperia/md/content/magnus/letter_of_rec.ps
$ md5sum * # verify that they have the same md5sum
a25f7f0b29ee0b3968c860738533a4b9 letter_of_rec.ps
a25f7f0b29ee0b3968c860738533a4b9 order.ps
$ sha1sum * # verify that they are not actually the same
07835fdd04c9afd283046bd30a362a6516b7e216 letter_of_rec.ps
3548db4d0af8fd2f1dbe02288575e8f9f539bfa6 order.ps
$ rmlint -a md5 . -o pretty # run rmlint using md5 hash for duplicate file detection
# Duplicate(s):
ls '/home/foo/test/order.ps'
rm '/home/foo/test/letter_of_rec.ps'
$ rmlint test -a sha1 -o summary # run using sha1 hash
==> In total 2 files, whereof 0 are duplicates in 0 groups.
If your intention was to free up space by hardlinking the duplicate to the original, you would end up with two
hardlinked files, one called order.ps
and the other called
letter_of_rec.ps
, both containing the contents of order.ps
.
Solution:
fdupes
detects duplicates using MD5 Hashes, but eliminates the collision
risk by doing a byte-wise comparison of the duplicates detected. This means
each file is read twice, which can tend to slow things down.
dupd
uses direct file comparison, unless there are more than 3 files in a set of duplicates, in which
case it uses MD5 only.
If you use rmlint
’s sha1
hash features, which features 160 bit output,
you need at least \(5.4\times10^{22}\) files before you get a \(0.1\%\)
probability of collision. rmlint
’s -p
option uses SHA512
(\(5.2\times10^{75}\) files for \(0.1\%\) risk), while rmlint
’s
-pp
option uses direct file comparison to eliminate the risk altogether.
Refer to the Benchmarks chapter for speed and memory overhead
implications.
Spaces, commas, nonprinting characters etc can all potentially trip up a duplicate finder or the subsequent file deletion command. For example:
$ mkdir test
$ echo "data" > 'test/\t\r\"\b\f\\,.'
$ cp test/\\t\\r\\\"\\b\\f\\\\\,. test/copy # even just copying filenames like this is ugly!
$ ls -1 test/
copy
\t\r\"\b\f\\,.
$ md5sum test/* # md5's output gets a little bit mangled by the odd characters
6137cde4893c59f76f005a8123d8e8e6 test/copy
\6137cde4893c59f76f005a8123d8e8e6 test/\\t\\r\\"\\b\\f\\\\,.
$ dupd scan --path /home/foo/test
SKIP (comma) [/home/foo/test/\t\r\"\b\f\\,.]
Files scanned: 1
Total duplicates: 0
Solution: Be careful!
Duplicate finders use a range of strategies to find duplicates. It is common to reading and compare small increments of potential duplicates. This avoids the need to read the whole file if the files differ in the first few megabytes, so this can give a major speedup in some cases. However, in the case of hard disk drives, constantly reading small increments from several files at the same time causes the hard drive head to have to jump around (“seek thrash”).
Here are some speed test results showing relative speed for scanning my /usr
folder (on SSD) and an HDD copy of same.
The speed ratio gives an indication of how effectively the search algorithm manages disk seek overheads:
Program | /usr (SSD) |
/mnt/usr (HDD) |
Ratio |
---|---|---|---|
dupd |
48s | 1769s | 36.9 |
fdupes |
65s | 486s | 7.5 |
rmlint |
38s | 106s | 2.8 |
rmlint -pp |
40s | 139s | 3.5 |
Note
Before each run, disk caches were cleared:
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
Solution:
Achieving good speeds on HDD’s requires a balance between small file increments early on, then switching to bigger file increments. Fiemap information (physical location of files on the disk) can be used to sort the files into an order that reduces disk seek times.
When scanning very large filesystems, duplicate finders may have to hold a large amount of information in
memory at the same time. Once this information exceeds the computers’ RAM, performance will suffer
significantly. dupd
handles this quite nicely by storing a lot of the data in a sqlite database file,
although this may have a slight performance penalty due to disk read/write time to the database file.
rmlint
uses a path tree structure to reduce the memory required to store all traversed paths.