rmlint
find duplicate files and other space waste efficiently
SYNOPSIS
rmlint [TARGET_DIR_OR_FILES …] [//] [TAGGED_TARGET_DIR_OR_FILES …] [-] [OPTIONS]
DESCRIPTION
rmlint finds space waste and other broken things on your filesystem.
Its main focus is finding duplicate files and directories.
It is able to find the following types of lint:
- Duplicate files and directories (and as a by-product unique files).
- Non-stripped binaries (binaries with debug symbols; needs to be enabled explicitly).
- Broken symbolic links.
- Empty files and directories (also nested empty directories).
- Files with broken user or group id.
rmlint itself WILL NOT DELETE ANY FILES. It does, however, produce executable
output (for example a shell script) to help you delete the files if you want
to. Another design principle is that it should work well together with other
tools like find. Therefore we do not replicate features of other well-known
programs, such as pattern matching or finding duplicate filenames.
However, we provide many convenience options for common use cases that are hard
to build from scratch with standard tools.
In order to find the lint, rmlint is given one or more directories to traverse.
If no directories or files are given, the current working directory is assumed.
By default, rmlint will ignore hidden files and will not follow symlinks (see
Traversal Options). rmlint will first find "other lint" and then search
the remaining files for duplicates.
rmlint tries to be helpful by guessing which file of a group of duplicates
is the original (i.e. the file that should not be deleted). It does this by using
different sorting strategies that can be controlled via the -S option. By
default it chooses the first-named path on the commandline. If two duplicates
come from the same path, it will also apply different fallback sort strategies
(see the documentation of the -S option).
This behaviour can also be overridden if you know that a certain directory
contains duplicates and another one the originals. In this case you write the
original directory after specifying a single // on the commandline.
Everything that comes after is a preferred (or "tagged") directory. If there
are duplicates from an unpreferred and from a preferred directory, the preferred
one will always count as the original. Special options can also be used to always
keep files in preferred directories (-k) and to only find duplicates that
are present in both given directories (-m).
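For example, a minimal sketch of this tagging mechanism (the backup/ and originals/ paths are placeholders): everything under originals/ is tagged and kept, and duplicates found in backup/ are reported for removal:
$ rmlint backup/ // originals/ -k   # originals/ is tagged; -k keeps all tagged files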
We advise new users to have a short look at all options rmlint has to
offer, and maybe test some examples before letting it run on production data.
WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR DATA. There are some extended
examples at the end of this manual, and the description of each option that is
not self-explanatory also tries to give examples.
OPTIONS
General Options
-T --types="list" (default: defaults):
Configure the types of lint rmlint will look for. The list string is a comma-separated list of lint types or lint groups (other separators like semicolon or space also work, though). A lint group (such as all or defaults) may be given at the beginning of the list; individual lint types can then be added, or deselected by prefixing them with a -.
WARNING: It is good practice to enclose the list in single or double quotes. In obscure cases argument parsing might fail in weird ways, especially when using spaces as separator.
Examples:
$ rmlint -T "df,dd"       # Only search for duplicate files and directories
$ rmlint -T "all -df -dd" # Search for all lint except duplicate files and dirs.
-o --output=spec / -O --add-output=spec (default: -o sh:rmlint.sh -o pretty:stdout -o summary:stdout -o json:rmlint.json):
Configure the way rmlint outputs its findings. A spec is either just a formatter name or formatter:path, where path may be a regular file, stdout or stderr; if no path is given, stdout is assumed. If -o is used, the default outputs listed above are replaced by the given spec; -O adds an output on top of the defaults. See the FORMATTERS section for the available formatters and their options.
Examples:
$ rmlint -o json                # Stream the json output to stdout
$ rmlint -O csv:/tmp/rmlint.csv # Output an extra csv file to /tmp
-c --config=spec[=value] (default: none):
Configure a formatter. This option can be used to fine-tune the behaviour of the existing formatters. See the FORMATTERS section for details on the available keys. If the value is omitted, it is set to a value meaning "enabled".
Examples:
$ rmlint -c sh:link           # Smartly link duplicates instead of removing
$ rmlint -c progressbar:fancy # Use a different theme for the progressbar
-z --perms[=[rwx]] (default: no check):
Only look into a file if it is readable, writable or executable by the current user. Which of these permissions are required is given as an argument consisting of the letters "rwx". If no argument is given, "rw" is assumed. Note that r does basically nothing user-visible, since unreadable files cannot be checked for duplicates anyway. By default this check is not done.
-a --algorithm=name (default: blake2b):
Choose the algorithm to use for finding duplicate files. The algorithm can be either paranoid (byte-by-byte file comparison) or one of several file hash algorithms used to identify duplicates. The following hash families are available (in approximate descending order of cryptographic strength): sha3, blake, sha, highway, md, metro, murmur, xxhash.
The weaker hash functions still offer excellent distribution properties, but are potentially more vulnerable to malicious crafting of duplicate files. The full list of hash functions (in decreasing order of checksum length) is:
512-bit: blake2b, blake2bp, sha3-512, sha512
384-bit: sha3-384
256-bit: blake2s, blake2sp, sha3-256, sha256, highway256, metro256, metrocrc256
160-bit: sha1
128-bit: md5, murmur, metro, metrocrc
64-bit: highway64, xxhash
The use of a 64-bit hash length for detecting duplicate files is not recommended, due to the probability of a random hash collision.
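For example, to pick one of the algorithms listed above explicitly:
$ rmlint -a sha256 .     # use sha256 instead of the default blake2b
$ rmlint -a paranoid .   # full byte-by-byte comparison instead of hashing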
-p --paranoid / -P --less-paranoid (default):
Increase or decrease the paranoia of rmlint's duplicate detection: -p is a shortcut for full byte-by-byte comparison (--algorithm=paranoid), while -P switches to a faster, but cryptographically weaker, hash algorithm.
-v --loud / -V --quiet :
Increase or decrease the verbosity. You can pass these options several times. This only affects rmlint's log messages on stderr; it has no effect on the lint output itself.
-g --progress / -G --no-progress (default):
Show a progressbar with sane defaults. Convenience shortcut for -o progressbar -o summary -o sh:rmlint.sh -o json:rmlint.json.
NOTE: This flag clears all previous outputs. If you want additional outputs, specify them after this flag using -O.
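For example, to keep the progressbar and still add an extra machine-readable output (the csv path is just a placeholder):
$ rmlint -g -O csv:/tmp/rmlint.csv large_dir/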
-D --merge-directories (default: disabled):
Makes rmlint use a special mode where all found duplicates are collected and checked if whole directory trees are duplicates. Use with caution: you should always make sure that the investigated directories are not modified while rmlint or its removal scripts are running.
IMPORTANT: Definition of equal: two directories are considered equal by rmlint if they contain the same content, no matter how the files containing that content are named.
Output is deferred until all duplicates were found. Duplicate directories are printed first, followed by any remaining duplicate files that are isolated or inside of any original directories. --rank-by applies to directories too, but 'p' or 'P' (path index) has no defined (i.e. useful) meaning. Sorting only takes place when the number of preferred files in the directory differs.
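A brief sketch (the paths are placeholders); -D can be combined with tagged originals just like plain duplicates:
$ rmlint -D ~/music_copies/            # also report whole duplicate directory trees
$ rmlint -D backup/ // originals/ -k   # keep everything below the tagged originals/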
-j --honour-dir-layout (default: disabled):
Only recognize directories as duplicates that have the same path layout. In other words: all duplicates that build the duplicate directory must have the same path from the root of each respective directory. This flag makes no sense without --merge-directories.
-y --sort-by=order (default: none):
During output, sort the found duplicate groups by the criteria described by order. order is a string that may consist of one or more criteria letters. A letter may also be written uppercase (similar to --rank-by) to reverse the sort order for that criterion.
-w --with-color (default) / -W --no-with-color :
Use color escapes for pretty output or disable them. If you pipe rmlint's output to a file or a pipe, -W is assumed automatically.
-h --help / -H --show-man :
Show a shorter reference help text (-h) or this full man page (-H).
--version :
Print the version of rmlint. Includes git revision and compile time features. Please include this when giving feedback to us.
Traversal Options
-s --size=range (default: "1"):
Only consider files as duplicates in a certain size range. The format of range is min-max, where both ends can be specified as a number with an optional multiplier (such as KB, MB or GB). The size format is about the same as dd(1) uses. A valid example would be: "100KB-2M". This limits duplicates to a range from 100 Kilobyte to 2 Megabyte.
It's also possible to specify only one size. In this case the size is interpreted as "bigger or equal". If you want to filter for files up to a certain size, specify a range starting at 0 instead (e.g. "0-2GB").
Edge case: the default of "1" excludes empty files from the duplicate search. Normally these are treated specially by rmlint by reporting them as the emptyfiles lint type.
-d --max-depth=depth (default: INF):
Only recurse up to this depth. A depth of 1 would disable recursion and is equivalent to a directory listing. A depth of 2 would also consider all children directories, and so on.
-l --hardlinked (default) / --keep-hardlinked / -L --no-hardlinked :
Hardlinked files are treated as duplicates by default (--hardlinked). If --no-hardlinked is given, only one file (of a set of hardlinked files) is considered; all the others are ignored. This means they are not deleted and also not even shown in the output. The "highest ranked" file of the set is the one that is considered.
-f --followlinks / -F --no-followlinks / -@ --see-symlinks (default):
Controls how symbolic links are handled during traversal: -f follows symbolic links, -F ignores them completely, and -@ (the default) does not follow symlinks but still sees the link files themselves, so duplicate symlinks can be detected.
-x --no-crossdev / -X --crossdev (default):
Stay always on the same device (-x), or allow crossing mountpoints (-X). The latter is the default.
-r --hidden / -R --no-hidden (default) / --partial-hidden :
Also traverse hidden files and directories? This is often not a good idea, since directories like .git would be investigated too. With --partial-hidden, hidden files and directories are only considered if they are inside duplicate directories (see --merge-directories).
-b --match-basename :
Only consider those files as dupes that have the same basename. See also man 1 basename.
-B --unmatched-basename :
Only consider those files as dupes that do not share the same basename. See also man 1 basename.
-e --match-with-extension / -E --no-match-with-extension (default):
Only consider those files as dupes that have the same file extension. For example, two photos would only match if both carry the same extension such as .png.
-i --match-without-extension / -I --no-match-without-extension (default):
Only consider those files as dupes that have the same basename minus the file extension. For example, photo.png and photo.jpg would be considered as potential duplicates of each other.
-n --newer-than-stamp=<timestamp_filename> / -N --newer-than=<iso8601_timestamp_or_unix_timestamp> :
Only consider files (and their size siblings for duplicates) newer than a certain modification time (mtime). The age barrier may be given as seconds since the epoch or as an ISO8601 timestamp like 2014-09-08T00:12:32+0200.
Note that -n reads the timestamp from the given file, while -N takes the timestamp directly from the commandline.
Note: you can make rmlint write out a compatible timestamp with the stamp formatter (see FORMATTERS), as shown below.
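A minimal sketch of an incremental setup using the stamp formatter (stamp.file and backup/ are placeholder paths):
$ rmlint backup/ -O stamp:stamp.file                # full run; also writes a timestamp
$ rmlint backup/ -n stamp.file -O stamp:stamp.file  # later runs only consider newer files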
Original Detection Options
-k --keep-all-tagged / -K --keep-all-untagged :
Don't delete any duplicates that are in tagged paths (-k) or in untagged paths (-K). Tagged paths are those named after the // separator on the commandline.
-m --must-match-tagged / -M --must-match-untagged :
Only look for duplicates of which at least one is in one of the tagged paths (-m), or of which at least one is in one of the untagged paths (-M). Tagged paths are those named after //.
-S --rank-by=criteria (default: pOma):
Sort the files in a group of duplicates into originals and duplicates by one or more criteria. Each criterion is defined by a single letter (except r and x, which expect a regex pattern after the letter). Multiple criteria may be given as a string, where the first criterion is the most important. If one criterion cannot decide between original and duplicate, the next one is tried.
Alphabetical sort will only use the basename of the file and ignore its case.
For more fine-grained control, it is possible to give a regular expression to sort by. This can be useful when you know a common fact that identifies original paths (like a path component shared by all originals). To use the regular expression you simply enclose it in the criteria string by adding <REGULAR_EXPRESSION> after specifying r or x; r matches against the full path, x against the basename only.
Warning: when using r or x, try to make your regex as specific as possible! See the regex-based examples at the end of this manual.
Caching
--replay :
Read an existing json file and re-output it instead of re-scanning the filesystem. This is very useful if you want to reformat, refilter or resort the output you got from a previous run. Usage is simple: just pass --replay together with the path of a previously written json file (rmlint.json by default). If you want to view only the duplicates of certain subdirectories, just pass them on the commandline as usual; tagging paths with // works here as well. By design, some options (mostly those that influence traversal and hashing) will not have any effect in replay mode.
-C --xattr :
Shortcut for --xattr-read --xattr-write. See the individual options below for more details and some examples.
--xattr-read / --xattr-write / --xattr-clear :
Read or write cached checksums from the extended file attributes of a file. This feature can be used to speed up consecutive runs.
CAUTION: This could potentially lead to false positives if file contents are somehow modified without changing the file modification time. rmlint uses the mtime to determine whether a cached checksum is outdated. This is not a problem if you use the clone or reflink operation on a filesystem like btrfs: there, an outdated checksum entry would simply lead to some duplicate work done in the kernel but would do no harm otherwise.
NOTE: Many tools do not support extended file attributes properly, resulting in a loss of the information when copying the file or editing it.
NOTE: You can combine --xattr-read and --xattr-write in a single run, or simply use --xattr to do both.
Usage example:
$ rmlint large_file_cluster/ -U --xattr-write # first run should be slow.
$ rmlint large_file_cluster/ --xattr-read # second run should be faster.
# Or do the same in just one run:
$ rmlint large_file_cluster/ --xattr
-U --write-unfinished :
Include files in the output that have not been hashed fully, i.e. files that do not appear to have a duplicate. Note that this will not include all files that were traversed, only those that at least entered the hashing stage. This is mainly useful in conjunction with --xattr-write / --xattr-read (see the usage example above). If you want to output unique files, please look into the uniques formatter instead.
Rarely used, miscellaneous options
-t --threads=N (default: 16):
The number of threads to use during file tree traversal and hashing.
-u --limit-mem=size :
Apply a maximum amount of memory to use for hashing and --paranoid mode. The total amount of memory might still exceed this limit, though, especially when setting it very low. The size is specified in the same format as for --size (e.g. "1G").
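A rough sketch (the limit and path are placeholders), capping memory while using byte-by-byte comparison on a large tree:
$ rmlint --algorithm=paranoid --limit-mem=1G big_dir/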
-q --clamp-low=[fac.tor|percent%|offset] (default: 0) / -Q --clamp-top=[fac.tor|percent%|offset] (default: 1.0):
The argument can be passed as a factor (a number containing a "."), as a percentage (a number suffixed by "%") or as an absolute byte offset. Only look at the content of files in the range from the low clamp value to the high clamp value. This is useful in a few cases where a file consists of a constant-sized header or footer; with this option you can just compare the data in between. It might also be useful for approximate comparison, where it suffices that the files are the same in the middle part. See the example below.
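A short sketch (the offsets are arbitrary): skip a constant 64-byte header and ignore the last ten percent of each file:
$ rmlint -q 64 -Q 90% some_dir/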
-Z --mtime-window=T (default: -1):
Only consider those files as duplicates that have the same content and the same modification time (mtime) within a certain window of T seconds. If T is 0, both files need to have the same mtime. For T=1 they may differ by one second, and so on. If the window size is negative, the mtime of duplicates will not be considered. T may be a floating point number.
However, with three (or more) files, the mtime difference between two duplicates can be bigger than the mtime window T, i.e. several files may be chained together by the window. Example: if T is 1, the four files fooA (mtime: 00:00:00), fooB (00:00:01), fooC (00:00:02), fooD (00:00:03) would all belong to the same duplicate group, although the mtime of fooA and fooD differs by 3 seconds.
--with-fiemap (default) / --without-fiemap :
Enable or disable reading the file extents on rotational disks in order to optimize disk access patterns. If this feature is not available, it is disabled automatically.
FORMATTERS
csv
: Output all found lint as a comma-separated-value list. Available options:
- no_header: Do not write a first line describing the column headers.
- unique: Include unique files in the output.
sh
: Output all found lint as a shell script. This formatter is activated as default.
Available options:
cmd: Specify a user-defined command to run on duplicates. The command can be any valid /bin/sh-expression. The duplicate path and original path can be accessed via "$1" and "$2". The command will be written to the user_command function in the sh-file produced by rmlint.
handler: Define a comma-separated list of handlers to try on duplicate files in that given order until one handler succeeds. Handlers are just the name of a way of getting rid of the file and can be any of the following:
clone
: For reflink-capable filesystems only. Try to clone both files with the FIDEDUPERANGE ioctl(3p) (or BTRFS_IOC_FILE_EXTENT_SAME on older kernels). This will free up duplicate extents. Needs at least kernel 4.2. Use this option when you only have read-only access to a btrfs filesystem but still want to deduplicate it. This is usually the case for snapshots.
reflink
: Try to reflink the duplicate file to the original. See also --reflink in man 1 cp. Fails if the filesystem does not support it.
hardlink
: Replace the duplicate file with a hardlink to the original file. The resulting files will have the same inode number. Fails if both files are not on the same partition. You can use ls -i to show the inode number of a file and find -samefile <path> to find all hardlinks for a certain file.
symlink
: Tries to replace the duplicate file with a symbolic link to the original. This handler never fails.
remove
: Remove the file using rm -rf (-r for duplicate dirs). This handler never fails.
usercmd
: Use the provided user-defined command (-c sh:cmd=something). This handler never fails.
Default is remove.
link: Shortcut for -c sh:handler=clone,reflink,hardlink,symlink. Use this if you are on a reflink-capable system.
hardlink: Shortcut for -c sh:handler=hardlink,symlink. Use this if you want to hardlink files, but want a fallback for duplicates that lie on different devices.
symlink: Shortcut for -c sh:handler=symlink. Use this as a last straw.
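For instance, a sketch of a custom handler chain using the handler key documented above (the directory is a placeholder); it behaves like the link shortcut, but without the symlink fallback:
$ rmlint -c sh:handler=clone,reflink,hardlink some_dir/
$ ./rmlint.sh   # the generated script applies the first handler that succeeds for each duplicate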
json
: Print a JSON-formatted dump of all found reports. Outputs all lint as a json document. The document is a list of dictionaries, where the first and last elements are the header and the footer. Everything in between are data dictionaries. Available options:
- unique: Include unique files in the output.
- no_header=[true|false]: Print the header with metadata (default: true)
- no_footer=[true|false]: Print the footer with statistics (default: true)
- oneline=[true|false]: Print one json document per line (default: false)
This is useful if you plan to parse the output line-by-line, e.g. while rmlint is still running.
This formatter is extremely useful if you're in need of scripting more complex behaviour that is not directly possible with rmlint's built-in options. A very handy tool here is jq. Here is an example to output all original files directly from a rmlint run:
$ rmlint -o json | jq -r '.[1:-1][] | select(.is_original) | .path'
py
: Outputs a python script and a JSON document, just like the json formatter. The JSON document is written to .rmlint.json; executing the script will make it read from there. This formatter is mostly intended for complex use-cases where the lint needs special handling that you define in the python script. Therefore the python script can be modified to do things standard rmlint is not able to do easily.
uniques
: Outputs all unique paths found during the run, one path per line. This is often useful for scripting purposes. Available options:
- print0: Do not put newlines between paths but zero bytes.
stamp
: Outputs a timestamp of the time
rmlint
was run. See also the--newer-than
and --newer-than-stamp options.
Available options:
- iso8601=[true|false]: Write an ISO8601-formatted timestamp instead of plain seconds since the epoch.
progressbar
: Shows a progressbar. This is meant for use with stdout or stderr [default]. See also:
-g
(--progress
) for a convenience shortcut option. Available options:
- update_interval=number: Number of milliseconds to wait between updates. Higher values use less resources (default 50).
- ascii: Do not attempt to use unicode characters, which might not be supported by some terminals.
- fancy: Use a more fancy style for the progressbar.
pretty
: Shows all found items in realtime, nicely colored. This formatter is activated as default.
summary
: Shows counts of files and their respective size after the run. Also lists all written output files.
fdupes
: Prints an output similar to the popular duplicate finder fdupes(1). At first a progressbar is printed on stderr. Afterwards the found files are printed on stdout; each set of duplicates gets printed as a block separated by newlines. Originals are highlighted in green. At the bottom a summary is printed on stderr. This is mostly useful for scripts that were set up for parsing fdupes output. We recommend the json formatter for every other scripting purpose. Available options:
- omitfirst: Same as the -f / --omitfirst option in fdupes(1). Omits the first line of each set of duplicates (i.e. the original file).
- sameline: Same as the -1 / --sameline option in fdupes(1). Does not print newlines between files, only a space. Newlines are printed only between sets of duplicates.
OTHER STAND-ALONE COMMANDS
rmlint --gui :
Start the optional graphical frontend to rmlint, called Shredder. This will only work when Shredder and its dependencies are installed. The gui has its own set of options; see rmlint --gui --help for a list.
rmlint --hash [paths...] :
Make rmlint work as a multi-threaded file hashing utility, similar to the familiar checksum tools (e.g. sha1sum), printing a checksum for each given path.
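A small sketch; whether additional options such as -a are honoured in this mode is an assumption on our part:
$ rmlint --hash big_file.iso              # print a checksum for one file
$ rmlint --hash -a sha256 big_file.iso    # assumed: -a selects the hash algorithm here too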
rmlint --equal [paths...] :
Check if the paths given on the commandline all have equal content. If all paths are equal and no other error happened, rmlint will exit with an exit code 0. Otherwise it will exit with a nonzero exit code. All other options can be used as normal, but note that no other output formatters will be executed by default.
Note: This even works for directories and also in combination with paranoid mode (pass -pp for byte-by-byte comparison). By default this will use hashing to compare the files and/or directories.
rmlint --dedupe [-r] [-v|-V] <src> <dest> :
If the filesystem supports files sharing physical storage between multiple files, and if src and dest have identical content, this command makes the two files share their data physically (deduplication via reflink). This is similar in effect to cp --reflink, except that the contents are verified to be identical first. Running with -r additionally allows deduplication of read-only snapshots (this usually requires root privileges).
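A minimal sketch (file names are placeholders; the filesystem must support reflinks, e.g. btrfs or XFS):
$ cp big_file.iso big_file_copy.iso                # plain copy: occupies extra space
$ rmlint --dedupe big_file.iso big_file_copy.iso   # let both files share their identical extents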
rmlint --is-reflink [-v|-V] <file1> <file2> :
Tests whether file1 and file2 are reflinks of each other, i.e. whether they share the same physical data extents. The exit code is 0 if they are (see the corresponding example at the end of this manual).
EXAMPLES
This is a collection of common use cases and other tricks:
Check the current working directory for duplicates.
$ rmlint
Show a progressbar:
$ rmlint -g
Quick re-run on large datasets using different ranking criteria on second run:
$ rmlint large_dir/ # First run; writes rmlint.json
$ rmlint --replay rmlint.json large_dir -S MaD
Merge together previous runs, but prefer the originals to be from
b.json
and make sure that no files are deleted fromb.json
:$ rmlint --replay a.json // b.json -k
Search only for duplicates and duplicate directories
$ rmlint -T "df,dd" .
Compare files byte-by-byte in current directory:
$ rmlint -pp .
Find duplicates that have the same file extension:
$ rmlint -e
Do more complex traversal using
find(1)
.$ find /usr/lib -iname '*.so' -type f | rmlint - # find all duplicate .so files
$ find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as above but handles filenames with newline character in them
$ find ~/pics -iname '*.png' | ./rmlint - # compare png files only
Limit file size range to investigate:
$ rmlint -s 2GB # Find everything >= 2GB
$ rmlint -s 0-2GB # Find everything < 2GB
Only find writable and executable files:
$ rmlint --perms wx
Reflink if possible, else hardlink duplicates to original if possible, else replace duplicate with a symbolic link:
$ rmlint -c sh:link
Inject user-defined command into shell script output:
$ rmlint -o sh -c sh:cmd='echo "original:" "$2" "is the same as" "$1"'
Use
shred
to overwrite the contents of a file fully:$ rmlint -c 'sh:cmd=shred -un 10 "$1"'
Use data as master directory. Find only duplicates in backup that are also in data. Do not delete any files in data:
$ rmlint backup // data --keep-all-tagged --must-match-tagged
Compare if the directories a, b and c are equal
$ rmlint --equal a b c && echo "Files are equal" || echo "Files are not equal"
Test if two files are reflinks
$ rmlint --is-reflink a b && echo "Files are reflinks" || echo "Files are not reflinks"
Cache calculated checksums for the next run. The checksums will be written to the extended file attributes:
$ rmlint --xattr
Produce a list of unique files in a folder:
$ rmlint -o uniques
Produce a list of files that are unique, including original files (“one of each”):
$ rmlint t -o json -o uniques:unique_files | jq -r '.[1:-1][] | select(.is_original) | .path' | sort > original_files
$ cat unique_files original_files
Sort files by a user-defined regular expression
# Always keep files with ABC or DEF in their basename,
# dismiss all duplicates with tmp, temp or cache in their names
# and if none of those are applicable, keep the oldest files instead.
$ ./rmlint -S 'x<.*(ABC|DEF).*>X<.*(tmp|temp|cache).*>m' /some/path
Sort files by adding priorities to several user-defined regular expressions:
# Unlike the previous snippet, this one uses priorities:
# Always keep files in ABC, DEF, GHI by following that particular order of
# importance (ABC has a top priority), dismiss all duplicates with
# tmp, temp, cache in their paths and if none of those are applicable,
# keep the oldest files instead.
$ rmlint -S 'r<.*ABC.*>r<.*DEF.*>r<.*GHI.*>R<.*(tmp|temp|cache).*>m' /some/path
PROBLEMS
- False Positives: Depending on the options you use, there is a very slight risk of false positives (files that are erroneously detected as duplicate). The default hash function (blake2b) is very safe, but in theory it is possible for two files to have the same hash. If you had 10^73 different files, all the same size, then the chance of a false positive is still less than 1 in a billion. If you're concerned, just use the --paranoid (-pp) option. This will compare all the files byte-by-byte and is not much slower than blake2b (it may even be faster), although it is a lot more memory-hungry.
- File modification during or after rmlint run: It is possible that a file that rmlint recognized as duplicate is modified afterwards, resulting in a different file. If you use the rmlint-generated shell script to delete the duplicates, you can run it with the -p option to do a full re-check of the duplicate against the original before it deletes the file. When using -c sh:hardlink or -c sh:symlink, care should be taken that a modification of one file will now result in a modification of all files. This is not the case for -c sh:reflink or -c sh:clone. Use -c sh:link to minimise this risk.
SEE ALSO
Reading the manpages of these tools might help working with rmlint:
- find(1)
- rm(1)
- cp(1)
Extended documentation and an in-depth tutorial can be found at http://rmlint.rtfd.org.
BUGS
If you found a bug, have a feature request or want to say something nice, please visit https://github.com/sahib/rmlint/issues.
Please make sure to describe your problem in detail. Always include the version of rmlint (--version). If you experienced a crash, please include the output of at least one of the following commands, run with a debug build of rmlint:
gdb --ex run -ex bt --args rmlint -vvv [your_options]
valgrind --leak-check=no rmlint -vvv [your_options]
You can build a debug build of rmlint like this:
git clone git@github.com:sahib/rmlint.git
cd rmlint
scons GDB=1 DEBUG=1
sudo scons install # Optional
LICENSE
rmlint is licensed under the terms of the GPLv3.
See the COPYRIGHT file that came with the source for more information.
PROGRAM AUTHORS
rmlint was written by:
- Christopher <sahib> Pahl 2010-2017 (https://github.com/sahib)
- Daniel <SeeSpotRun> T. 2014-2017 (https://github.com/SeeSpotRun)
Also see http://rmlint.rtfd.org for other people who helped us.