rmlint [TARGET_DIR_OR_FILES ...] [//] [TAGGED_TARGET_DIR_OR_FILES ...] [-] [OPTIONS]
rmlint finds space waste and other broken things on your filesystem.
Types of waste include:
* Duplicate files and directories.
* Nonstripped binaries (binaries with debug symbols).
* Broken links.
* Empty files and directories.
* Files with broken user or group id.
rmlint will not delete any files. It does however produce executable output
(for example a shell script) to help you delete the files if you want to.
In order to find the lint, rmlint is given one or more directories to traverse.
If no directories or files were given, the current working directory is assumed.
By default, rmlint will ignore hidden files and will not follow symlinks (see
traversal options below). rmlint will first find "other lint" and then search
the remaining files for duplicates.
Duplicate sets will be displayed as an original and one or more duplicates. You
can set criteria for how rmlint chooses the original using the -S option (by default it
chooses the first-named path on the command line, or if that is equal then the
oldest file based on mtime). You can also specify that certain paths only contain
originals by naming the path after the special path separator //.
Examples are given at the end of this manual.
-T --types="list" (default: defaults): | |
---|---|
WARNING: It is good practice to enclose the description in quotes. In obscure cases argument parsing might fail in weird ways. |
|
-o --output=spec / -O --add-output=spec (default: -o sh:rmlint.sh -o pretty:stdout -o summary:stdout): | |
For a list of formatters and their options, refer to the Formatters section below. |
|
-c --config=spec[=value] (default: none): | |
Configure a format. This option can be used to fine-tune the behaviour of the existing formatters. See the Formatters section for details on the available keys. If the value is omitted it is set to a true value. |
|
-z --perms[=[rwx]] (default: no check): | |
Only look into a file if it is readable, writable or executable by the current user. Which of these is checked can be given as an argument made up of the letters "rwx". If no argument is given, "rw" is assumed. Note that r does basically nothing user-visible. By default this check is not done. |
|
-a --algorithm=name (default: sha1): | |
spooky, city, murmur, xxhash, md5, sha1, sha256, sha512, farmhash.
|
|
-p --paranoid / -P --less-paranoid (default): | |
Increase or decrease the paranoia of
|
|
-v --loud / -V --quiet : | |
Increase or decrease the verbosity. You can pass these options several
times. This only affects |
|
-g --progress / -G --no-progress (default): | |
Convenience shortcut for
|
|
-D --merge-directories (default: disabled): | |
Makes rmlint use a special mode where all found duplicates are collected and
checked if whole directory trees are duplicates. Use with caution: You
should always make sure that the investigated directory is not modified
during the run.
Output is deferred until all duplicates were found. Duplicate directories are printed first, followed by any remaining duplicate files.
--rank-by applies for directories too, but 'p' or 'P' (path index) has no defined (i.e. useful) meaning. Sorting only takes place when the number of preferred files in the directory differs.
NOTES:
|
|
-y --sort-by=order (default: none): | |
During output, sort the found duplicate groups by criteria described by order. order is a string that may consist of one or more of the following letters:
The letter may also be written uppercase (similar to |
|
--gui : | Start the optional graphical frontend to This will only work when
|
--hash : |
|
-w --with-color (default) / -W --no-with-color : | |
Use color escapes for pretty output or disable them.
If you pipe rmlint's output to a file |
|
-h --help / -H --show-man : | |
Show a shorter reference help text ( |
|
--version : | Print the version of rmlint. Includes git revision and compile time features. |
-s --size=range (default: all): | |
---|---|
Only consider files in a certain size range. The format of range is min-max, where both ends can be specified as a number with an optional multiplier. The available multipliers are:
The size format is about the same as dd(1) uses. Example: "100KB-2M". It's also possible to specify only one size. In this case the size is
interpreted as "bigger than this size". If you want to filter for files
up to this size you can add a |
|
-d --max-depth=depth (default: INF): | |
Only recurse up to this depth. A depth of 1 would disable recursion and is equivalent to a directory listing. |
|
-l --hardlinked (default) / -L --no-hardlinked : | |
Whether to report hardlinked files as duplicates. |
|
-f --followlinks / -F --no-followlinks / -@ --see-symlinks (default): | |
|
|
-x --no-crossdev / -X --crossdev (default): | |
Stay always on the same device ( |
|
-r --hidden / -R --no-hidden (default) / --partial-hidden : | |
Also traverse hidden directories? This is often not a good idea, since
directories like
|
|
-b --match-basename : | |
Only consider those files as dupes that have the same basename. See also
|
|
-B --unmatched-basename : | |
Only consider those files as dupes that do not share the same basename.
See also |
|
-e --match-with-extension / -E --no-match-with-extension (default): | |
Only consider those files as dupes that have the same file extension. For
example two photos would only match if they are a |
|
-i --match-without-extension / -I --no-match-without-extension (default): | |
Only consider those files as dupes that have the same basename minus the file
extension. For example: |
|
-n --newer-than-stamp=<timestamp_filename> / -N --newer-than=<iso8601_timestamp_or_unix_timestamp> : | |
Only consider files (and their size siblings for duplicates) newer than a certain modification time (mtime). The age barrier may be given as seconds since the epoch or as an ISO 8601 timestamp like 2014-09-08T00:12:32+0200.
Note: you can make rmlint write out a compatible timestamp with:
|
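The two age barrier forms described above can be produced with date(1). The following is a minimal sketch, not rmlint's own stamp output: it assumes a date(1) that supports `+%s` and `%z` (GNU and BSD do), and the names `epoch_stamp`, `iso_stamp` and `scan_newer` are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Age barrier in the two forms accepted by -N / --newer-than.
epoch_stamp=$(date +%s)                  # seconds since the epoch
iso_stamp=$(date +%Y-%m-%dT%H:%M:%S%z)   # ISO 8601, e.g. 2014-09-08T00:12:32+0200

# Hypothetical wrapper: scan directory "$1" for files newer than stamp "$2".
# It is only defined here, not called, so the sketch has no side effects.
scan_newer() {
    rmlint "$1" -N "$2"
}

echo "epoch: $epoch_stamp"
echo "iso:   $iso_stamp"
```

A stamp recorded at the end of one run could later be passed on as `scan_newer ~/pics "$iso_stamp"`, so that only files modified since then are considered.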
-k --keep-all-tagged / -K --keep-all-untagged : | |
---|---|
Don't delete any duplicates that are in tagged paths ( |
|
-m --must-match-tagged / -M --must-match-untagged : | |
Only look for duplicates of which at least one is in one of the tagged paths. (Paths that were named after //). |
|
-S --rank-by=criteria (default: pm): | |
Sort the files in a group of duplicates by one or more criteria.
Alphabetical sort will only use the basename of the file and ignore its case.
One can have multiple criteria, e.g.:
For more fine grained control, it is possible to give a regular expression
to sort by. This can be useful when you know a common fact that identifies
original paths (like a path component being
To use the regular expression you simply enclose it in the criteria string
by adding <REGULAR_EXPRESSION> after specifying r or x. Example:
Warning: When using r or x, try to make your regex as specific
as possible! Good practice includes adding a
Tip: l is useful for files like file.mp3 vs file.1.mp3 or file.mp3.bak. |
--replay [path.json] : | |
---|---|
Read an existing json file and re-output it. This is very useful if you want
to reformat, refilter or resort the output you got from a previous run.
Usage is simple: Just pass
If you want to view only the duplicates of certain subdirectories, just pass them on the commandline as usual.
If
By design, some options will not have any effect. Those are:
|
|
--xattr-read / --xattr-write / --xattr-clear : | |
Read or write cached checksums from the extended file attributes. This feature can be used to speed up consecutive runs.
CAUTION: This is a potentially unsafe feature. The cache file might be
changed accidentally, potentially causing
NOTE: The speedup you may experience may vary wildly. In some cases the parsing of the json file might take longer than the actual hashing. Also, the cached json file will not be of use when doing many modifications between the runs, i.e. causing an update of mtime on most files. This feature is mostly intended for large datasets in order to prevent the re-hashing of large files.
NOTE: Many tools do not support extended file attributes properly, resulting in a loss of the information when copying the file or editing it. Also, this is a Linux-specific feature that does not work on all filesystems and only works if you have write permissions to the file.
Usage example:
$ rmlint large_file_cluster/ -U --xattr-write # first run.
$ rmlint large_file_cluster/ --xattr-read # second run.
|
|
-U --write-unfinished : | |
Include files in output that have not been hashed fully (i.e. files that do
not appear to have a duplicate). This is mainly useful in conjunction with
|
-t --threads=N (default: 16): | |
---|---|
The number of threads to use during file tree traversal and hashing.
|
|
-u --max-paranoid-mem=size : | |
Apply a maximum number of bytes to use for --paranoid.
The |
|
-q --clamp-low=[fac.tor|percent%|offset] (default: 0) / -Q --clamp-top=[fac.tor|percent%|offset] (default: 1.0): | |
The argument can be either passed as factor (a number with a
Only look at the content of files in the range of from
This is useful in a few cases where a file consists of a constant-sized header or footer. With this option you can just compare the data in between. It might also be useful for approximate comparison where it suffices when the file is the same in the middle part. |
|
--with-fiemap (default) / --without-fiemap : | |
Enable or disable reading the file extents on rotational disks in order to optimize disk access patterns. |
csv
: Output all found lint as comma-separated-value list.
Available options:
sh
: Output all found lint as shell script. This formatter is activated as default.
Available options:
* cmd: Specify a user defined command to run on duplicates. The command can be any valid /bin/sh expression. The duplicate path and original path can be accessed via "$1" and "$2". The command will be written to the user_command function in the sh-file produced by rmlint.
* handler: Define a comma separated list of handlers to try on duplicate files in that given order until one handler succeeds. Handlers are just the name of a way of getting rid of the file and can be any of the following:
  * clone: btrfs only. Try to clone both files with the BTRFS_IOC_FILE_EXTENT_SAME ioctl(3p). This will physically delete duplicate extents. Needs at least kernel 4.2.
  * reflink: Try to reflink the duplicate file to the original. See also --reflink in man 1 cp. Fails if the filesystem does not support it.
  * hardlink: Replace the duplicate file with a hardlink to the original file. Fails if both files are not on the same partition.
  * symlink: Tries to replace the duplicate file with a symbolic link to the original. Never fails.
  * remove: Remove the file using rm -rf. (-r for duplicate dirs). Never fails.
  * usercmd: Use the provided user defined command (-c sh:cmd=something). Never fails.
  Default is remove.
* link: Shortcut for -c sh:handler=clone,reflink,hardlink,symlink.
* hardlink: Shortcut for -c sh:handler=hardlink,symlink.
* symlink: Shortcut for -c sh:handler=symlink.
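The "try each handler in order until one succeeds" behaviour can be pictured with a small shell function. This is a rough sketch only and not the code rmlint writes into its script: `link_duplicate` is a hypothetical name, the clone handler is left out because it needs an ioctl, and `cp --reflink=always` is assumed to be the GNU cp option (on other systems it simply fails and the next handler is tried). Paths should be absolute so that the symlink fallback points at the right file.

```shell
#!/bin/sh
# Hypothetical helper: replace duplicate "$1" with a link to original "$2",
# trying reflink, then hardlink, then symlink, and stopping at the first success.
# Prints the name of the handler that worked.
link_duplicate() {
    dup=$1
    orig=$2
    tmp="$dup.tmp.$$"
    for handler in reflink hardlink symlink; do
        case $handler in
            reflink)  cp --reflink=always -- "$orig" "$tmp" 2>/dev/null ;;
            hardlink) ln -- "$orig" "$tmp" 2>/dev/null ;;
            symlink)  ln -s -- "$orig" "$tmp" 2>/dev/null ;;
        esac
        # $? is the exit status of the handler command run in the case above.
        if [ $? -eq 0 ]; then
            # Only replace the duplicate once the link exists.
            mv -f -- "$tmp" "$dup" && echo "$handler" && return 0
        fi
        rm -f -- "$tmp"
    done
    return 1
}
```

Building the link under a temporary name first means the duplicate is never removed before a replacement is in place, which is the main design point of the sketch.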
json
: Print a JSON-formatted dump of all found reports.
Outputs all finds as a json document. The document is a list of dictionaries,
where the first and last elements are the header and the footer respectively;
everything in between is a data dictionary.
Available options:
py
: Outputs a python script and a JSON document, just like the json formatter.
The JSON document is written to .rmlint.json; executing the script will
make it read from there. This formatter is mostly intended for complex use-cases
where the lint needs special handling. Therefore the python script can be modified
to do things standard rmlint is not able to do easily.
stamp
: Outputs a timestamp of the time rmlint was run.
Available options:
progressbar
: Shows a progressbar. This is meant for use with stdout or
stderr [default].
See also: -g (--progress) for a convenience shortcut option.
Available options:
pretty
: Shows all found items in realtime nicely colored. This formatter
is activated as default.
summary
: Shows counts of files and their respective size after the run.
Also lists all written output files.
fdupes
: Prints an output similar to the popular duplicate finder
fdupes(1). At first a progressbar is printed on stderr. Afterwards the
found files are printed on stdout; each set of duplicates gets printed as a
block separated by newlines. Originals are highlighted in green. At the bottom
a summary is printed on stderr. This is mostly useful for scripts that were
set up for parsing fdupes output. We recommend the json formatter for every other
scripting purpose.
Available options:
-f / --omitfirst option in fdupes(1). Omits the
first line of each set of duplicates (i.e. the original file).
-1 / --sameline option in fdupes(1). Does not
print newlines between files, only a space. Newlines are printed only between
sets of duplicates.
This is a collection of common usecases and other tricks:
Check the current working directory for duplicates.
$ rmlint
Show a progressbar:
$ rmlint -g
Quick re-run on large datasets using different ranking criteria on second run:
$ rmlint large_dir/ # First run; writes rmlint.json
$ rmlint --replay rmlint.json large_dir -S MaD
Search only for duplicates and duplicate directories:
$ rmlint -T "df,dd" .
Compare files byte-by-byte in current directory:
$ rmlint -pp .
Find duplicates with same basename (excluding extension):
$ rmlint -i
Do more complex traversal using find(1).
$ find /usr/lib -iname '*.so' -type f | rmlint - # find all duplicate .so files
$ find ~/pics -iname '*.png' | rmlint - # compare png files only
Limit file size range to investigate:
$ rmlint -s 2GB # Find everything >= 2GB
$ rmlint -s 0-2GB # Find everything < 2GB
Only find writable and executable files:
$ rmlint --perms wx
Reflink on btrfs, else try to hardlink duplicates to original. If that does not work, replace duplicate with a symbolic link:
$ rmlint -c sh:link
Inject user-defined command into shell script output:
$ rmlint -o sh -c sh:cmd='echo "original:" "$2" "is the same as" "$1"'
Use data as master directory. Find only duplicates in backup that are also in data. Do not delete any files in data:
$ rmlint backup // data --keep-all-tagged --must-match-tagged
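The sh:cmd example above can be tried out in isolation before handing it to rmlint. A small sketch, assuming only what the sh formatter description states (the command ends up in a function named user_command, with the duplicate in "$1" and the original in "$2"); the two paths are made up for illustration.

```shell
#!/bin/sh
# Stand-in for the user_command function of the generated script:
# "$1" is the duplicate path, "$2" is the original path.
user_command() {
    echo "original:" "$2" "is the same as" "$1"
}

# Illustrative paths only; they do not need to exist for this dry run.
user_command "backup/photo.jpg" "data/photo.jpg"
# prints: original: data/photo.jpg is the same as backup/photo.jpg
```

Testing the command this way makes quoting mistakes visible before it is applied to real duplicates.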
--paranoid (-pp) option. This will compare all the files byte-by-byte and is not much slower than SHA1.
rmlint recognized as duplicate is modified afterwards, resulting in a
different file. If you use the rmlint-generated shell script to delete the duplicates,
you can run it with the -p option to do a full re-check of the duplicate against
the original before it deletes the file.
If you found a bug, have a feature request or want to say something nice, please visit https://github.com/sahib/rmlint/issues.
Please make sure to describe your problem in detail. Always include the version
of rmlint (--version). If you experienced a crash, please include
at least one of the following pieces of information, obtained with a debug build of rmlint:
gdb -ex run -ex bt --args rmlint -vvv [your_options]
valgrind --leak-check=no rmlint -vvv [your_options]
You can build a debug build of rmlint like this:
git clone git@github.com:sahib/rmlint.git
cd rmlint
scons DEBUG=1
sudo scons install # Optional
rmlint is licensed under the terms of the GPLv3.
See the COPYRIGHT file that came with the source for more information.
rmlint was written by:
Also see http://rmlint.rtfd.org for other people that helped us.
If you consider a donation you can use Flattr or buy us a beer if we meet: