Developer's Guide

This guide is aimed at people who want to write new features or fix bugs in rmlint.

Bugs

Please use the issue tracker to post and discuss bugs and features.

Philosophy

We try to adhere to some principles when adding features:

  • Try to stay compatible with standard Unix tools and ideas.
  • Try to stay out of the user's way and never be interactive.
  • Try to make scripting as easy as possible.
  • Never make rmlint modify the filesystem itself; only produce output that lets the user do so easily.

Also keep this in mind if you want to make a feature request.

Making contributions

The code is hosted on GitHub, so our preferred way of receiving patches is through GitHub's pull requests (plain git pull requests are okay too, of course).

Note

origin/master should always contain working software. Always base your patches and pull requests on origin/develop.

Here's a short step-by-step:

  1. Fork it.
  2. Create a branch from develop. (git checkout develop && git checkout -b my_feature)
  3. Commit your changes. (git commit -am "Fixed it all.")
  4. Check if your commit message is good. (If not: git commit --amend)
  5. Push to the branch. (git push origin my_feature)
  6. Open a Pull Request.
  7. Enjoy a refreshing tea and wait until we get back to you.

Here are some other things to check before submitting your contribution:

  • Does your code look alien to the other code? Is the style the same?
  • Do all tests pass? (Simply run nosetests to find out.) After you open the pull request, your code will also be checked via Travis CI.
  • Is your commit message descriptive?
  • Is rmlint running okay inside of valgrind (i.e. no leaks and no memory violations)?

For new or updated translations it is also okay to send the .po files by mail to sahib@online.de, since not every translator is necessarily a software developer.

Buildsystem Helpers

Environment Variables

CFLAGS: Extra flags passed to the compiler.
LDFLAGS: Extra flags passed to the linker.
CC: Which compiler to use?

# Use clang; enable profiling, verbose build output and debugging
CC=clang CFLAGS='-pg' LDFLAGS='-pg' scons VERBOSE=1 DEBUG=1

Variables

DEBUG: Enable debugging symbols for rmlint. This should always be enabled during development; backtraces would not be useful otherwise.
VERBOSE: Print the exact compiler and linker commands. Useful for troubleshooting build errors.

Arguments

--prefix:
 Change the installation prefix. By default this is /usr, but some users might prefer /usr/local or /opt.
--actual-prefix:
 This is mainly useful for packagers. The rmlint binary knows where it is installed (which is needed to set e.g. the path to the gettext files). When building a package, the build is usually installed to a local test environment first and only later packed for /usr. In this case --prefix would be set to the path of the temporary build environment, while --actual-prefix would be set to /usr.
--without-libelf:
 Do not link with libelf, which is needed to detect non-stripped binaries.
--without-blkid:
 Do not link with libblkid, which is needed to differentiate between normal rotational hard disks and non-rotational disks.
--without-fiemap:
 Do not attempt to use the FIEMAP ioctl(2).
--without-gettext:
 Do not link with libintl and do not compile any message catalogs.

Every --without-* option has a matching --with-* option that inverts its effect. By default rmlint is built with all features available on the system, so you normally do not need to specify any --with-* option.

Notable targets

install:

Install all program parts system-wide.

config:

Print a summary of all features that will be compiled and what the environment looks like.

man:

Build the manpage.

docs:

Build the online HTML docs (which you are reading now).

test:

Build the tests (requires Python and nosetests to be installed). Optionally, valgrind can be installed to run the tests through valgrind:

$ USE_VALGRIND=1 nosetests  # or nosetests-3.3, python3 needed.

xgettext:

Extract a gettext .pot template from the source.

dist:

Build a tarball suitable for release. It is saved as rmlint-$major-$minor-$patch.tar.gz.

release:

Same as dist, but reads the .version file and replaces the current version in the files that are not built by scons.

Sourcecode layout

  • All C source lives in src; the file names should be self-explanatory.
  • All documentation is inside docs.
  • All translation files should go to po.
  • All packaging should be done in pkg/<distribution>.
  • Tests are written in Python and live in tests.

Hashfunctions

Here is a short comparison of the existing hashfunctions in rmlint (linear scale). For reference: those plots were rendered with these sources, which are very ugly, sorry.

If you want to add a new hashfunction, you should have some arguments for why it is valuable, and possibly even benchmark it with the above scripts to see if it is really that much faster.

Also keep in mind that most of the time the hashfunction is not the bottleneck.
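
As a rough illustration of such a benchmark, here is a minimal Python sketch. It is not the script referenced above, and it does not measure rmlint's own hashfunctions: it uses algorithms from Python's hashlib as stand-ins, and the names throughput_mb_per_s and compare are made up for this example. The data is hashed from memory, so comparing the numbers with the read speed of your disk gives a feeling for whether hashing is the bottleneck at all.

```python
import hashlib
import time


def throughput_mb_per_s(name, data, repeats=5):
    """Return the best observed hashing throughput in MB/s for one algorithm."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(name, data).digest()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse timers.
    return len(data) / (1024 * 1024) / max(best, 1e-9)


def compare(names=("sha1", "sha256", "sha512", "blake2b"), size=8 * 1024 * 1024):
    """Hash the same in-memory buffer with each algorithm (stand-ins, not rmlint's)."""
    data = bytes(size)  # zero-filled buffer, so no disk I/O is measured
    return {name: throughput_mb_per_s(name, data) for name in names}


if __name__ == "__main__":
    # Print the fastest algorithm first.
    for name, speed in sorted(compare().items(), key=lambda item: -item[1]):
        print(f"{name:8s} {speed:10.1f} MB/s")
```

The absolute numbers depend heavily on the machine, so only the relative order on the same system is meaningful.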

Optimizations

For the sake of overview, here is a short list of optimizations implemented in rmlint:

Obvious ones

  • Do not compare every file with every other file by content; use a hashfunction to reduce comparison overhead drastically (this introduces the possibility of collisions, though).
  • Only compare files of the same size with each other.
  • Use incremental hashing, i.e. hash each size group block-wise and stop as soon as a difference occurs or the file has been read fully.
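
The three ideas above can be sketched in a few lines of Python. This is only an illustration of the technique, not rmlint's actual C implementation: the function names, the block size and the choice of SHA-1 are assumptions made for this example, and the sketch naively keeps every file of a group open at once.

```python
import hashlib
import os
from collections import defaultdict

BLOCK_SIZE = 4096  # hypothetical block size, chosen for illustration only


def group_by_size(paths):
    """Group paths by size; a file with a unique size cannot have a duplicate."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.getsize(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def split_incrementally(group, block_size=BLOCK_SIZE):
    """Hash a same-sized group block by block, dropping files as soon as they differ."""
    handles = {path: open(path, "rb") for path in group}
    states = {path: hashlib.sha1() for path in group}
    pending, duplicates = [list(group)], []
    try:
        while pending:
            candidates = pending.pop()
            buckets = defaultdict(list)
            at_eof = False
            for path in candidates:
                block = handles[path].read(block_size)
                if not block:
                    # All candidates have the same size, so they hit EOF together.
                    at_eof = True
                    continue
                states[path].update(block)
                buckets[states[path].digest()].append(path)
            if at_eof:
                # Read fully without any difference: treat as duplicates.
                duplicates.append(candidates)
                continue
            # Only subgroups that still have a potential twin are read further.
            pending.extend(b for b in buckets.values() if len(b) > 1)
    finally:
        for handle in handles.values():
            handle.close()
    return duplicates


def find_duplicates(paths):
    """Return lists of paths whose contents hash identically."""
    result = []
    for group in group_by_size(paths):
        result.extend(split_incrementally(group))
    return result
```

Calling find_duplicates() with a list of file paths returns one list per set of identical files. A file that differs in its first block is never read beyond that block.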

Subtle ones

  • Only check executable files for being non-stripped binaries.
  • Use preadv(2) based reading for small speedups.
  • Every thread in rmlint is shared, so only a few calls to pthread_create are made.
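
The idea behind shared threads can be illustrated with a fixed-size thread pool. The following is a minimal Python sketch using the standard library, not rmlint's own thread handling; run_in_shared_pool and max_workers are names invented for this example. Many tasks are submitted, but only a handful of threads are ever created.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_in_shared_pool(tasks, max_workers=4):
    """Run callables on a fixed pool and report how many threads were used."""
    used_threads = set()
    lock = threading.Lock()

    def wrapped(task):
        # Record which worker thread picked up this task.
        with lock:
            used_threads.add(threading.get_ident())
        return task()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the results in the order of the submitted tasks.
        results = list(pool.map(wrapped, tasks))
    return results, len(used_threads)
```

Running a hundred small tasks through run_in_shared_pool() never uses more than max_workers threads, so the cost of creating threads is paid only a few times.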

Insane ones

  • Check the device ID of each file to see if it is on a rotational device (normal hard disks) or on a non-rotational device (like an SSD). On the latter, the file might be processed by several threads.
  • Use the FIEMAP ioctl(2) to analyze the on-disk layout of each file, so that its blocks can be read in perfect order on a rotational device.
  • Use a common buffer pool for IO buffers.
  • Use only one hashsum per group of same-sized files.
  • Implement the paranoia check as a hash sum, so that large chunks of the file are read and compared at once. The total memory used for this can be configured by --max-paranoid-ram.
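
To illustrate the last point, here is a minimal Python sketch of a byte-by-byte comparison that reads large chunks under a memory budget. It only shows the general idea and is not how rmlint implements it; files_identical and its max_mem parameter are hypothetical and merely mirror the role of --max-paranoid-ram.

```python
import os


def files_identical(path_a, path_b, max_mem=64 * 1024 * 1024):
    """Byte-compare two files while holding roughly max_mem bytes in buffers."""
    # Two chunks are alive at the same time, so each gets half of the budget.
    chunk_size = max(1, max_mem // 2)
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk ends the comparison
            if not chunk_a:
                return True  # both files reached EOF without a difference
```

A larger budget means fewer, longer reads, which tends to matter most on rotational disks where switching between files is expensive.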