FileUniq Man Page


Usage: fileuniq.py [options] [list of files and directories, default .]

This program searches for duplicate files and (optionally) eliminates
them, deleting the files or replacing them with hard or soft links.

Copyright Lorens Kockum 2009

General defaults (may be overridden by options):

- A record of found duplicates is output to screen

- Nothing is modified on disk

The order of filenames is important. If you pass dir1 dir2 as
arguments, then, all other things being equal, filenames in
the dir1 directory tree will be considered "best". This may
influence the actions taken.  For example, the default is to
output the "best" filename first in the list of duplicates.

One use case is

      ./fileuniq.py --duplicates=symlink --strategy=keepfirst

followed by a list of directories in order of age.  After
running, the oldest/first directory will have files, while
the other directories will have lots of symlinks to files in
"older" directories, and real files only if they are the first
occurrence of the file found.
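
The effect of this use case can be illustrated with a minimal Python
sketch. It is not the program's actual implementation: the function
names checksum and symlink_duplicates, the choice of SHA-256, and the
use of absolute symlink targets are all assumptions made for
illustration. The sketch walks the directories in the order given,
keeps the first file seen for each distinct content, and replaces
later copies with symbolic links.

```python
import hashlib
import os


def checksum(path, blocksize=65536):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()


def symlink_duplicates(directories):
    """Keep the first copy of each file; symlink later copies to it.

    Directories are walked in the order given, so files in earlier
    directories are treated as "best". Returns a list of
    (replaced_path, kept_path) pairs.
    """
    first_seen = {}  # checksum -> absolute path of the first file seen
    replaced = []
    for directory in directories:
        for root, dirnames, filenames in os.walk(directory):
            dirnames.sort()  # make the walk order deterministic
            for name in sorted(filenames):
                path = os.path.join(root, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                key = checksum(path)
                if key not in first_seen:
                    first_seen[key] = os.path.abspath(path)
                    continue
                # Later occurrence: replace it with a link to the first.
                os.remove(path)
                os.symlink(first_seen[key], path)
                replaced.append((path, first_seen[key]))
    return replaced
```

Called with two hypothetical directories, oldest first, for example
symlink_duplicates(["backup-old", "backup-new"]), the sketch leaves
every file in backup-old untouched and turns each identical file in
backup-new into a symlink pointing at its counterpart.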

Requirements:

- python (tested on 2.5.2 but earlier versions should work).

- sqlite3

- Unix filesystem semantics (hard and symbolic links, stat()...).
  Tested only on Linux, and not on mounted filesystems whose
  semantics may differ.

This program is released under the GNU General Public License. If you
have not received a copy of the GPL with this script, you can download
it at http://www.gnu.org/licenses/gpl.html

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You might lose your data! At the time of the alpha release, this
program had been tested by one person on one computer, one operating
system, and one set of files. You should read the program
thoroughly and play with the verbosity options before trusting
your data to this program -- and even then you should of course
have backups.

You should also thoroughly understand the implications of creating
hard or symbolic links in your directory tree.

Options:
  -h, --help            show this help message and exit
  -v, --verbose         Be verbose. Specify several times to increase
                        verbosity. Default verbosity is 1.
                          Verbosity 0: No output
                          Verbosity 1: Only fatal errors and warnings
                          Verbosity 2: Note for every action executed
                          Verbosity 3: General progress indicators
                          Verbosity 4: Debugging files examined
                          Verbosity 5: Debugging database
  -q, --quiet           Set verbosity to zero. Except for the list of
                        duplicates (when it is requested), do not generate
                        any output.
  --database=DATABASE     The only implemented possibilities are a filename
                        and the string ":memory:". The default is ":memory:".
                        If the string is ":memory:" then the database used is
                        sqlite3 in memory. If not, the string must be a valid
                        filename. If the file named does not exist, it is
                        created as an sqlite3 database. If the file exists, it
                        must be a valid sqlite3 database file.  Future
                        implementations may permit URLs such as
                        mysql://user:pass@server/database
  --initialize            Initializes the database. This is assumed when the
                        database is :memory: or a file that does not exist.
  --update=UPDATE       "update" updates an existing or just created
                        database. All files derived from the command line are
                        stat-ed, the checksum is calculated or recalculated if
                        there is any reason to believe it has changed, and the
                        information is updated in the database. This is the
                        default.

                        "noupdate" does not go through the files derived from
                        the command line to stat them. This can be used when
                        one is confident that files have not changed and one
                        does not wish to take the time to stat all the files.
                        One can for example invoke the program with
                        --update=update and --duplicates=noop on a set of new
                        files, and then invoke it again with --update=noupdate
                        on a much larger set of files. Note that no action
                        will be taken based on incorrect information, since a
                        check is forced before any file system modification is
                        made.

                        "forceupdate" forces recalculation of the stored
                        checksums even if there is no indication that the
                        files have changed. This option exists for
                        completeness. It would probably only be useful when
                        one suspects corruption in files referenced from the
                        command line (same size and same time of last
                        modification, but different data), BUT the database
                        also contains information on files not referenced from
                        the command line that one does not want to modify.
                        Otherwise one would simply delete the database.
  --extradata=EXTRADATA
                        This concerns files not derived from the command
                        line which are present in the database from previous
                        runs.

                        "discard" will remove the entries from the database.

                        "ignore" will ignore these entries (except that their
                        "level" will be set to zero). This is the default.

                        I do not wish to take these entries into account
                        because I do not think it is a good idea to work on
                        files not specified on the command line. If you wish
                        to take these entries into account you should use
                        --update=noupdate and / as argument. You could update
                        the files that you do want to update with a previous
                        invocation using --duplicates=noop.
  --duplicates=DUPLICATES
                        Choices: list, delete, symlink, hardlink, noop
                        Default: list

                        This is the essential option that describes what the
                        program should do with the duplicates it finds. Which
                        duplicate is affected (which file is replaced by a
                        symlink or deleted, for example, or simply which file
                        comes first in the list) is governed by the --strategy
                        specification.
  --strategy=STRATEGY   Choices: keepfirst, keepoldest, keeplatest,
                        keepmostlinks
                        Default: keepfirst

                        When the action is to list duplicates, the file to be
                        kept is listed first. Oldest and latest relate to time
                        of last modification.
  --keepdifferenttimes=KEEPDIFFERENTTIMES
                        Choices: yes, no
                        Default: yes

                        This is relevant to last modification time. Other
                        times are not compared. The modification time is kept
                        in the inode, so when one finds two files that are
                        identical except for the modification time, one must
                        decide if the time is important enough to keep
                        multiple versions of the data. The default is to
                        retain all the information.
  --keepdifferentusers=KEEPDIFFERENTUSERS
                        Choices: yes, no
                        Default: yes

                        The discussion is the same as for option
                        --keepdifferenttimes, except that usually changing the
                        user can have more consequences than changing the
                        timestamp.
  --keepdifferentmodes=KEEPDIFFERENTMODES
                        Choices: yes, no
                        Default: yes

                        The discussion is the same as for option
                        --keepdifferenttimes, except that usually changing the
                        file mode can have more consequences than changing the
                        timestamp.
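
How --strategy and the --keepdifferent* options could interact can be
sketched in Python. This is an illustration under stated assumptions,
not the program's own code: the FileInfo record and the functions
group_key and choose_best are hypothetical names, and reading
"keepmostlinks" as "highest hard link count" is a guess based on the
option name alone.

```python
from collections import namedtuple

# Hypothetical record of what is known about one file. "order" is the
# file's position in command-line order (lower means earlier).
FileInfo = namedtuple(
    "FileInfo", "path checksum size mtime uid mode nlink order")


def group_key(info, keep_times=True, keep_users=True, keep_modes=True):
    """Build the key under which files count as duplicates.

    When a keepdifferent* option is "yes" (True here), the matching
    attribute becomes part of the key, so files that differ in that
    attribute are never treated as duplicates of each other.
    """
    key = [info.checksum, info.size]
    if keep_times:
        key.append(info.mtime)
    if keep_users:
        key.append(info.uid)
    if keep_modes:
        key.append(info.mode)
    return tuple(key)


def choose_best(duplicates, strategy="keepfirst"):
    """Return the file to keep; ties fall back to command-line order."""
    rank = {
        "keepfirst": lambda f: (f.order,),
        "keepoldest": lambda f: (f.mtime, f.order),
        "keeplatest": lambda f: (-f.mtime, f.order),
        # Assumption: "most links" means the largest st_nlink value.
        "keepmostlinks": lambda f: (-f.nlink, f.order),
    }
    return min(duplicates, key=rank[strategy])
```

In this sketch, two files with the same content but different
modification times land in different groups unless keep_times is
False, so keepoldest and keeplatest only make a difference when
--keepdifferenttimes=no is in effect.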