Usage: fileuniq.py [options] [list of files and directories, default .]
This program searches for duplicate files and (optionally) eliminates
them, deleting the files or replacing them with hard or soft links.
Copyright Lorens Kockum 2009
General defaults (may be overridden by options):
- A record of found duplicates is output to screen
- Nothing is modified on disk
The order of filenames is important. If you pass dir1 dir2 as
arguments, then, all other things being equal, filenames in
the dir1 directory tree will be considered "best". This may
influence the actions taken. For example, the default is to
output the "best" filename first in the list of duplicates.
One use case is
./fileuniq.py --duplicates=symlink --strategy=keepfirst
followed by a list of directories in order of age. After
running, the oldest/first directory will have files, while
the other directories will have lots of symlinks to files in
"older" directories, and real files only if they are the first
occurrence of the file found.
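
As a rough illustration of what that invocation amounts to, here is a
minimal sketch in modern Python 3. It is not the program's own code. The
function names are hypothetical, SHA-256 is an assumed stand-in for
whatever checksum the program uses, and the sketch ignores the
--keepdifferent* checks and the database entirely.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents (assumed checksum)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def symlink_keepfirst(directories):
    """Replace later duplicates with symlinks to the first copy seen.

    Hypothetical helper: returns a list of (kept_path, replaced_path) pairs.
    """
    first_seen = {}  # digest -> absolute path of the first file with that content
    replaced = []
    for top in directories:  # command-line order decides which copy is "best"
        for dirpath, dirnames, filenames in os.walk(top):
            dirnames.sort()  # make the traversal order deterministic
            for name in sorted(filenames):
                path = os.path.abspath(os.path.join(dirpath, name))
                if os.path.islink(path) or not os.path.isfile(path):
                    continue  # skip symlinks and non-regular files
                kept = first_seen.setdefault(file_digest(path), path)
                if kept != path:
                    # Not atomic: a crash between these two calls loses the name.
                    os.remove(path)
                    os.symlink(kept, path)
                    replaced.append((kept, path))
    return replaced
```

Calling symlink_keepfirst(["dir1", "dir2"]) leaves dir1 untouched and
turns every file in dir2 whose content already exists in dir1 into a
symlink. Try it on a scratch copy first, since it modifies the disk.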
Requirements:
- python (tested on 2.5.2 but earlier versions should work).
- sqlite3
- Unix filesystem semantics (hard and symbolic links, stat()...)
Not tested on anything other than Linux, nor on mounted filesystems
whose semantics may differ.
This program is released under the GNU General Public License. If you have not
received a copy of the GPL with this script, you can download it at
http://www.gnu.org/licenses/gpl.html
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
You might lose your data! At the time of the alpha release this program
had been tested by one person on one computer, one operating
system, and one set of files. You should read the program
thoroughly and play with the verbosity options before trusting
your data to this program, and even then you should of course
have backups.
You should also thoroughly understand the implications of creating
hard or symbolic links in your directory tree.
Options:
-h, --help show this help message and exit
-v, --verbose Be verbose. Specify several times to increase
verbosity. Default verbosity is 1.
Verbosity 0: No output
Verbosity 1: Only fatal errors and warnings
Verbosity 2: Note for every action executed
Verbosity 3: General progress indicators
Verbosity 4: Debugging files examined
Verbosity 5: Debugging database
-q, --quiet Decrease verbosity (set verbosity to zero). Except
for the list (when it is requested), do not generate
any output.
--database=DATABASE The only implemented possibilities are a filename
and the string ":memory:". The default is ":memory:".
If the string is ":memory:" then the database used is
sqlite3 in memory. If not, the string must be a valid
filename. If the file named does not exist, it is
created as an sqlite3 database. If the file exists, it
must be a valid sqlite3 database file. Future
implementations may permit URLs such as
mysql://user:pass@server/database
--initialize Initializes the database. This is assumed when the
database is :memory: or a file that does not exist.
--update=UPDATE Choices: update, noupdate, forceupdate
Default: update
"update" updates an existing or just created
database. All files derived from the command line are
stat-ed, the checksum is calculated or recalculated if
there is any reason to believe it has changed, and the
information is updated in the database.
"noupdate" does not go through the files
derived from the command line to stat them. This can
be used when one is confident that files have not
changed and one does not wish to take the time to stat
all the files. One can for example invoke the program
with --update=update and --duplicates=noop on a set of
new files, and then invoke it again with
--update=noupdate on a much larger set of
files. Note that no action will be taken based on
incorrect information, since a check is forced before
any file system modification is made.
"forceupdate" forces recalculation of the stored
checksums even if there is no indication that the
files have changed. This option exists for
completeness. It would probably only be useful when
one suspects corruption of files referenced from the
command line (same size and same time of last
modification, but different data), BUT the database
contains information on files not referenced from the
command line that one does not want to modify.
Otherwise one would simply delete the database.
--extradata=EXTRADATA
Choices: discard, ignore
Default: ignore
This concerns database entries for files that are not
derived from the command line but are present from
previous runs.
"discard" will remove the entries from the database.
"ignore" will ignore these entries (except that their
"level" will be set to zero). I do not
wish to take these entries into account because I do
not think it is a good idea to work on files not
specified on the command line. If you wish to take
these entries into account you should use
--update=noupdate and / as argument. You could update
the files that you do want to update with a previous
invocation using --duplicates=noop.
--duplicates=DUPLICATES
Choices: list, delete, symlink, hardlink, noop
Default: list
This is the essential option that
describes what the program should do with all these
duplicates. The choice between duplicates (which file
is replaced by a symlink or deleted, for example, or
simply which file comes first in the list) is governed
by the --strategy specification.
--strategy=STRATEGY Choices: keepfirst, keepoldest, keeplatest,
keepmostlinks
Default: keepfirst
When the action is to list duplicates, the file that
would be kept is listed first.
Oldest and latest relate to time of last modification.
--keepdifferenttimes=KEEPDIFFERENTTIMES
Choices: yes, no
Default: yes
This is relevant to
last modification time. Other times are not compared.
The modification time is kept in the inode, so when
one finds two files that are identical except for the
modification time, one must decide if the time is
important enough to keep multiple versions of the
data. The default is to retain all the information.
--keepdifferentusers=KEEPDIFFERENTUSERS
Choices: yes, no
Default: yes
The discussion is
the same as for option --keepdifferenttimes, except
that usually changing the user can have more
consequences than changing the timestamp.
--keepdifferentmodes=KEEPDIFFERENTMODES
Choices: yes, no
Default: yes
The discussion is
the same as for option --keepdifferenttimes, except
that usually changing the file mode can have more
consequences than changing the timestamp.
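
To make the interaction between --strategy and the --keepdifferent*
options concrete, here is a minimal Python 3 sketch. It is an
illustration under stated assumptions, not the program's implementation.
The FileInfo record and all function names are hypothetical, "user" is
assumed to mean the owning uid, and keepmostlinks is assumed to prefer
the file with the highest hard link count.

```python
from collections import defaultdict, namedtuple

# Hypothetical record; the program's actual database schema is not shown here.
FileInfo = namedtuple("FileInfo", "path checksum size mtime uid mode nlink")


def duplicate_key(info, keep_times=True, keep_users=True, keep_modes=True):
    """Build the key under which two files count as duplicates.

    With a keepdifferent* option set to yes, that attribute becomes part
    of the key, so files differing in it are never merged.
    """
    key = [info.checksum, info.size]
    if keep_times:
        key.append(info.mtime)
    if keep_users:
        key.append(info.uid)  # assumption: "user" means the owning uid
    if keep_modes:
        key.append(info.mode)
    return tuple(key)


def order_group(group, strategy="keepfirst"):
    """Return the group with the file to keep in first position."""
    if strategy == "keepfirst":
        return list(group)  # command-line order is preserved
    if strategy == "keepoldest":
        return sorted(group, key=lambda f: f.mtime)
    if strategy == "keeplatest":
        return sorted(group, key=lambda f: f.mtime, reverse=True)
    if strategy == "keepmostlinks":
        # Assumption: "most links" refers to the hard link count.
        return sorted(group, key=lambda f: f.nlink, reverse=True)
    raise ValueError("unknown strategy: %r" % (strategy,))


def find_duplicates(infos, strategy="keepfirst", **keep):
    """Group files by duplicate_key and order each group by strategy."""
    groups = defaultdict(list)
    for info in infos:
        groups[duplicate_key(info, **keep)].append(info)
    return [order_group(g, strategy) for g in groups.values() if len(g) > 1]
```

Because sorted() is stable, ties under keepoldest, keeplatest and
keepmostlinks fall back to command-line order, which matches the idea
that earlier arguments are "best" when all other things are equal.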