Usage: fileuniq.py [options] [list of files and directories, default .]

This program searches for duplicate files and (optionally) eliminates them,
deleting the files or replacing them with hard or soft links.

Copyright Lorens Kockum 2009

General defaults (may be overridden by options):
- A record of found duplicates is output to screen
- Nothing is modified on disk

The order of filenames is important. If you pass dir1 dir2 as arguments,
then, all other things being equal, filenames in the dir1 directory tree
will be considered "best". This may influence the actions taken. For
example, the default is to output the "best" filename first in the list of
duplicates.

One use case is
    ./fileuniq.py --duplicates=symlink --strategy=keepfirst
followed by a list of directories in order of age. After running, the
oldest/first directory will have files, while the other directories will
have lots of symlinks to files in "older" directories, and real files only
if they are the first occurrence of the file found.

Requirements:
- python (tested on 2.5.2 but earlier versions should work).
- sqlite3
- Unix filesystem semantics (hard and symbolic links, stat()...)

Not tested on anything other than Linux, not tested on filesystem mounts
with maybe different semantics.

This program is released under the GNU General Public License. If you have
not received a copy of the GPL with this script, you can download it at
http://www.gnu.org/licenses/gpl.html

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.

You might lose your data! At the time of the alpha release this program has
been tested by one person on one computer, one operating system, and one
set of files. You should read the program thoroughly and play with the
verbosity options before trusting your data to this program, and even then
you should of course have backups.
You should also thoroughly understand the implications of creating hard or
symbolic links in your directory tree.

Options:
  -h, --help            show this help message and exit
  -v, --verbose         Be verbose. Specify several times to increase
                        verbosity. Default verbosity is 1.
                          Verbosity 0: No output
                          Verbosity 1: Only fatal errors and warnings
                          Verbosity 2: Note for every action executed
                          Verbosity 3: General progress indicators
                          Verbosity 4: debugging files examined
                          Verbosity 5: debugging database
  -q, --quiet           Decrease verbosity (set verbosity to zero). Except
                        for the list (when it is requested), do not
                        generate any output.
  --database=DATABASE   The only implemented possibilities are a filename
                        and the string ":memory:". The default is
                        ":memory:".
                        If the string is ":memory:" then the database used
                        is sqlite3 in memory. If not, the string must be a
                        valid filename. If the file named does not exist,
                        it is created as an sqlite3 database. If the file
                        exists, it must be a valid sqlite3 database file.
                        Future implementations may permit URLs such as
                        mysql://user:pass@server/database
  --initialize          Initializes the database. This is assumed when the
                        database is :memory: or a file that does not exist.
  --update=UPDATE       Choices: update, noupdate, forceupdate
                        Default: update
                        "update" updates an existing or just created
                        database. All files derived from the command line
                        are stat-ed, the checksum is calculated or
                        recalculated if there is any reason to believe it
                        has changed, and the information is updated in the
                        database. This is the default.
                        "noupdate" does not go through the files derived
                        from the command line to stat them. This can be
                        used when one is confident that files have not
                        changed and one does not wish to take the time to
                        stat all the files. One can for example invoke the
                        program with --update=update and --duplicates=noop
                        on a set of new files, and then invoke it again
                        with --update=noupdate on a much larger set of
                        files. Note that no action will be taken based on
                        incorrect information, since a check is forced
                        before any file system modification is made.
                        "forceupdate" forces recalculation of the stored
                        checksums even if there is no indication that the
                        files have changed. This option exists for
                        completeness. It would probably only be useful when
                        one suspects corruption of files referenced from
                        the command line (same size and same time of last
                        modification, but different data), BUT the database
                        contains information on files not referenced from
                        the command line that one does not want to modify.
                        Otherwise one would simply delete the database.
  --extradata=EXTRADATA
                        Choices: discard, ignore
                        Default: ignore
                        This concerns files not derived from the command
                        line which are present in the database from
                        previous runs.
                        "discard" will remove the entries from the
                        database.
                        "ignore" will ignore these entries (except that
                        their "level" will be set to zero). This is the
                        default.
                        I do not wish to take these entries into account
                        because I do not think it is a good idea to work on
                        files not specified on the command line. If you
                        wish to take these entries into account you should
                        use --update=noupdate and / as argument. You could
                        update the files that you do want to update with a
                        previous invocation using --duplicates=noop.
  --duplicates=DUPLICATES
                        Choices: list, delete, symlink, hardlink, noop
                        Default: list
                        This is the essential option that describes what
                        the program should do with all these duplicates.
                        The choice between duplicates (which file is
                        replaced by a symlink or deleted, for example, or
                        simply which file comes first in the list) is
                        governed by the --strategy specification.
  --strategy=STRATEGY   Choices: keepfirst, keepoldest, keeplatest,
                        keepmostlinks
                        Default: keepfirst
                        When the action is to list duplicates, the file to
                        be kept is listed first. Oldest and latest relate
                        to time of last modification.
  --keepdifferenttimes=KEEPDIFFERENTTIMES
                        Choices: yes, no
                        Default: yes
                        This is relevant to last modification time. Other
                        times are not compared.
                        The modification time is kept in the inode, so when
                        one finds two files that are identical except for
                        the modification time, one must decide if the time
                        is important enough to keep multiple versions of
                        the data. The default is to retain all the
                        information.
  --keepdifferentusers=KEEPDIFFERENTUSERS
                        Choices: yes, no
                        Default: yes
                        The discussion is the same as for option
                        --keepdifferenttimes, except that usually changing
                        the user can have more consequences than changing
                        the timestamp.
  --keepdifferentmodes=KEEPDIFFERENTMODES
                        Choices: yes, no
                        Default: yes
                        The discussion is the same as for option
                        --keepdifferenttimes, except that usually changing
                        the file mode can have more consequences than
                        changing the timestamp.
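
Example: one way to picture the --keepdifferent* options is as extra fields
in the key that decides whether two files count as duplicates. The sketch
below is not taken from fileuniq.py. The function names (file_checksum,
duplicate_key, group_duplicates), the use of SHA-256, and the use of
Python 3 are all assumptions made for illustration. Groups are returned in
the order the paths were given, which loosely mirrors --strategy=keepfirst.

```python
import hashlib
import os
from collections import OrderedDict


def file_checksum(path, chunk_size=65536):
    """Return a hex digest of the file contents (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_key(path, keep_times=True, keep_users=True, keep_modes=True):
    """Build the key under which a file is grouped with its duplicates.

    Size and checksum are always part of the key. Each keep_* flag that is
    True adds one inode attribute, so files differing in that attribute are
    no longer treated as duplicates of each other.
    """
    info = os.stat(path)
    key = [info.st_size, file_checksum(path)]
    if keep_times:
        key.append(info.st_mtime_ns)  # last modification time only
    if keep_users:
        key.append(info.st_uid)
    if keep_modes:
        key.append(info.st_mode)
    return tuple(key)


def group_duplicates(paths, **flags):
    """Return lists of duplicate paths, earliest-given path first."""
    groups = OrderedDict()
    for path in paths:
        groups.setdefault(duplicate_key(path, **flags), []).append(path)
    return [group for group in groups.values() if len(group) > 1]
```

With the defaults (all three flags on), two files with the same data but
different modification times end up in separate groups and are left alone.
Passing keep_times=False merges them into one group, which corresponds to
the trade-off described for --keepdifferenttimes=no.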