Today I learned of the MIT-licenced fdupes by Adrian Lopez, which performs hashed and byte-for-byte file comparisons and presents lists of duplicated files. It works well, and you should use it. A big shoutout to ehaupt@ for maintaining the FreeBSD port, and ef at Bonn University for maintaining it in pkgsrc.
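For the curious, the hash-then-compare idea can be sketched in a few lines of POSIX shell. This is only an illustration, not how fdupes itself is implemented: same_file is a made-up helper name, and cksum stands in for whatever hash the real tool uses.

```shell
# Hypothetical helper: succeeds only if the two files have identical contents.
same_file() {
    # Stage 1: cheap check. Different checksum or size means different files.
    # Reading from stdin keeps the file name out of cksum's output.
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ] || return 1
    # Stage 2: checksums can collide, so confirm byte for byte.
    cmp -s "$1" "$2"
}
```

Calling same_file a.txt b.txt && echo duplicate would then print "duplicate" only when both stages agree.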
On my Mac I use the excellent open source dupeGuru GUI application, but I had a need to find duplicates across terabytes of data on one of my FreeBSD microservers over the weekend. I wanted a tool that I could easily run in a detached screen session, and fdupes fit the bill like a buff platypus. What?
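I took a ZFS snapshot of my dataset in case things went pear-shaped and I needed to roll back, then set it to auto-delete duplicates in my target directory. With -d and -N together, fdupes keeps the first file in each set of duplicates and deletes the rest without asking. Substitute your pool, dataset, and directory as required: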
# zfs snapshot pool/tank@backup
$ echo "THIS WILL DELETE DATA"
$ fdupes -r -d -N $directory
If you just want it to identify duplicates:
$ fdupes -r $directory > dupes.log
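fdupes writes each set of duplicates as a block of paths, one per line, with a blank line between sets. Assuming that layout, a rough sketch for getting a quick tally out of the log might look like this. summarise_dupes is a hypothetical helper name, not part of fdupes, and it will miscount any file names that contain newlines.

```shell
# Hypothetical helper: tally duplicate sets in an fdupes log.
summarise_dupes() {
    # RS="" puts awk in paragraph mode, so each blank-line-separated set is
    # one record; FS="\n" makes each path one field, spaces and all.
    awk 'BEGIN { RS = ""; FS = "\n" }
         { sets++; extra += NF - 1 }
         END { printf "%d sets, %d redundant files\n", sets, extra }' "$1"
}
```

Running summarise_dupes dupes.log prints one line of totals, where "redundant" counts every file in a set beyond the first.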
Or if you want it to prompt you as it finds them:
$ fdupes -r -d $directory
Someone will probably tell me that ZFS has deduping, but it’s not applicable in this case. This was just a quick and dirty job to clean up a recursively rsync’d mess I made while half-asleep, and I just use lz4 compression for all my pools now anyway.
I could have cleaned it myself, but why not let The Machine™ do it?