Recognizing duplicate inodes?

kopia has been great in my usage so far but I have just started it on a problem task: making a backup of a time machine backup drive. Time machine replicates the filesystem over and over again in directories named for the date of the snapshot, with duplicate files being hard links.

If kopia is smart enough to recognize these duplicate inodes and avoid rehashing them, the backup will be very fast, because there’s only 1 terabyte of data on the drive and virtually all of its content is already in the kopia blob store. I know this because there is already a snapshot of the latest time machine backup in the store, so only files that were heavily modified or deleted will create any new blobs.

So, the question is: will kopia skip even checksumming duplicate inodes, or will it wind up checksumming every file repeatedly? On this particular drive, there are 76 time machine snapshots, meaning it’s the difference between hashing about 1.1 terabytes and hashing 76 terabytes.

Thanks

Kopia generally avoids rehashing files if the attributes (name, modification time and size) match the last snapshot - it will simply assume that the file has not changed (this behavior can be overridden).

The logic today relies on walking the current directory (local filesystem) and the previous snapshot (remote filesystem) in parallel, comparing file attributes side by side. In fact, multiple previous snapshots are supported: they are all consulted in parallel, and if any of them matches, Kopia avoids hashing (this helps performance when previous snapshots were incomplete).

In principle we could support additional “previous” snapshot references, either specified on the command line or via some policy/rule. For example, I think the pattern where you have a parent directory with time-based subdirectories could be recognized to speed this up:

Consider:

/some/dir/2021-04-15
/some/dir/2021-04-16
/some/dir/2021-04-17
/some/dir/2021-04-18
/some/dir/2021-04-19
/some/dir/2021-04-20
/some/dir/2021-04-21

When snapshotting a directory - say /some/dir/2021-04-19 - Kopia could detect the pattern and use the lexicographically-previous entry or entries (which would be /some/dir/2021-04-18, /some/dir/2021-04-17, /some/dir/2021-04-16, …), assuming they are similar at all. This should produce a big performance improvement.
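A minimal sketch of that pattern detection - picking the lexicographically-previous sibling names - might look like this (the function name and shape are hypothetical, not Kopia code):

```go
package main

import (
	"fmt"
	"sort"
)

// previousSiblings returns up to n sibling names that sort lexicographically
// before target. For a directory like /some/dir/2021-04-19, the candidate
// "previous snapshot" directories become 2021-04-18, 2021-04-17, ...
func previousSiblings(siblings []string, target string, n int) []string {
	sorted := append([]string(nil), siblings...)
	sort.Strings(sorted)

	// Find target's position and walk backwards from it.
	i := sort.SearchStrings(sorted, target)
	var out []string
	for j := i - 1; j >= 0 && len(out) < n; j-- {
		out = append(out, sorted[j])
	}
	return out
}

func main() {
	dirs := []string{
		"2021-04-15", "2021-04-16", "2021-04-17",
		"2021-04-18", "2021-04-19", "2021-04-20", "2021-04-21",
	}
	fmt.Println(previousSiblings(dirs, "2021-04-19", 3))
}
```

Since date-stamped names sort chronologically, plain string sorting is enough here; no date parsing is needed.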

Sounds like a good experiment to try. The user experience is the hardest part and very important to get right. I’ll be happy to discuss this further and, if somebody is interested in implementing it, to help with codebase orientation and code reviews.

I think for my case it would be much simpler to s/name/inode/ in your approach. If the inode, modification time, and size match, that is as strong a signal as name/modification time/size, as it is difficult at best to rename a file without touching its modification time.

Is there a reason to avoid using inode numbers instead of names?

To expand a little bit, consider:

$ stat -f "%i %m %z" /Volumes/Proraid\ Backup/Backups.backupdb/.../2010-07-19-063223/Home/iPhoto.app/Contents/Info.plist
3961451 1158969800 2420
$ stat -f "%i %m %z" /Volumes/Proraid\ Backup/Backups.backupdb/.../2010-07-19-063223/Home/iPhoto.app/Contents/Info.plist
3961451 1158969800 2420

In the above example, it’s true that the filenames are very similar. Yet if the filenames were identical and the inodes were different, I’d rather hash again, because I know the underlying filesystem does not in fact have the same file I saw before (for example, it may have been restored by someone else’s backup program - so why should I assume that backup worked and produced the same hash Kopia already has?). I’d argue that different inodes are a strong signal that the underlying content could be corrupt, since it has definitely been rewritten, while different filenames with the same inode are a sign that the user knows how to use ln to make hard links.

Thanks

Caching by inode within the same upload session should be possible (keep track of a mapping from inode → (metadata, object ID)). I’ll be happy to review PRs.
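A minimal sketch of such a per-session cache, with made-up stand-in types (not Kopia's actual snapshot.DirEntry or object ID types):

```go
package main

import "fmt"

// dirEntry is a hypothetical stand-in for the cached metadata + object ID.
type dirEntry struct {
	Name     string
	Size     int64
	ModTime  int64
	ObjectID string
}

type inodeID struct{ dev, ino uint64 }

// sessionCache remembers the result of hashing each inode once per upload
// session, so hard links to the same inode can reuse the object ID without
// rehashing.
type sessionCache struct {
	m map[inodeID]dirEntry
}

func newSessionCache() *sessionCache {
	return &sessionCache{m: map[inodeID]dirEntry{}}
}

// lookup returns the cached entry only when size and mtime still match,
// guarding against an inode whose content changed mid-session.
func (c *sessionCache) lookup(k inodeID, size, mtime int64) (dirEntry, bool) {
	e, ok := c.m[k]
	if !ok || e.Size != size || e.ModTime != mtime {
		return dirEntry{}, false
	}
	return e, true
}

func (c *sessionCache) store(k inodeID, e dirEntry) { c.m[k] = e }

func main() {
	c := newSessionCache()
	k := inodeID{dev: 1, ino: 3961451}
	c.store(k, dirEntry{Name: "Info.plist", Size: 2420, ModTime: 1158969800, ObjectID: "xabc"})

	if e, ok := c.lookup(k, 2420, 1158969800); ok {
		fmt.Println("reuse", e.ObjectID) // hard link hit: skip hashing
	}
	if _, ok := c.lookup(k, 2421, 1158969800); !ok {
		fmt.Println("size changed: rehash")
	}
}
```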

OK for my purposes ‘within the same upload session’ would be sufficient and an important improvement. Where would such a cache logically hook into the code?

Also, I would like to mostly retract what I said above about renaming triggering an mtime change - it only will if the program in question rewrites the file. Desirable outcomes of persisting this inode → (metadata, object ID) cache include being able to deal with directories being moved around wholesale or files being renamed by tools like mv. However, if we’re uncomfortable persisting it, that matters less to me, as the operation I currently have running will take about 28 days or more at present speeds, which seems like enough time to fix up the simple patch.

Finally, would you consider adding inode to the metadata used for cache hits generally? I.e., name/inode/size/mtime instead of just name/size/mtime? I am not really comfortable with the idea that if I tar/untar a directory repeatedly, Kopia will not re-hash the rewritten tree. I would expect it to.

I’m maybe also confused about the build process. I cloned the repo and ran make and got:

$ ls dist/testing_darwin_amd64/
kopia.exe		testingaction.exe

Turns out these files are macOS CLI executables, but the naming really surprised me. Are they meant to be named .exe and all is well, or is my environment misconfigured in some way? Tests looked like they passed, though I didn’t watch the output too closely.

As near as I can tell, you’re suggesting modifying findCachedEntry in upload.go to allow a cache hit on a new in-memory cache. Is that the spot you mean?

One possible approach: in local_fs.go:newEntry record an inode → names map (call it aliases or something like that). Then in findCachedEntry look not only for the name of the file as it does today, but also for the names of all the aliases of the file.

This doesn’t seem particularly elegant to me, but does seem similar to the seemingly global table of Entries that is sorted by name to enable finding prior entries by filename; I essentially want a mapping from inodes to those names so that the lookup being done in findCachedEntry can work as before without perturbing the code too much.

It also introduces a new call to os.Stat, which as far as I can see the current code never does.

BTW, when developing, make install is the command I use the most - it installs kopia to ~/go/bin

Alternatively make kopia will just place the command in dist/kopia_darwin_universal/kopia

The exe you found was a special testing-only binary with some test hooks enabled (and it’s called .exe because on Windows they must be called that, and I was lazy).

Happy to discuss in a PR if you prefer (over the weekend is when I have most time to work on Kopia)

I can make a PR once I have a vague idea of how you want it to work. Just suggesting which functions you’d prefer to see modified would be plenty of direction. I’ll check back on the weekend for updates.

BTW, it’s important that this considers different OSes and filesystems, including ones in which inodes are non-existent (Windows) or non-durable (kept only in RAM, like FUSE filesystems). The metadata structure will out of necessity be a least common denominator with OS-specific extensions.

I was thinking about this a bit more, and I would try the following approach:

  1. Define an optional interface (HasInode) with one method, GetInode(), and implement it on the localfs Entry only (so inode numbers would not be persisted in the repository).

  2. In each upload session we’d have a map/cache “(dev, rdev, inodeNumber) → *snapshot.DirEntry”.

When we encounter an entry, we first check the previous-snapshot cache; if it’s there, we reuse it. Next (optionally, flag-protected), we’d consult the inode map, and if an entry with the same (dev, rdev, inode) is there and its size/mtime still match, we reuse the dir entry.
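The optional-interface idea from step 1 can be sketched with Go's type-assertion mechanism. Everything here is illustrative (localEntry/remoteEntry are stand-ins, not the real localfs or snapshot types):

```go
package main

import "fmt"

// HasInode would be implemented only by the local-filesystem entry, so inode
// numbers never get persisted in the repository.
type HasInode interface {
	GetInode() (dev, rdev, ino uint64)
}

type entry interface {
	Name() string
}

// localEntry stands in for localfs's Entry.
type localEntry struct {
	name           string
	dev, rdev, ino uint64
}

func (e localEntry) Name() string                      { return e.name }
func (e localEntry) GetInode() (dev, rdev, ino uint64) { return e.dev, e.rdev, e.ino }

// remoteEntry stands in for an entry from a previous snapshot; it has no inode.
type remoteEntry struct{ name string }

func (e remoteEntry) Name() string { return e.name }

// describe shows the flag-protected lookup path: only entries implementing
// HasInode participate in the inode cache; everything else falls back to the
// existing name/size/mtime comparison.
func describe(e entry) string {
	if hi, ok := e.(HasInode); ok {
		d, r, i := hi.GetInode()
		return fmt.Sprintf("%s: cacheable by (%d,%d,%d)", e.Name(), d, r, i)
	}
	return e.Name() + ": no inode, fall back to name/size/mtime"
}

func main() {
	fmt.Println(describe(localEntry{name: "Info.plist", dev: 1, ino: 3961451}))
	fmt.Println(describe(remoteEntry{name: "Info.plist"}))
}
```

This also degrades gracefully on the platforms mentioned earlier: entries from filesystems without durable inodes simply wouldn't implement the interface.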

The big problem with this approach is memory bloat: we’d necessarily keep all directory entries in RAM for the duration of the upload, which is why this is not always applicable and should be opt-in.

There are several approaches to keep memory usage in check - we can limit the inode caching behavior to N directory levels, have some kind of bounded LRU cache where the oldest items start dropping out once we hit the limit, etc.
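The bounded-LRU idea can be sketched with container/list from the standard library; this is purely illustrative of the eviction behavior, not a proposed implementation:

```go
package main

import (
	"container/list"
	"fmt"
)

type lruItem struct {
	key uint64 // inode number (illustrative key)
	val string // stand-in for the cached dir entry / object ID
}

// boundedLRU caps the inode cache at max entries; when full, the least
// recently used mapping is evicted, trading some cache misses for bounded
// memory use.
type boundedLRU struct {
	max   int
	order *list.List               // front = most recently used
	items map[uint64]*list.Element // inode -> element in order
}

func newBoundedLRU(max int) *boundedLRU {
	return &boundedLRU{max: max, order: list.New(), items: map[uint64]*list.Element{}}
}

func (c *boundedLRU) get(key uint64) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*lruItem).val, true
}

func (c *boundedLRU) put(key uint64, val string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruItem).val = val
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruItem).key)
	}
	c.items[key] = c.order.PushFront(&lruItem{key: key, val: val})
}

func main() {
	c := newBoundedLRU(2)
	c.put(1, "obj-a")
	c.put(2, "obj-b")
	c.get(1)          // touch 1, so 2 becomes the LRU entry
	c.put(3, "obj-c") // evicts 2
	_, ok := c.get(2)
	fmt.Println(ok)
}
```

An evicted inode just means one extra hash for a later hard link to it, so correctness is unaffected - only the speedup degrades.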

Overall, it’s an interesting and non-trivial project. I’d be glad to accept this into Kopia and happy to help develop it.