Force hashing of specific subdirectory

I’ve configured a repository and a policy from within Kopia UI to snapshot a large directory (a few hundred GB) regularly. It has been working wonderfully.

Within that directory, there is a small subdirectory of programmatically-generated files whose contents periodically change, but whose modification timestamps are preserved for good reasons. Kopia is not including those changes in snapshots due to the unchanged timestamps.

Is there any method or trick I can use to force Kopia to always hash that specific directory, ideally without affecting the other files in the repository?

I saw a --force-hash option in the documentation, but I’m not sure if/how I could implement that for a specific subdirectory while preserving the policy options. I am capable of writing my own scripts to invoke the CLI, rather than use the UI, if necessary.

Any tips or advice would be greatly appreciated!

I guess you could skip this folder on the “normal” run with a .kopiaignore in one of its parent directories, then do a separate run of only this folder with --force-hash=100 set so it always checks the contents of those files.
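
For example, a rough sketch of that two-pass approach from the command line (paths like /data and /data/generated are placeholders, and it assumes the repository is already connected):

```
# Exclude the generated subdirectory from the normal snapshot.
# Kopia reads .kopiaignore files (gitignore-style patterns) in the snapshotted tree;
# the leading slash anchors the pattern to the directory containing the .kopiaignore.
echo "/generated/" >> /data/.kopiaignore

# Normal incremental snapshot of the big tree; /data/generated is skipped.
kopia snapshot create /data

# Separate snapshot of only the generated subdirectory, with --force-hash=100
# so every file's content is re-hashed regardless of its modification time.
kopia snapshot create --force-hash=100 /data/generated
```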


Thanks! After reading more about how policies work (I was unsure about policy inheritance), that does sound like a possible solution.

However, my concern is that this would create two separate snapshots on each run, which would affect retention and make browsing and restoring more difficult. Is there a way around that?

It would be an interesting trick if you could selectively clear the local cache for this subdirectory, so Kopia would think it was seeing it for the first time; while scanning the directory it would then find that most of the content is already uploaded. If that were possible, your script could simply request a selective clear and then run as usual.

Setting force hash per directory using policies shouldn’t be too hard to implement. Patches welcome.


I think it’s more concerning that the original poster was the one who identified changed files that weren’t backed up. That’s not ideal, to be honest.

That’s how most file sync tools work by default: it is assumed that unchanged metadata (modification time) implies unchanged content. Standard file system operations ensure that; only if you explicitly change the modification time or use some low-level file manipulation tools can that assumption break. For such cases you can force hashing, but that’s really expensive for any appreciable amount of data. Selective forced hashing on a per-directory basis should strike the right balance, in my opinion.

If you edit a file’s content and then reset the changed/modified time back to the same microsecond it had before, you are purposefully using the OS to trick “some program”, while at the same time tricking more or less any backup software. I would bet 99.9% of them will trust that a file with the same name, exactly the same date and the same length is unchanged since the last backup. That is the basis for more or less all incremental/differential backups, and the only “fix” would be to always do full backups and always checksum/encrypt/compress and/or send ALL data to the remote side.

Since in-place editing of file content without altering the changed/modified date is really uncommon and rather hard, this situation is mostly a non-problem. If you deliberately have the OS and the filesystem lie to programs that ask for correct metadata about a file, then this is the behaviour you get.
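
For illustration, a hypothetical shell sketch of that kind of edit (file names are placeholders): the content changes, but name, size and modification time end up identical, so metadata-based change detection sees nothing.

```
# data.bin previously contains 4 bytes: "AAAA"
cp -p data.bin /tmp/mtime-ref        # keep a reference copy carrying the original mtime

printf 'BBBB' > data.bin             # rewrite with different bytes of the same length
touch -r /tmp/mtime-ref data.bin     # copy the original modification time back

# Name, size and mtime now look exactly as they did before the edit, so any tool
# that trusts that metadata (Kopia included, by default) treats the file as unchanged.
```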

I do not question the motives for this kind of trickery against whatever other program “needs” the files to appear untouched while actually having altered content, but when you go down that road, you should be prepared to pay the cost of doing so: perhaps having two separate backup policies or two runs, one normal incremental run that works as usual and one specific to this directory that pays the extra cost of re-reading every file at each backup.