Making maintenance extra safe wrt clients suspended while creating a snapshot

I have a shared repository where one of the clients is a laptop that sometimes stays suspended for a few days. This client (“A”) runs periodic (daily) snapshots. A different machine (“B”, which is regularly on-line) is in charge of repository maintenance.

According to my understanding of maintenance (based on the docs, this forum, and reading the code briefly), the following scenario could happen:

If client A starts creating a snapshot with a lot of new data, suspends (some time after the latest checkpoint) and stays offline for a few days, B will run snapshot-gc and mark the last contents written by A as unused. If A doesn’t wake up and also write the next checkpoint before B runs full maintenance again (24 hours by default), the next snapshot-gc will delete these contents. This will render A’s snapshot incomplete and unusable.

  1. Can this really happen, or is there something in the logic that prevents it?
  2. Will A detect the issue while creating the affected snapshot and report an error, or will the snapshot finish as if nothing has happened? (I think B will eventually detect the missing contents during maintenance.)
  3. Will the next snapshot created by A re-write the missing contents? Will this at least make the next snapshot complete, or will it even “heal” the originally broken snapshot?

Is there a way to ask Kopia for something like maintenance --safety=extra, to prevent this issue and/or mitigate its impact?

  1. I guess making A the maintenance owner would help, but that’s suboptimal for many reasons.
  2. Is maintenance set --full-interval=240h enough to prevent this unless A stays offline for 10 days?
  3. Anything else that does not require a really long maintenance interval?
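
A rough back-of-the-envelope check of option 2, as a minimal Python sketch. It assumes the mark-then-delete model described above (unused contents are marked on one full maintenance run and deleted on the next), which is my reading and not confirmed Kopia behavior; `deletion_window` is a hypothetical helper, and any extra safety margins Kopia may apply are ignored.

```python
from datetime import timedelta

def deletion_window(full_interval: timedelta) -> tuple[timedelta, timedelta]:
    """Return (earliest, latest) time after suspension at which an
    unreferenced content could be deleted, ASSUMING it is marked on the
    first full maintenance after it was written and deleted on the next."""
    # Earliest: maintenance runs right after A suspends (marks immediately),
    # so the deleting run comes one interval later.
    earliest = full_interval
    # Latest: maintenance ran just before the content was written, so the
    # marking run is one interval away and the deleting run two intervals away.
    latest = 2 * full_interval
    return earliest, latest

earliest, latest = deletion_window(timedelta(hours=240))
print(f"content could be deleted between {earliest.days} and {latest.days} days after suspension")
```

Under these assumptions, --full-interval=240h would protect A for at least 10 days offline, and for up to 20 days depending on where in the maintenance cycle the suspension falls.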

As explained in this FAQ, Kopia creates an incomplete snapshot that is not garbage-collected. When it wakes up, client A should (mostly) pick up where it left off and continue until the snapshot is complete.

I know about these incomplete snapshots (that’s the “latest checkpoint” I mentioned in my question), but my question is what happens to content objects written after the most recent incomplete snapshot (checkpoint). If I understand it correctly, those aren’t referenced by anything until the next incomplete snapshot is made (which might be days later) and can be garbage collected in the meantime.
Consider this sequence of events:

  1. Content objects X, Y are written by machine A
  2. A temporary (checkpoint/incomplete) snapshot is made, referencing X and Y
  3. Content object Z is written by machine A
  4. Writing machine (A) is suspended for a few days
  5. Another machine B (maintenance owner) runs full maintenance, sees Z as unreferenced and marks it as a candidate for deletion
  6. A day later, B runs full maintenance again, sees Z still unreferenced and deletes it
  7. Machine A eventually wakes up from sleep and continues the snapshot process, not knowing Z got deleted in the meantime, ultimately resulting in a broken snapshot
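
The sequence above can be sketched as a toy Python model. It only mirrors the scenario as described in this post (mark on one full maintenance run, delete on the next); `ToyRepo` and its methods are made-up names, and this is not Kopia's actual garbage-collection logic.

```python
# Toy model of the scenario above; NOT Kopia's real implementation.
class ToyRepo:
    def __init__(self):
        self.contents = set()    # content objects stored in the repository
        self.referenced = set()  # contents reachable from any (checkpoint) snapshot
        self.marked = set()      # contents flagged as unused by the previous run

    def write(self, content_id):
        self.contents.add(content_id)

    def checkpoint(self, content_ids):
        # A checkpoint/incomplete snapshot references the given contents.
        self.referenced |= set(content_ids)

    def full_maintenance(self):
        unreferenced = self.contents - self.referenced
        # Delete contents that were marked last time and are still unreferenced.
        deleted = self.marked & unreferenced
        self.contents -= deleted
        # Mark the remaining unreferenced contents as deletion candidates.
        self.marked = unreferenced - deleted
        return deleted

repo = ToyRepo()
repo.write("X")
repo.write("Y")                       # step 1
repo.checkpoint(["X", "Y"])           # step 2
repo.write("Z")                       # step 3, then A is suspended (step 4)
first_run = repo.full_maintenance()   # step 5: Z is only marked
second_run = repo.full_maintenance()  # step 6: Z is deleted
repo.checkpoint(["X", "Y", "Z"])      # step 7: A finishes, unaware Z is gone
missing = repo.referenced - repo.contents
print(first_run, second_run, missing)
```

In this model the final snapshot references Z even though Z no longer exists in the repository, which is exactly the outcome I am worried about.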

I don’t know the internals well enough to give you a definitive answer but I’m pretty sure the developers thought about this scenario and made sure this doesn’t cause issues.

You should ask this question in Kopia’s Slack channel, as the developers are way more active over there.

Even if the worst case is a bad snapshot, I think that for many similar programs, a box that does a partial send and then goes away for days will lead to… a partial send. In this case, X and Y go over fine, Z does not. Next time machine A does a backup, it will have to send Z over again as if it was never there, or as if it did exist long ago but expired in the meantime.
So unless there is an issue where A thinks it has sent Z and never tries to send it again, I would just write it off as “broken snapshots are broken”. There is only so much Kopia or any similar system can do. It can’t checkpoint after each file, so what’s left is to make sure A knows it needs to send Z again next time.

That’s sort of what questions #2 and #3 are about in my original post above. I can easily live with that one snapshot being broken. I really want to avoid all subsequent snapshots being silently corrupted because they rely on Z being there when it actually isn’t. So my question was: Is A going to detect this and either self-heal, or at least make sure the issue does not affect the entire repo? Or will the only result be that maintenance done by B will fail with some catastrophic inconsistency error, requiring manual cleanup?

The snapshot wouldn’t be broken, but incomplete. It won’t affect any future snapshot of the same source. If Kopia never uploaded “Z” then nothing can depend on it.

When it wakes up two days later, the client would of course know that this run was never completed, and also that it never received an acknowledgement that object Z was successfully stored, so the next run would still need to send Z again.
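
The "send Z again" behavior can be illustrated with a minimal Python sketch. It assumes the client decides what to upload by checking whether a content ID is present in the repository, rather than trusting a local record of earlier uploads; that is an assumption for illustration, not a statement about how Kopia is implemented, and `content_id` and `backup` are hypothetical names.

```python
import hashlib

def content_id(data: bytes) -> str:
    # Content-addressed ID: the hash of the data itself.
    return hashlib.sha256(data).hexdigest()

def backup(files: dict[str, bytes], repo: dict[str, bytes]) -> list[str]:
    """Upload every file whose content is not already in the repo.
    Returns the names of the files that actually had to be sent."""
    uploaded = []
    for name, data in files.items():
        cid = content_id(data)
        if cid not in repo:   # ask the repository, not a local cache
            repo[cid] = data
            uploaded.append(name)
    return uploaded

files = {"x": b"data X", "y": b"data Y", "z": b"data Z"}
repo: dict[str, bytes] = {}

backup(files, repo)                # first run stores X, Y and Z
del repo[content_id(b"data Z")]    # Z disappears from the repo while A is away
resent = backup(files, repo)       # next run notices Z is missing
print(resent)
```

Under that assumption, the next run sends only Z again, and the resulting snapshot is complete.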