PLEASE READ: Don't use --safety=none for routine maintenance

I’ve noticed folks here often paste command-line examples that run:

$ kopia maintenance --safety=none

I can’t stress this enough:

This is NOT recommended for most users and could be outright dangerous in some cases.

It is recommended to ALWAYS use the default safety settings, which let Kopia apply appropriate safety margins. In the short term this means the most recently written data in the repository may not be compacted immediately, but if maintenance runs regularly the repository will be fully compacted over time.

Safety buffers in Kopia serve three purposes:

  • ensuring proper cache expiration - for good performance Kopia caches a lot of data. For correctness, we must ensure certain information reliably propagates to all the clients that need to see it. Those clients could be active at the time, so we must wait long enough for all of them to refresh their caches.

  • clock skew correction - Kopia can tolerate clock skew between client(s) and server of up to several minutes. --safety=none tells it “trust me, there is ZERO clock skew”, which is almost always dangerous and can lead to premature blob deletion and data corruption.

  • provider consistency issues - not all storage providers are created equal, and some have weaker consistency than others. For example, after writing a file, that file may not show up in a directory listing for a brief moment (milliseconds, sometimes seconds, or even minutes on very slow connections). Networked filesystems are typical examples of this, but many providers exhibit similar behavior to some degree.
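To make the clock skew point concrete, here is a minimal Python sketch. It is not Kopia code: the function name `old_enough_to_delete`, the one-minute minimum age, the three-minute skew, and the ten-minute margin are all made up for illustration. It only shows how a fast clock on the maintenance machine makes a freshly written blob look older than it is, and how an extra margin absorbs that error.

```python
from datetime import datetime, timedelta, timezone


def old_enough_to_delete(blob_mtime, local_now, min_age, margin):
    """Return True only if the blob's apparent age exceeds min_age plus margin.

    Hypothetical helper for illustration; not part of Kopia.
    """
    return local_now - blob_mtime > min_age + margin


true_now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
# Another client wrote this blob 30 seconds ago (timestamp is correct).
blob_mtime = true_now - timedelta(seconds=30)
# The machine running maintenance has a clock that is 3 minutes fast.
skewed_now = true_now + timedelta(minutes=3)
# Illustrative threshold: blobs younger than 1 minute must not be touched.
min_age = timedelta(minutes=1)

# Zero margin: the blob appears 3m30s old, so it wrongly passes the age check.
no_margin = old_enough_to_delete(blob_mtime, skewed_now, min_age, timedelta(0))
# A 10-minute margin (assumed value) absorbs the skew and the blob is left alone.
with_margin = old_enough_to_delete(
    blob_mtime, skewed_now, min_age, timedelta(minutes=10)
)

print(no_margin, with_margin)
```

With zero margin the age check passes for a blob that is really only 30 seconds old. With the margin it does not.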

When running garbage collection during maintenance, it is critical for Kopia to see ALL data written up to that point. The default safety margins allow for quite significant provider-level inconsistencies and clock skews. If Kopia does not see everything, its mark/sweep garbage collection may incorrectly treat blobs as unreferenced and delete them prematurely, leading to data loss. This has actually happened to several folks, and --safety=none was the leading cause of data loss we’ve seen so far.
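The failure mode can be sketched with a toy mark/sweep in Python. This is a simplified model under stated assumptions, not Kopia’s actual algorithm: the `sweep` function, the blob names, and the one-hour minimum age are hypothetical. The maintenance process works from a stale view of references (for example, an outdated cache or an incomplete listing), so a blob that another client has just written looks unreferenced.

```python
def sweep(blobs, referenced, now, min_age):
    """Return sorted IDs of blobs that look unreferenced AND are older than min_age.

    blobs maps blob ID -> write time in seconds; referenced is the set of
    blob IDs the maintenance process can currently see references to.
    Hypothetical helper for illustration; not part of Kopia.
    """
    return sorted(
        blob_id
        for blob_id, written_at in blobs.items()
        if blob_id not in referenced and now - written_at > min_age
    )


now = 10_000
blobs = {
    "p-old-garbage": 1_000,   # really unreferenced, written long ago
    "p-live": 2_000,          # referenced and visible as such
    "p-just-written": 9_990,  # referenced, but the reference is not visible yet
}
# Stale view: the reference to "p-just-written" has not propagated yet.
visible_refs = {"p-live"}

# No safety margin: the fresh blob is swept together with the real garbage.
deleted_unsafe = sweep(blobs, visible_refs, now, min_age=0)
# With a margin (1 hour, an assumed value), only the old garbage is swept.
deleted_safe = sweep(blobs, visible_refs, now, min_age=3_600)

print(deleted_unsafe)
print(deleted_safe)
```

In this model the fresh blob is not lost forever to compaction: once the reference becomes visible it is treated as live, and if it ever becomes garbage a later maintenance run will collect it after it has aged past the margin.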
