Best method to ensure valid snapshots: snapshot verify vs snapshot fix invalid-files

My understanding is that kopia snapshot verify --verify-files-percent 100 will download all files in a snapshot and ensure that each file's backup in the snapshot is valid. But if the backup of a file is not valid, the verify command does nothing except notify you of the error. In contrast, there is kopia snapshot fix invalid-files --verify-files-percent 100, which will remove backups of invalid source files from snapshots.

This leads me to an important question, which I hope someone can help with. Does fix invalid-files also verify whether the backup of a source file is valid? In other words, is fix invalid-files doing a source-to-target check and removing all targets with an invalid source? Or is it verifying that all source files are properly backed up in a snapshot on the target?

My goal is to ensure that my source files have valid backups in target snapshots. Any tips on how to do that are greatly appreciated. Right now, I have kopia snapshot verify --verify-files-percent 100 running monthly, but I am still not sure what I would need to do if the verify command found errors. Should I be running kopia snapshot fix invalid-files --verify-files-percent 100 instead of kopia snapshot verify --verify-files-percent 100?

There are many verification methods, depending on what you need.

In order from lowest to highest level:

  1. kopia content verify - will ensure that content manager index structures are correct and that every index entry is backed by an existing file

  2. kopia content verify --download-percent=10 - same as above, but will download 10% of random contents and ensure they can be decrypted properly

  3. kopia snapshot verify - will ensure that directory structures in the repository are consistent by walking all files and directories in snapshots from their roots and performing the equivalent of kopia content verify on all contents required to restore each file, but it does not download the files

  4. kopia snapshot verify --verify-files-percent 10 - same as #3, but will also download a random 10% of all files; this ensures that decryption and decompression are correct.

  5. kopia snapshot fix invalid-files [--verify-files-percent] - performs exactly the same verification as kopia snapshot verify (3&4) for all practical purposes, but it will also write fixed directory entries and manifests.

As of today, it is not recommended to run snapshot fix automatically; only run it when #3 or #4 detects a problem.
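A minimal sketch of that verify-first, fix-on-failure flow, assuming kopia snapshot verify exits with a non-zero status when it finds errors (worth confirming on your version) and that --commit is only added after you have reviewed the dry-run output:

    #!/bin/sh
    # Verify a random 10% of files; escalate only if verification fails.
    if ! kopia snapshot verify --verify-files-percent=10; then
        # Dry run: shows which directory entries/manifests would be rewritten.
        kopia snapshot fix invalid-files --verify-files-percent=10
        # After reviewing the output, re-run with --commit to actually write the fixes:
        # kopia snapshot fix invalid-files --verify-files-percent=10 --commit
    fi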

So where does the difference between 1/2 and 3/4 practically matter, you might ask? Imagine a world where some index blobs are deleted and the corresponding pack blobs are deleted too. In this case 1/2 will succeed (because the index structures are still internally consistent), but if the blobs were needed for some snapshot, 3/4 will fail.

It might be worth adding that full maintenance (which happens automatically in both CLI and UI) does #3 every time as part of mark-and-sweep garbage collection, so technically running verification is optional today, as it will be happening regularly anyway. Paying attention to maintenance status is not optional in such cases, though, and it's quite hard to observe - this will be improved through a notification mechanism in future versions.
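One way to keep an eye on that status from the CLI today (run on the machine that owns maintenance for the repository) is:

    $ kopia maintenance info

which prints the maintenance owner, the quick/full schedule, and recent runs, so you can at least spot when maintenance has stopped completing.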

Again, all this depends on the type and value of data, tolerance for risk, and cost of data storage, but I would generally recommend running something like this daily:

$ kopia snapshot verify --verify-files-percent 1

This will perform a full repository scrub every 100 days on average. With tons of data, perhaps decrease that to 0.3 to get a full scrub every 300 days. Some folks may even want to perform test restores - again, it all depends on the use case.
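If you want to schedule that, a minimal sketch of a cron entry is below. It assumes the repository is already connected for this user, the password is available non-interactively (e.g. persisted by kopia repository connect), and the log path is just an example:

    # Hypothetical crontab entry: verify a random 1% of files every night at 02:30.
    30 2 * * * /usr/bin/kopia snapshot verify --verify-files-percent=1 >> /var/log/kopia-verify.log 2>&1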

Thanks, this is very helpful.

But why is snapshot fix not recommended to run automatically? If fix essentially does the same as verify except it also fixes errors that are found, what is the downside to running fix regularly? Is the fix process not foolproof?

Thank you very much for the explanation!

Does this mean that absolutely every file will be checked over 100 days?
I'm asking because, according to the help, --verify-files-percent checks files randomly:

    --verify-files-percent=0   Randomly verify a percentage of files by
                               downloading them [0.0 .. 100.0]

No, there is no guarantee that all files will be checked unless you do --verify-files-percent=100. Doing --verify-files-percent=1 every day for 100 days will not check 100% of the files, since the command grabs files at random and it is not doing any sort of record keeping in terms of which files were grabbed yesterday or last time. Over time, you will likely be checking a large percent of your files due to the randomness, but there is no guarantee.

Yes, that was my understanding too; that's why I was confused by this phrase:

According to scie… my googling of science, it will sort-of cover 64% of the data if done for 100 days.
Or inversely, the chance of any one percent of your data not getting tested over 100 days is around 36%

Still, over a year or so, most of the data should get tested, while not spending too much time every day doing the validations.

And the math for "over a year" says the chance of any one percent of your data not getting tested would be only about 2.5%, which seems like pretty good odds of covering it all without excessive tests.
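For anyone who wants the arithmetic behind those numbers, this is the back-of-the-envelope estimate, assuming each day's 1% sample is drawn independently of previous days:

\[
(1 - 0.01)^{100} \approx 0.366, \qquad (1 - 0.01)^{365} \approx 0.025
\]

i.e. roughly 37% of files are never sampled after 100 daily runs (about 63% coverage), and roughly 2.5% are never sampled after a year.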

What am I missing here? Should I run this command again with the --commit flag?

kopia snapshot fix invalid-files
Listing blobs…
10000 blobs…
20000 blobs…
…
700000 blobs…
710000 blobs…
Listed 713203 blobs.
Listing all snapshots…
Processing snapshot User@desktop-kh0mk6g:G:
2024-08-16 22:23:24 CDT replaced manifest from 16a59d079cf30f8f5354773941b9e176 to 16a59d079cf30f8f5354773941b9e176
diff k12993d26008a0ef69cef23a35390073f k73d24d6a36e8d3244b343cba2d6f4b1a
2024-08-22 14:22:45 CDT replaced manifest from 341784e762a3212ec0a34a6e4b8393e0 to 341784e762a3212ec0a34a6e4b8393e0
diff k12993d26008a0ef69cef23a35390073f k73d24d6a36e8d3244b343cba2d6f4b1a
2024-08-29 01:38:59 CDT replaced manifest from b7ec0c7e5d95915f123da589accc5ef4 to b7ec0c7e5d95915f123da589accc5ef4
diff k12993d26008a0ef69cef23a35390073f k73d24d6a36e8d3244b343cba2d6f4b1a
2024-08-30 17:13:08 CDT replaced manifest from f0ea9967a9e3af8f4d77cca7714dee6c to f0ea9967a9e3af8f4d77cca7714dee6c
diff kd2e1a9e53a393a39947bf1858a3a2ee1 k563dde8d4e5424e510c9b49a3dddba61
2024-08-31 15:37:34 CDT replaced manifest from 3d3756f1a44ad3a363c7a860f3b86600 to 3d3756f1a44ad3a363c7a860f3b86600
diff kd2e1a9e53a393a39947bf1858a3a2ee1 k563dde8d4e5424e510c9b49a3dddba61
2024-08-31 16:32:16 CDT replaced manifest from 779a322eb5d2667b1666ef3c4a29f848 to 779a322eb5d2667b1666ef3c4a29f848
diff kd2e1a9e53a393a39947bf1858a3a2ee1 k563dde8d4e5424e510c9b49a3dddba61
2024-08-31 21:51:42 CDT replaced manifest from b1996041602d2bd29f5a866550890ddb to b1996041602d2bd29f5a866550890ddb
diff kd2e1a9e53a393a39947bf1858a3a2ee1 k563dde8d4e5424e510c9b49a3dddba61
2024-09-01 23:52:02 CDT replaced manifest from 80ce8eb93c4148da1c90b8a5e0520ee4 to 80ce8eb93c4148da1c90b8a5e0520ee4
diff kd2e1a9e53a393a39947bf1858a3a2ee1 k563dde8d4e5424e510c9b49a3dddba61
2024-09-13 20:12:03 CDT replaced manifest from d8f8885e14a4d9acf430a3dffd6d1211 to d8f8885e14a4d9acf430a3dffd6d1211
diff kf290ab52553cfd8b53f6d4938672100c k8c76441f5091bdefe6bd815860904095
2024-09-24 20:09:02 CDT replaced manifest from 0353c34668b4bb1b6c59e347ff13f311 to 0353c34668b4bb1b6c59e347ff13f311
diff k55b5cba2c1625773a522573c8f18e51c k6a9ad5afbfb62fe1fa27ac55fb3c316b
Processing snapshot User@desktop-kh0mk6g:M:\Backup
2024-08-22 14:29:27 CDT unchanged (b2f7d8135cb63eaa18c5a53f88b09d5a)
2024-08-29 00:18:32 CDT unchanged (964193bc964d408568e31d805cbbe6b4)
2024-09-05 00:28:32 CDT unchanged (9a4c2ae58fc74f216eae626dc5bea7d8)
2024-09-11 00:26:16 CDT unchanged (2bf26434b3292dff8d74639df6d2dabe)
2024-09-13 07:32:04 CDT unchanged (ef7deb524a292be0b01e8c3aa3ec4353)
2024-09-15 19:44:44 CDT unchanged (3a9498714b1625e3c8e6fd1b7c7b4ab7)
2024-09-20 06:12:31 CDT unchanged (e2a451293cd1b0446bd7b335b60e35ba)
2024-09-24 14:54:33 CDT unchanged (fb7c1a710301fd2a632d11c265acac1c)
2024-10-02 20:59:16 CDT unchanged (6b6fe7741cec4bc4f0745ef5caa89f17)
2024-10-07 22:44:25 CDT unchanged (d8115f53a1c7eec662b85703eeddbeb1)
2024-10-07 23:29:26 CDT unchanged (46bd29366fa808620ef04f7c5ff43a2e)
Processing snapshot User@desktop-kh0mk6g:S:\Archive
2024-08-17 00:35:37 CDT unchanged (5c9e09045ca6c8ad6d8b4955ac222e10)
2024-08-22 13:27:35 CDT unchanged (0c3eabf74a31b5ce14231c5db5b5d45f)
2024-08-28 23:08:20 CDT unchanged (06bb9399c06ba961741d9bac51ba6af0)
2024-09-10 16:12:28 CDT unchanged (5ae515d6de4a9e99541fe4c67f1f9906)
2024-09-13 09:00:27 CDT unchanged (4e19a769a05f27bc1505b74e67a8d2b5)
2024-09-13 11:31:03 CDT unchanged (44a8713285f371b3e15fc7dbb76f916f)
2024-09-24 18:25:50 CDT unchanged (aec9bf8fa6862caf0b99e442e4bdf75c)
2024-09-24 19:10:52 CDT unchanged (27589e8d9405d3336d954976d9e6dbbe)
Fixed 10 snapshots, but snapshot manifests were not updated. Pass --commit to update snapshots.

Yes, that is what is stated on the last line.
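In other words, that run was a dry run; nothing was written. To actually update the snapshot manifests, re-run the same command with the flag added (a sketch; the blob-listing phase will run again):

    $ kopia snapshot fix invalid-files --commit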

@jkowalski -

  1. Would it be hard to implement a way to know the date each chunk/file was last verified? Snapraid, for example, does this. You can “scrub” (verify) blocks by % or by x days since last verification, or a combination of both. When dealing with a 60TB pool, it’s kind of a necessity.
    For Kopia, I have it backing up to a filesystem repository on a remote server, currently around 6TB. Doing a full 100% file verification is pretty impractical. I have already tried a few times, and even on a fast pipe (35MB/s) it takes upwards of 3 days, and I haven't been able to successfully complete one because of network issues or who-knows-what. Not to mention that for most people, sending multiple TB across the network in one continuous stream isn't even an option.

  2. Does this mean there is literally no way for Kopia to guarantee that all my data is healthy and retrievable for restore unless it downloads the entire repository and locally verifies each file (essentially locally restoring the entire repo), and the only means of doing that is a single command that has to run continuously until completed with no pause/resume/checkpoint feature to pick up where it left off?

  3. Can you please help me understand why this is true? :arrow_down:

ps - Your top post breaking down the tools for verification is super helpful. I’d recommend adding it to the official documentation somewhere, probably here

I understand your predicament, just wanted to add a small point. Even if the "read 1% per day" approach isn't going to cover 100% after 100 days, something along those lines is what you will have to do if you want as much as possible of:

  1. You can’t DL the whole lot (understandable)
  2. You would like (quote) "Kopia to guarantee that all my data is healthy"
  3. Data gets replaced daily within some timeframe

Even if it feels like nitpicking, and I do support whatever idea we can come up with to make checks restartable, checkpointable, or in some other fashion able to verify most/all data, there is still a matter of chance involved: even if you knew 60 days ago that all data was 100% OK then, that validation could be useless for some percentage of the data the next day, if maintenance expired that data. For a moving target it becomes very tedious to always have everything validated, especially without downloading it. Some kind of check needs to run somewhere, and with the client-based design of kopia, that will have to be on some machine you provide in some manner.

If we checked all blobs today, then two months pass and we want to check everything in between, we would need to run over all the data from daily-0 to daily-59. Even if most or some of the data in these snapshots didn't change from what is now daily-60, which we did validate before, we can't be certain that some (static) blob which all the dailies reference didn't get broken 30 days in without actually downloading it somewhere and checking it.
My example uses long timeframes, but even if you change this from 30/60 days to today/yesterday, it still means that if I back up a 30G machine and send 500M, some machinery would need to download and validate the other 29.5 gigs of data which I didn't have to send, if I am to know that this box can be fully restored. And the same tomorrow. No matter how little it sends over, it will always be required to read 30G minus today's data to really make sure it is still valid. Or you are going to have to chance it, and this is basically what the "validate X% per day" approach does: scatter reads around and hope to find a problem before the data is actually needed, so you can react to it.

Even if we stash the backup in S3 (which can give you MD5s of the uploaded objects), make a list of objects and their checksums, and keep it for, say, 60 days, knowing that the blob contents are unbroken would still require downloading them. The chances that they are broken are super low, but actually knowing would require it.

As I said, I’m not trying to shoot down effort, just visualizing what actually can be done on this kind of “client-based” backup system within the demands of “wanting to know that it all is ok” and “not downloading a lot”. It is a tough balance there.

I hear you (a tough balance for sure) and thanks for the conversation, but I think my personal threshold for considering something "verified" accepts the inherent risks involved in all data storage and retrieval, risks that are no less present with other backup solutions.

You can take your argument to its conclusion and say that even one second after a specific file has been verified, it is no longer guaranteed to still be in that verified state. And I understand that Kopia alone can’t provide verification without reading back the data. This is why I want it to record when any one piece of data was last verified - so I can define my own risk tolerance. Right now, I can either (A) choose how many times I roll my 600,000-sided dice hoping that eventually, over multiple sessions, it’ll land on all of them, or (B) land on each of the 600,000 sides in one sitting (with no bathroom breaks!).

I’m asking for a completely different paradigm which would allow me to define the level of risk I’m willing to take on based on:

  1. the medium the repo is stored on
  2. how much and how often data changes on the repo
  3. of the data currently on the repo, how much of it is verified and how long ago that verification happened

My Kopia backups vary quite a bit. Some take snapshots every 20 minutes, others once a month. I would love the option to have Kopia verify (and auto-heal) data that hasn’t been verified in “x” days, which I would set according to each scenario.

If I’m backing up to a ZFS pool I control with data that changes infrequently, I might not feel the need to constantly verify remotely via Kopia since there’s a pretty high chance that if Kopia said the data was good a month ago, it’s still good.

Going back to Snapraid, on my 60TB array, which is mostly for archival and holds data which does not change often, my threshold is 180 days – any block that hasn’t been scrubbed in 180 days gets scrubbed again.
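For comparison, this is roughly what that policy looks like as a SnapRAID invocation, sketched from memory (the -p/--plan percentage and -o/--older-than options; check snapraid's manual for the exact semantics on your version):

    # Hypothetical: scrub about 3% of the array per run, limited to blocks
    # whose last scrub is older than 180 days.
    snapraid -p 3 -o 180 scrub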

Also, your description of adding 500M to a previously-verified 30G repo requiring the remaining 29.5G to be re-verified is a different level of verification, beyond what I'm suggesting and what is currently offered. Our concepts could possibly be combined in interesting ways (e.g. verification being presented to the user at the snapshot level, meaning a --min-age=x flag would represent the days since each individual snapshot was last verified, but possibly still tracked at the repo data level to minimize egress, since there is so much overlap in which underlying repo data makes up each snapshot). This is complete theory for me though, and really for @jkowalski to debunk.

Zooming out, I get it. This is a small, relatively young open-source project, and improving this one (important) feature would only make a meaningful difference for what I'm guessing is a small subset of users. Kopia shouldn't even be used in a professional environment anyway, and often couldn't be because of this very issue, amongst others – try telling a client that you cannot provide proof of verification unless they want to pay insane egress fees. But I use it this way :face_with_spiral_eyes: :face_with_spiral_eyes: :face_with_spiral_eyes:

Boiling it down,

  1. Is it possible to implement a method of recording a verification history of sorts, or otherwise improving the verify/repair toolset toward my goals?
  2. Is my understanding of Kopia’s current tools even correct?
  3. Why is it not recommended to run snapshot fix, and instead do snapshot verify? There's something I must be missing, because if it finds an issue, why not just fix it?! I wouldn't bother poking at this if the notification options were better. There is some more learning I could do on my end to set up better monitoring outside of Kopia, but this is the only piece of my data management that doesn't ping me in one way or another if there are issues, so I'm sensitive to this!

A small nit here is that I have my kopia repo in an S3 store with VMs close by network-wise, so I don't actually pay transfer fees. So for me, and others who store their repo on hardware they own and control, doing a validation run is actually feasible, at least until the repo takes more than 23-something hours to check. I do get that not everyone can do this, but there are places where internal traffic is free, and for those who have it like that, being very strict about validation is still possible. It's a good discussion to have, to see what options there are and what kopia can aim to offer.

Also this: ZFS can scrub its pools regularly (and resilver if requested), rereading your data and rechecking every byte against its checksums, over and over. It is a very good choice for storing your backups on, since you can sort of offload a major part of the validation to the filesystem and decide to trust it.

It's just that ZFS won't scrub regularly on its own… you'll have to schedule it or run it manually. Nonetheless… ZFS is one of the best file systems available today.
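For completeness, a minimal sketch of scheduling that yourself, assuming a pool named tank (hypothetical) and a standard cron setup; many distros already ship an equivalent cron job or systemd timer with their ZFS packages:

    # Hypothetical crontab entry: scrub the pool "tank" at 03:00 on the 1st of each month.
    0 3 1 * * /sbin/zpool scrub tank
    # Check the result later with: zpool status tank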