Kopia repository 'corrupted itself' (unable to add XXX to index blob cache: context canceled)

I had a working Kopia repo with around 100 TB of data. Everything was going well, and it had not been touched for two weeks or so.

I kicked off a small incremental backup today, which completed successfully:

 * 0 hashing, 338337 hashed (878.7 GB), 28078694 cached (94.8 TB), uploaded 876.2 GB, estimated 22 TB (435.2%) 0s left
Created snapshot with root k3ad2929159e95246105260ef08b97a3d818b4fc02c580e8816fefe8413946756 and ID 3bee2c042f4de4cb7088f9ef616462e8 in 36m59s

However, after a while (maybe when maintenance kicked in?) it was no longer possible to interact with the repository. After re-connecting, it spams the following errors:

 kopia repository connect filesystem --path /data/kopia
Enter password to open repository:

ERROR unable to add xs3_c3ec61601cc770f7bbdaf48949515c01ef2c7616d2e5d26ce549110a68298375-s0297ba3eb70c5c51-c2 to index blob cache: context canceled
ERROR unable to add xs3_3e37d91bf794353e5fdd522fd07d216e4e9c3c6a7aed025214bca98596137665-s0297ba3eb70c5c51-c2 to index blob cache: context canceled
ERROR unable to add xn5_2ed5e36ce5ab7448c3628a8cc50992f341c606333fbd81118a6279b3a0e1c455-s0352ce35a4f9164f118-c1 to index blob cache: context canceled
ERROR failed to open repository: unable to create shared content manager: error loading indexes: unable to load pack indexes despite 10 retries
ERROR error connecting to repository: unable to create shared content manager: error loading indexes: unable to load pack indexes despite 10 retries

The server / processes were not restarted or touched in any way, and the repository is backed by 'high-quality' DAS storage + an XFS filesystem - no signs of corruption at this level or anything like that.

What would be the process to recover the data / repo here? There is a kopia repository repair command, but that one results in:

looking for format blob...
format blob already exists, not recovering, pass --recover-format=yes

I stopped at this point, fearing more corruption… Please advise.

Wondering if anyone has any ideas here? The documentation (Verifying Validity of Snapshots/Repositories and Trying to Repair Any Corruption | Kopia) seems rather sparse on details (commands to use / process to follow) beyond mentioning 'many low-level data recovery tools' being available, and none of the suggestions (empty files, wrong timestamps, purging cache, etc.) help or apply here.
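(For reference, the empty-file and wrong-timestamp checks from the docs boil down to something like the following - a minimal sketch, assuming the repo lives at /data/kopia as above:

# any zero-byte blob files would point at filesystem-level truncation
find /data/kopia -type f -size 0

# any files with modification times in the future would point at clock skew
find /data/kopia -type f -newermt "$(date)"

Neither turns up anything here.)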

Even some basic pointers on where to start and how to approach this corruption would be super useful.

Frankly I have no idea, but this error:

… suggests something about the cache (this type of file resides in the cache) - did you try deleting the cache directory completely?

Yes, I purged the cache dir completely (but kept a copy, just in case) - sadly, this did not make any difference.
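(For completeness, the purge was essentially this - kopia cache info reports where the cache lives, and moving it aside rather than deleting keeps a way back:

# show the cache directory and its size
kopia cache info
# move it aside instead of deleting it
mv ~/.cache/kopia ~/.cache/kopia.bak

~/.cache/kopia is the default location on Linux; adjust to whatever cache info reports.)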

Content logs from the connection attempt to the kopia local filesystem repository:

2023-02-01T13:31:54.364383Z refreshing committed state because it's no longer valid (now 2023-02-01 13:31:54.364376205 +0000 GMT, valid until 0001-01-01T00:00:00Z)
2023-02-01T13:31:54.364527Z refreshAttemptLocked
2023-02-01T13:31:54.365119Z ranges: []
2023-02-01T13:31:54.365335Z current epoch 6, uncompacted epoch sets 2 3 0, valid until 2023-02-01T13:51:54.364533613Z
2023-02-01T13:31:54.365340Z not generating range checkpoint
2023-02-01T13:31:54.365511Z adding incremental state for epochs 0..7 on top of []
2023-02-01T13:31:54.365585Z Complete Index Set for [0..7]: [xs0_794114b73cf07f6c6e2b60e40bfd1637748edd902ccbb155742260c3ca1b167f-s547ec367fb6534fe-c2 xs0_304ec9212a8b0aad4262621d4141e1117d27e22c81ee1802d5d7fe58d7482ee9-s547ec367fb6534fe-c2 xs1_c79f6cba6334ad6b0c1ea707a70f328c41de86f9c2bd9c9ac91334f5ee90f955-s3b1ed804d6a342ba-c2 xs1_bb3f7bf80e448a582f848d31420de0fa3331276890417eca181ef3d301267858-s3b1ed804d6a342ba-c2 xs2_55cd477495f12d87e7dc1a5aef0b38b3e4927855a22ed3231745c69bb890dc0d-s70a63d9cf3c4694d-c1 xs3_c3ec61601cc770f7bbdaf48949515c01ef2c7616d2e5d26ce549110a68298375-s0297ba3eb70c5c51-c2 xs3_3e37d91bf794353e5fdd522fd07d216e4e9c3c6a7aed025214bca98596137665-s0297ba3eb70c5c51-c2 xs4_d6dc97c7779b18aeb0987dc544cd4959f1de0d7844ab0dc0c360eec897337911-s51e5da9034927876-c1 xn5_f13ce08c7e6bdf434785943ffe375812355f96f2540017abdc4930dc31a0c7d1-s7df0d3ff5acf629b119-c1 xn5_2ed5e36ce5ab7448c3628a8cc50992f341c606333fbd81118a6279b3a0e1c455-s0352ce35a4f9164f118-c1 xn6_0be64812ca3ec308fac8935e561226382d1ac0777eab4bc118947bcf685f98a8-s0bddba24744055f0119-c1 xn6_a63612afa987ecd27c5ea2a19b0c353c07b0a360e656a0f1af088f500436a71d-se3a9aac4e8465d78119-c1 xn6_b6f63236f5c648460c93d0fef1b0874f8d2b5fcb61158b39437c95f95dc78ae8-sfeb3ea1a7ff70853119-c1], deletion watermark 2023-01-09 16:25:14 +0000 GMT
2023-02-01T13:31:54.365628Z active indexes [xs0_794114b73cf07f6c6e2b60e40bfd1637748edd902ccbb155742260c3ca1b167f-s547ec367fb6534fe-c2 xs0_304ec9212a8b0aad4262621d4141e1117d27e22c81ee1802d5d7fe58d7482ee9-s547ec367fb6534fe-c2 xs1_c79f6cba6334ad6b0c1ea707a70f328c41de86f9c2bd9c9ac91334f5ee90f955-s3b1ed804d6a342ba-c2 xs1_bb3f7bf80e448a582f848d31420de0fa3331276890417eca181ef3d301267858-s3b1ed804d6a342ba-c2 xs2_55cd477495f12d87e7dc1a5aef0b38b3e4927855a22ed3231745c69bb890dc0d-s70a63d9cf3c4694d-c1 xs3_c3ec61601cc770f7bbdaf48949515c01ef2c7616d2e5d26ce549110a68298375-s0297ba3eb70c5c51-c2 xs3_3e37d91bf794353e5fdd522fd07d216e4e9c3c6a7aed025214bca98596137665-s0297ba3eb70c5c51-c2 xs4_d6dc97c7779b18aeb0987dc544cd4959f1de0d7844ab0dc0c360eec897337911-s51e5da9034927876-c1 xn5_f13ce08c7e6bdf434785943ffe375812355f96f2540017abdc4930dc31a0c7d1-s7df0d3ff5acf629b119-c1 xn5_2ed5e36ce5ab7448c3628a8cc50992f341c606333fbd81118a6279b3a0e1c455-s0352ce35a4f9164f118-c1 xn6_0be64812ca3ec308fac8935e561226382d1ac0777eab4bc118947bcf685f98a8-s0bddba24744055f0119-c1 xn6_a63612afa987ecd27c5ea2a19b0c353c07b0a360e656a0f1af088f500436a71d-se3a9aac4e8465d78119-c1 xn6_b6f63236f5c648460c93d0fef1b0874f8d2b5fcb61158b39437c95f95dc78ae8-sfeb3ea1a7ff70853119-c1] deletion watermark 2023-01-09 16:25:14 +0000 GMT
2023-02-01T13:31:54.365751Z Downloading 13 new index blobs...
2023-02-01T13:32:02.263573Z encountered NOT_FOUND when loading, sleeping 100ms before retrying #1
2023-02-01T13:32:02.364135Z adding incremental state for epochs 0..7 on top of []

(the Complete Index Set for [0..7]: part then loops 9 times, each attempt ending with an encountered NOT_FOUND when loading error, up to sleeping 25.6s before retrying #9)

Also, if you check one of the files it complains about during the repository connect command:

ERROR unable to add xs4_d6dc97c7779b18aeb0987dc544cd4959f1de0d7844ab0dc0c360eec897337911-s51e5da9034927876-c1 to index blob cache: context canceled

The file actually exists and seems to be 'fine' (not zero-sized or anything like that):

# ll "/data/kopia/x/s4/_d/6dc97c7779b18aeb0987dc544cd4959f1de0d7844ab0dc0c360eec897337911-s51e5da9034927876-c1.f" -h
-rw------- 1 root root 23M Jan 31 17:17 /data/kopia/x/s4/_d/6dc97c7779b18aeb0987dc544cd4959f1de0d7844ab0dc0c360eec897337911-s51e5da9034927876-c1.f
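(Side note: the filesystem backend shards the blob ID across directory levels - xs4_d6dc9… becomes x/s4/_d/6dc9….f - and the shard widths are not uniform across blobs, so the easiest way to map a blob ID from the logs to its file is to search on the unsharded tail of the ID. A rough sketch:

blob=xs4_d6dc97c7779b18aeb0987dc544cd4959f1de0d7844ab0dc0c360eec897337911-s51e5da9034927876-c1
# the last characters of the ID survive intact in the filename, whatever the sharding
find /data/kopia -name "*${blob: -20}*" -ls
)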

I was able to recover from this:

  1. Located files for all the 'active indexes' from the content log file:
2023-02-01T13:31:54.365628Z active indexes [xs0_794114b73cf07f6c6e2b60e40bfd1637748edd902ccbb155742260c3ca1b167f-s547ec367fb6534fe-c2 xs0_304ec9212a8b0aad4262621d4141e1117d27e22c81ee1802d5d7fe58d7482ee9-s547ec367fb6534fe-c2 xs1_c79f6cba6334ad6b0c1ea707a70f328c41de86f9c2bd9c9ac91334f5ee90f955-s3b1ed804d6a342ba-c2 xs1_bb3f7bf80e448a582f848d31420de0fa3331276890417eca181ef3d301267858-s3b1ed804d6a342ba-c2 xs2_55cd477495f12d87e7dc1a5aef0b38b3e4927855a22ed3231745c69bb890dc0d-s70a63d9cf3c4694d-c1 xs3_c3ec61601cc770f7bbdaf48949515c01ef2c7616d2e5d26ce549110a68298375-s0297ba3eb70c5c51-c2 xs3_3e37d91bf794353e5fdd522fd07d216e4e9c3c6a7aed025214bca98596137665-s0297ba3eb70c5c51-c2 xs4_d6dc97c7779b18aeb0987dc544cd4959f1de0d7844ab0dc0c360eec897337911-s51e5da9034927876-c1 xn5_f13ce08c7e6bdf434785943ffe375812355f96f2540017abdc4930dc31a0c7d1-s7df0d3ff5acf629b119-c1 xn5_2ed5e36ce5ab7448c3628a8cc50992f341c606333fbd81118a6279b3a0e1c455-s0352ce35a4f9164f118-c1 xn6_0be64812ca3ec308fac8935e561226382d1ac0777eab4bc118947bcf685f98a8-s0bddba24744055f0119-c1 xn6_a63612afa987ecd27c5ea2a19b0c353c07b0a360e656a0f1af088f500436a71d-se3a9aac4e8465d78119-c1 xn6_b6f63236f5c648460c93d0fef1b0874f8d2b5fcb61158b39437c95f95dc78ae8-sfeb3ea1a7ff70853119-c1] deletion watermark 2023-01-09 16:25:14 +0000 GMT
-rw------- 1 root root 728M Jan  6 17:34 /data/kopia/x/s0/_3/04ec9212a8b0aad4262621d4141e1117d27e22c81ee1802d5d7fe58d7482ee9-s547ec367fb6534fe-c2.f
-rw------- 1 root root 621M Jan  8 17:22 /data/kopia/x/s1/_c/79f6cba6334ad6b0c1ea707a70f328c41de86f9c2bd9c9ac91334f5ee90f955-s3b1ed804d6a342ba-c2.f
-rw------- 1 root root 622M Jan  8 17:23 /data/kopia/x/s1/_b/b3f7bf80e448a582f848d31420de0fa3331276890417eca181ef3d301267858-s3b1ed804d6a342ba-c2.f
-rw------- 1 root root 774M Jan 10 12:56 /data/kopia/x/s2/_5/5cd477495f12d87e7dc1a5aef0b38b3e4927855a22ed3231745c69bb890dc0d-s70a63d9cf3c4694d-c1.f
-rw------- 1 root root 442M Jan 13 18:12 /data/kopia/x/s3/_c/3ec61601cc770f7bbdaf48949515c01ef2c7616d2e5d26ce549110a68298375-s0297ba3eb70c5c51-c2.f
-rw------- 1 root root 442M Jan 13 18:12 /data/kopia/x/s3/_3/e37d91bf794353e5fdd522fd07d216e4e9c3c6a7aed025214bca98596137665-s0297ba3eb70c5c51-c2.f
-rw------- 1 root root 23M Jan 31 17:17 /data/kopia/x/s4/_d/6dc97c7779b18aeb0987dc544cd4959f1de0d7844ab0dc0c360eec897337911-s51e5da9034927876-c1.f
-rw------- 1 root root 5.7M Jan 31 16:07 /data/kopia/x/n5_/f13ce08c7e6bdf434785943ffe375812355f96f2540017abdc4930dc31a0c7d1-s7df0d3ff5acf629b119-c1.f
-rw------- 1 root root 5.0M Jan 13 18:14 /data/kopia/x/n5/_2/ed5e36ce5ab7448c3628a8cc50992f341c606333fbd81118a6279b3a0e1c455-s0352ce35a4f9164f118-c1.f
-rw------- 1 root root 6.1M Jan 31 16:34 /data/kopia/x/n6_/0be64812ca3ec308fac8935e561226382d1ac0777eab4bc118947bcf685f98a8-s0bddba24744055f0119-c1.f
-rw------- 1 root root 7.0M Jan 31 16:27 /data/kopia/x/n6_/a63612afa987ecd27c5ea2a19b0c353c07b0a360e656a0f1af088f500436a71d-se3a9aac4e8465d78119-c1.f
-rw------- 1 root root 6.7M Jan 31 16:17 /data/kopia/x/n6_/b6f63236f5c648460c93d0fef1b0874f8d2b5fcb61158b39437c95f95dc78ae8-sfeb3ea1a7ff70853119-c1.f
  2. Then got rid of all the files created on the day the corruption happened (Jan 31) - this made the connect command work again, and it looks like verification is passing now (see the sketch below).
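To make this reproducible, the two steps boil down to roughly the following - a sketch of the idea, assuming the content log was saved to content.log, and reusing the tail-matching trick from above to locate each sharded file:

# 1. extract the active index blob IDs from the content log
grep -oE 'x[ns][0-9]+_[0-9a-f]+-s[0-9a-f]+-c[0-9]+' content.log | sort -u

# 2. confirm each one is present on disk
for b in $(grep -oE 'x[ns][0-9]+_[0-9a-f]+-s[0-9a-f]+-c[0-9]+' content.log | sort -u); do
    find /data/kopia/x -name "*${b: -20}*" -ls
done

# 3. list everything written on the day of the corruption, review it, then
#    move it into the quarantine dir (keeping the active index files above)
mkdir -p /data/kopia-quarantine
find /data/kopia -type f -newermt 2023-01-31 ! -newermt 2023-02-01

Quarantining into a separate directory rather than deleting keeps a way back, and running kopia snapshot verify afterwards is a good sanity check that the recovered repo is actually consistent.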

It is still a mystery how this condition happened in the first place - the server and storage are stable and did not have any problems.

After re-connecting to the recovered repo, I did notice a bunch of failed maintenance tasks, including one on the day of the corruption - unclear if this is related, though:

  snapshot-gc:
    2023-01-31 12:05:08 GMT (0s) ERROR: unable to find in-use content ID: error processing snapshot root: error verifying Ix6fcc3679fc96b91e84effb523990f5479df701bd16e921e83a0f305e02014ca6: unexpected content error: error getting cached content: failed to get blob with ID q1f8a40fa8e1be88f712e00370cb81466-s3ab80925658eccef118: BLOB not found
    2023-01-24 12:05:08 GMT (0s) ERROR: unable to find in-use content ID: error processing snapshot root: error verifying Ixfdca95967f07429129c8f2c84fb098a41c48238664a4f2936cf7108b752c20ee: unexpected content error: error getting cached content: failed to get blob with ID q517134a1daa58c4d3d10c8b5a1238813-sc12b759c64f56e46118: BLOB not found
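(For anyone hitting the same thing, these failed runs show up in:

kopia maintenance info

which lists the maintenance schedule and the recent runs of each task along with any errors.)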