[TESTERS NEEDED] Major improvements to repository index format

TL;DR

I’d like to let you know about some big and exciting features for v0.9 that have recently been merged. I could really use your help testing them ahead of the official release.

If you’re interested, read on.

New Features

The new format is designed to support two big features:

  1. content-level compression, which rewires the snapshot pipeline to perform compression after hashing and leaves room for future improvements, such as error-correcting codes (ECC)

  2. epoch-based index management, which will support append-only operation (soon)

The code at HEAD has both features, but they are disabled by default, and all new repositories will default to the old format until the features have been proven stable.

I'm looking for folks who are willing to try the new format on their repositories, either by using the in-place upgrade or by creating brand-new repositories just for this purpose. I'm looking for coverage of all kinds of repositories, big and small, with different levels of concurrency, policies, storage backends, etc.

There is a non-trivial amount of risk because the code is new and has been tested only in a very limited set of circumstances, yet it's critical to get as much real-world coverage as possible before declaring it stable for all users.

Using pre-release features

To create a repository using the new features, you need to use a build from the latest source or download a pre-release build from here or from the unstable RPM or APT channels.

You will need to pass two new flags on repository creation. UI-based repository creation using the new format is not supported yet; it will be added soon.

$ kopia repository create ... --index-version=2 --enable-index-epochs
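
For example, against a filesystem backend (the path below is just an illustrative placeholder):

$ kopia repository create filesystem --path=/path/to/repo --index-version=2 --enable-index-epochs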

To upgrade an existing repository to the latest format, use the following command (WARNING: this is a one-way operation; there's no way to revert to the old format yet).

$ kopia repository set-params --upgrade
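
To double-check that the upgrade took effect, kopia repository status prints the current repository parameters (the exact fields shown vary by build):

$ kopia repository status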

What to expect:

  • snapshots of a brand-new machine or directory containing files that already exist elsewhere should be faster to create if compression is used, because deduplication will happen before compression (compression is opt-in via policy; see the sketch after this list)

  • the number of index blobs in the repository will increase (currently the number of index blobs is kept small during maintenance); we will be retaining one index file per epoch (typically at most 4 epochs per day, usually much fewer), but we won't be rewriting indexes as aggressively as today

  • the total size of indexes may grow by 2x-3x, but they usually consume <0.1% of total repository size, so that's not really significant. On the plus side, if the repository gets corrupted, there will be a higher chance of data recovery using the redundant data.

  • the format should be stable and compatible with the upcoming v0.9, but if major bugs are found we may need to nuke the repositories and start again.
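
Note that the compression-related speedup only applies if compression is actually enabled; in Kopia that is controlled by policy. A minimal sketch, enabling it globally (zstd is one of the supported algorithms; pick whichever suits your data):

$ kopia policy set --global --compression=zstd   # --global applies to all snapshot sources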

What to look for and report:

  • any data loss or corruption
  • inconsistent view of repository across clients
  • unexpected behavior changes, such as things becoming dramatically faster or slower with no explanation
  • unexpectedly high or low memory or CPU usage
  • unexpected slowdowns when running CLI commands

Next steps:

Because epoch-based indexes are time-based, testing this will require some calendar time to pass.

I'd like to test this until the end of August, which should be enough time to go through between 100 and 200 epochs for most repositories with significant traffic (there will be at most 4 epochs per day, depending on activity).
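
If you want to watch epochs advance on your own repository, the epoch markers are stored as blobs with the xe prefix (as can be seen in the blob listing later in this thread), so listing them shows how far along you are:

$ kopia blob ls --prefix=xe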

In the meantime we’ll be focusing on polishing the features, adding more CLI commands for debugging and troubleshooting, improving logging, etc.

At the end of the test period (early September) we’ll collectively decide if the features are stable enough to be enabled by default in v0.9 for new repositories or if we’ll need to keep them in the test phase for longer.

The optimistic plan is:

  • v0.9 supports the new repository format by default; the old format can still be enabled using flags
  • v0.10 supports both the old and new formats and will prompt legacy users to migrate to the new format
  • v0.11 supports the new repository format only; the old index format is no longer supported

If you're interested in helping with testing, please like this post using the heart icon below, and discuss any issues here in this thread.


Ok, I am willing to help.

As I mentioned elsewhere, I just started using Kopia a few days ago.

My current usage is:

  1. Snapshots to a repository on two different external USB drives
  2. Snapshots to a pCloud repository which is mounted on my Ubuntu 20.04 LTS system at /home/manfred/pCloudDrive. Here it is important to know that my upload bandwidth is 10 Mbit/s

I think you know best how I could build up a test scenario.

Hi!

Creating a repository hangs on S3 when --enable-index-epochs is used (it works without it).

2021-07-08 11:00:57.9456429 [epoch-manager] refreshing committed state because it's no longer valid
2021-07-08 11:00:57.9457803 [epoch-manager] refreshAttemptLocked
2021-07-08 11:00:57.9796620 [epoch-manager] ranges: []
2021-07-08 11:00:58.0459255 [epoch-manager] refresh attempt failed: error loading uncompacted epochs: error listing uncompacted epochs: error listing uncompacted epoch 1: error listing all blobs: context canceled, sleeping 100ms before next retry
2021-07-08 11:00:58.0459734 [epoch-manager] refreshAttemptLocked
2021-07-08 11:00:58.0460474 [epoch-manager] ranges: []
2021-07-08 11:00:58.0462622 [epoch-manager] refresh attempt failed: error loading uncompacted epochs: error listing uncompacted epochs: error listing uncompacted epoch 1: error listing all blobs: context canceled, sleeping 150ms before next retry
2021-07-08 11:00:58.0462983 [epoch-manager] refreshAttemptLocked
2021-07-08 11:00:58.0463596 [epoch-manager] ranges: []
2021-07-08 11:00:58.0465534 [epoch-manager] refresh attempt failed: error loading uncompacted epochs: error listing uncompacted epochs: error listing uncompacted epoch 1: error listing all blobs: Get "https://s3.dualstack.eu-central-1.amazonaws.com/[bucket name]/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=xn1_": context canceled, sleeping 225ms before next retry
2021-07-08 11:00:58.0465942 [epoch-manager] refreshAttemptLocked
2021-07-08 11:00:58.0466577 [epoch-manager] ranges: []
2021-07-08 11:00:58.0469033 [epoch-manager] refresh attempt failed: error loading uncompacted epochs: error listing uncompacted epochs: error listing uncompacted epoch 1: error listing all blobs: Get "https://s3.dualstack.eu-central-1.amazonaws.com/[bucket name]/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=xn1_": context canceled, sleeping 337.5ms before next retry

(The errors in the log just keep repeating, no other errors show up)

Hope this helps!

Edit: This is on Linux x86_64, Kernel 5.12.14

OK, great!

I’ve just upgraded the kopia repository server and one of the clients to v20210706.0.213808. I will upgrade the other clients over the next week and perform restore tests in the next month.

PS: remember to set the --config-file flag, pointing to your config file, if you are upgrading. E.g.:

$ kopia repository set-params --upgrade --config-file /root/.config/kopia/repository.config

Update: I've updated all clients (Windows and Debian-based Linux).

Jarek, if you have any specific commands you would like us to run, e.g. to get the size of our repos, the number of content blocks, a specific benchmark, or whatever, please just mention it.

OK, I have built the new Kopia, which is now 20210708.0.0-62ad437b (build 62ad437bb6e5db327212354c43adf8c2eedcf7da).

Then I converted the repository on pCloud and will now run all my backups with that Kopia version.

As others have said, if you want us to run specific commands, just tell us.

There are no specific commands to run; just use Kopia regularly, create snapshots, delete them, etc. There should be no visible changes, but please report anything unusual.
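
For example, a minimal session could look like this (the path and manifest ID are placeholders):

$ kopia snapshot create /data/to/backup     # path is a placeholder
$ kopia snapshot list                       # note the snapshot manifest IDs
$ kopia snapshot delete <manifest-id>       # may need a confirmation flag on some builds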

Can you paste the log entries from just before it started looping? I'll have a fix to end the infinite loop, but there's some underlying issue there.

N/M, I was able to reproduce the issue (it happens only on S3, not on other providers, for some reason) and found a fix: epoch: handling of canceled context on refresh by jkowalski · Pull Request #1178 · kopia/kopia · GitHub. It should be merged shortly.

Awesome - creating a repository on S3 works now on latest master.

Thanks a lot for the quick fix and your work on Kopia!


@jkowalski, I had no issue updating my repos (local, local server, and S3). However, kopia repository sync-to s3 now throws an error when trying to synchronize my updated local server repo:

[root@kopia repos]# ./syncFastST7AtoS3.sh
Connected to repository.

NOTICE: Kopia will check for updates on GitHub every 7 days, starting 24 hours after first use.
To disable this behavior, set environment variable KOPIA_CHECK_FOR_UPDATES=false
Alternatively you can remove the file "/root/.config/kopia/repository.config.update-info.json".
Quick maintenance paused until 2021-07-11 12:28:36 CEST
Synchronizing repositories:
  Source:      Filesystem: /mnt/kopia/ST7A
  Destination: S3: s3.eu-central-1.wasabisys.com kopiast7a
kopia: error: destination repository contains incompatible data, try --help

This is what I got from the CLI log:

[root@kopia cli-logs]# cat kopia-20210711-141426-304559-repository-sync-to-s3.log
2021-07-11 14:14:26.800 D [logger.go:254] build version v20210710.0.220902, available v
2021-07-11 14:14:26.800 D [logger.go:254] no updated version available
2021-07-11 14:14:26.800 D [logger.go:254] password for /root/.config/kopia/repository.config retrieved from password file
2021-07-11 14:14:27.132 D [logger.go:254] [STORAGE] ListBlobs("xn1_")=<nil> returned 0 items and took 4.90114ms
2021-07-11 14:14:27.132 D [logger.go:254] [STORAGE] ListBlobs("xn0_")=<nil> returned 4 items and took 4.668142ms
2021-07-11 14:14:27.137 D [logger.go:254] [STORAGE] ListBlobs("xn1_")=<nil> returned 0 items and took 4.736503ms
2021-07-11 14:14:27.139 D [logger.go:254] [STORAGE] ListBlobs("xn0_")=<nil> returned 4 items and took 4.515305ms
2021-07-11 14:14:27.140 I [logger.go:244] Synchronizing repositories:
2021-07-11 14:14:27.140 I [logger.go:244]   Source:      Filesystem: /mnt/kopia/ST7A
2021-07-11 14:14:27.140 I [logger.go:244]   Destination: S3: s3.eu-central-1.wasabisys.com kopiast7a
2021-07-11 14:14:27.140 D [logger.go:254] [STORAGE] GetBlob("kopia.repository",0,-1)=({1078 bytes}, <nil>) took 27.961µs

You need to manually copy the updated kopia.repository to the destination location. After this, the sync should resume.
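
For example, with the source path and bucket from the log above, something like this should do it (a sketch assuming the AWS CLI is configured with your Wasabi credentials):

$ aws s3 cp /mnt/kopia/ST7A/kopia.repository s3://kopiast7a/kopia.repository --endpoint-url https://s3.eu-central-1.wasabisys.com   # endpoint and bucket taken from the log above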

Ahh… well, why didn't I think of it myself? This worked. Thanks.

Looks like applying the new format to my repos caused some "mysterious" growth in the source repo. I was wondering why my S3 repo sync hadn't finished this morning, and when I canceled the running one and performed a manual one, I noticed that the source repo had grown by about 172 GB, which now has to be synced up to my S3 bucket:

Quick maintenance paused until 2021-07-12 17:35:11 CEST
Synchronizing repositories:
  Source:      Filesystem: /mnt/kopia/ST7A
  Destination: S3: s3.eu-central-1.wasabisys.com kopiast7a
Looking for BLOBs to synchronize...
  Found 29977 BLOBs in the destination repository (698.2 GB)
  Found 37539 BLOBs (870.3 GB) in the source repository, 7570 (172.2 GB) to copy
  Found 7 BLOBs to delete (55.4 MB), 29969 in sync (698.1 GB)
Copying...
  Copied 306 blobs (6.9 GB), Speed: 8.2 Mbit/s, ETA: 44h57m37s (2021-07-14 13:25:50 CEST)

I checked the runs prior to last night, and those always got away with something like this:

Found 28042 BLOBs (654 GB) in the source repository, 278 (6.2 GB) to copy

Can you run kopia blob ls --prefix=x?

Here we go:

[root@kopia repos]# kopia blob ls --prefix=x
xe1                                                                            12 2021-07-11 18:30:56 CEST
xe2                                                                            12 2021-07-12 07:50:33 CEST
xn1_02d39c6afd0499a1cde082e3a68b1733-s1113524ea14e1397106-c1               270070 2021-07-11 20:45:21 CEST
xn1_14bb0c7fb4e9409da575047a42a0fc53-s2a5ee585e37de855106-c1               336430 2021-07-11 20:35:20 CEST
xn1_35f58ba1a5b1c17a840d061a53d92b0c-sbf6f5405e5f162ca106-c1              2176324 2021-07-11 19:35:23 CEST
xn1_4468f3e3c2c5fa88c7298a91a7743230-sa4f13d9174ca47fc106-c1                46277 2021-07-12 07:48:13 CEST
xn1_5b403dcbd463e2f22cc52ca7a40ccd17-s7dd6caa7c8b12fa3106-c1              2228736 2021-07-11 19:15:22 CEST
xn1_5f376349bcd25ae7ae3fb76aaa3903df-se4829fdda58e7ffd106-c1              2046452 2021-07-11 19:25:22 CEST
xn1_6c01621f769267c66d713173fa4492fb-s445d76d3b20aed0d106-c1               362714 2021-07-11 20:00:20 CEST
xn1_70b56ee19e6341b794405e8621a75937-s54637c9e30ac94fd106-c1               165302 2021-07-11 20:08:48 CEST
xn1_7927524da1c1fd5a317b1f1b6330c3b9-s1a38c66fdf56ed63106-c1              1525087 2021-07-11 21:02:20 CEST
xn1_8328d869dc25b95c3a1fc2447ad9d959-s377befc90be97a6e106-c1               822595 2021-07-11 21:38:29 CEST
xn1_9cdd9f46587e349ac8ea87c99071f674-s775dcee718cc6086106-c1               149053 2021-07-11 19:40:20 CEST
xn1_9f4991f7d5144761ddf5cd003769f2b3-s40e59bcbfb80a677106-c1               187446 2021-07-11 20:25:20 CEST
xn1_a070621d576c2c8d52cdf9cd9159694d-s6d0bdc246e0cffaa106-c1               430383 2021-07-11 20:20:31 CEST
xn1_b09b356bac1212a6e907661e862b32fa-sa3a1839ca2140d75106-c1                40535 2021-07-12 08:09:21 CEST
xn1_d1672ef514ecfcc167cc9ea7e32e661a-s4129c730bae7bd69106-c1               618853 2021-07-11 23:38:50 CEST
xn1_d6f23e0f11ca4240147e101dc79eea9c-s22f68a6688afceb9106-c1               356626 2021-07-11 19:50:20 CEST
xn1_dcb895005c6bacdcafd18d9c4b964eac-s680a5213dd609e9e106-c1              2209798 2021-07-11 19:05:21 CEST
xn1_ee3ed0756ac31e1358dbf6b52ed2b0bb-sd27877db2685ba6a106-c1               302636 2021-07-11 20:10:31 CEST
xn1_f278f13b86c384509d6fa5a3164e8cfa-s1e471dacc5b98bca106-c1               764390 2021-07-11 20:55:21 CEST
xn2_6ba46e4b1598fd5613a28bf990117ba1-sa778622b472d976e106-c1               114235 2021-07-12 14:42:57 CEST
xn2_6cfd715b540b8ec35c17179460fc3ed9-s2b610110a83c991b106-c1                66440 2021-07-12 10:09:31 CEST
xs0_59760cdd12f254c95d2203c267251401-s78fcf92ee35de878-c1                50979134 2021-07-12 08:15:16 CEST
xs0_59760cdd12f254c95d2203c267251401-sf15189854b061f38-c1                50979134 2021-07-12 08:00:20 CEST

OK, so I don't think the change in size is due to the change in format, since the new code would only produce x-prefixed blobs. Can you check whether some accidentally large files got included? kopia diff can tell you the difference between two snapshots.
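
For example (the snapshot IDs are placeholders; take real ones from kopia snapshot list):

$ kopia snapshot list                        # find the IDs of two snapshots to compare
$ kopia diff <snapshot-id-1> <snapshot-id-2>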

Well, the only other thing I did was to run a snapshot estimate for a new client I wanted to check. Could that somehow have caused this? Although the resulting snapshot was empty in the end…

Could it be simply that you recently added more data?

Hmm… the sizes of my two clients' snapshots didn't change much over the last few days, so acquiring 172 GB of extra data should show up in the snapshot sizes.

Well… yeah… those 172 GB are actually new data. I had a typo in my attempt to get an estimate for a new client and thus ended up getting more than just an estimate… So basically, Kopia ignored the typo and went ahead with creating the actual snapshot. I don't know if this is intended, but maybe Kopia should not continue if anything it can't parse is passed?