Local metadata cache for incremental backup

A few months ago, there was a post about Kopia on Hacker News. Someone mentioned that it’d be nice if Kopia kept metadata files separate from data files, so that they could upload the data files to Amazon S3 while keeping the metadata locally for incremental backups. The point is that some tiers of S3 (e.g. Glacier) are extremely cheap for storage and ingress requests, but very expensive for egress data transfer.

Jarek replied that Kopia can easily cache the metadata locally for that purpose. If I understand correctly, this is the $KOPIA_CACHE_DIRECTORY directory? If I keep these directories locally, is it guaranteed that Kopia will always use the local copies of the n and q files unless there’s a cache miss, in which case a download from the remote repo would happen?
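The behaviour I am asking about is essentially a read-through cache. Here is a minimal Python sketch of that idea, just to make the question concrete. It is not Kopia's implementation; `ReadThroughCache` and `fetch_remote` are hypothetical names, and `fetch_remote` stands in for whatever performs the remote GET.

```python
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Serve blobs from a local directory; fetch from the remote only on a miss."""

    def __init__(self, cache_dir: Path, fetch_remote: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        self.fetch_remote = fetch_remote  # hypothetical stand-in for a remote GET
        self.remote_reads = 0             # number of remote downloads so far

    def get(self, blob_id: str) -> bytes:
        local = self.cache_dir / blob_id
        if local.exists():
            # Cache hit: no request is sent to the remote repository.
            return local.read_bytes()
        # Cache miss: download once, then keep a local copy for next time.
        data = self.fetch_remote(blob_id)
        self.remote_reads += 1
        local.write_bytes(data)
        return data
```

With a cache like this, repeated reads of the same metadata blob cost one remote request in total, which is what would matter for egress charges.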

If the cache directory is not for that purpose, is there a mechanism to keep a copy of metadata files locally in the current version?

Side question: if I use a local filesystem repo, the cache would be pointless since Kopia can just access the source directly, right? If so, is there a way to disable the cache for a local filesystem repo?

So I cloned the Kopia repo locally, changed the code to support a storage class when calling PutObject on S3, then proceeded to test whether this simple change would work. Unfortunately it doesn’t, because some storage classes do not allow GET.

  1. If I directly supply --storage-class=DEEP_ARCHIVE to the repository create s3 command, it ends with the error “kopia.exe: error: unable to connect to repository: error connecting to repository: unable to read format blob: unable to complete GetBlob(kopia.repository,0,-1) despite 10 retries, last error: The operation is not valid for the object’s storage class, try --help” It looks like Kopia does some read-after-write operations on S3.

  2. If I pass --storage-class=STANDARD to repository create s3, I can create the repo. All objects are stored in the standard storage class. If I then change the storage class option in the repo JSON on the fly and create a snapshot, new objects are indeed stored in the new storage class.

  3. Unfortunately, if the storage class is a Glacier class, creating a snapshot results in this error: “error running maintenance: error running maintenance: unable to get maintenance params: error looking for maintenance manifest: unable to load manifest contents: error loading manifest content: error getting cached content: unable to complete GetBlob(q872c39f3462b6c40d83b46ced9e7ee78-sbf7032cc1d0a2125108,0,-1) despite 10 retries, last error: The operation is not valid for the object’s storage class”

So at this point, simply adding a storage class option to the S3 repo is probably OK for non-Glacier classes, but not enough for the two Glacier classes. To support Glacier, Kopia probably needs to treat metadata blobs differently from data blobs. Currently it seems all blobs are treated identically in the storage layer.

Have you tried a conditional storage class based on the blob ID prefix? The p blobs contain the bulk of the data while q blobs and others contain metadata, so it might make sense (not sure about cost) to put only the p blobs in the Glacier class.
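To make the suggestion concrete, here is a minimal Python sketch of choosing the storage class from the blob ID prefix. This is an illustration of the idea, not Kopia code: `storage_class_for_blob` and its parameters are hypothetical, and the default class names are only examples of S3 storage class identifiers.

```python
DATA_BLOB_PREFIX = "p"  # p blobs hold the bulk of the data


def storage_class_for_blob(
    blob_id: str,
    data_class: str = "DEEP_ARCHIVE",  # assumed choice for data blobs
    metadata_class: str = "STANDARD",  # assumed choice for everything else
) -> str:
    """Pick a storage class for an upload based on the blob ID prefix."""
    if blob_id.startswith(DATA_BLOB_PREFIX):
        return data_class
    return metadata_class


# Blob IDs taken from the error messages earlier in this thread.
for blob_id in (
    "p87bb2c02fd9b6f1ac925b9b3af2e0762-sdc1d5ce7d406f483108",
    "q872c39f3462b6c40d83b46ced9e7ee78-sbf7032cc1d0a2125108",
    "kopia.repository",
):
    print(blob_id, "->", storage_class_for_blob(blob_id))
```

Only blobs whose ID starts with p go to the cold class; q blobs, the format blob and anything else stay readable in the standard class.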

Much better: the only operation that fails unexpectedly is maintenance run --full --safety=none. All other operations work fine.

Here is the output for maintenance:

.\kopia --config-file=S3.config maintenance run --full --safety=none
Running full maintenance...
Looking for active contents...
  Processed 36 contents, discovered 36...
Looking for unreferenced contents...
Rewriting contents from short packs...
unable to rewrite content "64d1928fd03b83a50a18756605646ec56d793cbf6b802f040303e055ecee548d": unable to get content data and info: error getting cached content: unable to complete GetBlob(p87bb2c02fd9b6f1ac925b9b3af2e0762-sdc1d5ce7d406f483108,1182885,2987) despite 10 retries, last error: The operation is not valid for the object's storage class
unable to rewrite content "a357ae7bb288742bd0326fc473e2032a69c49ac524c811c42aaa8ca304fc60c7": unable to get content data and info: error getting cached content: unable to complete GetBlob(p87bb2c02fd9b6f1ac925b9b3af2e0762-sdc1d5ce7d406f483108,6559571,1762591) despite 10 retries, last error: The operation is not valid for the object's storage class
unable to rewrite content "a474dbbc30a3d4c7cf58bf2cb67f1d3840c37d1ddaf5fc234879771302ec3c3a": unable to get content data and info: error getting cached content: unable to complete GetBlob(p87bb2c02fd9b6f1ac925b9b3af2e0762-sdc1d5ce7d406f483108,1114261,23878) despite 10 retries, last error: The operation is not valid for the object's storage class
unable to rewrite content "b2664813e60d10d3b066a56fe4a0a15510eaf6b07a0457690640953633c9f2ec": unable to get content data and info: error getting cached content: unable to complete GetBlob(p87bb2c02fd9b6f1ac925b9b3af2e0762-sdc1d5ce7d406f483108,1270393,6316) despite 10 retries, last error: The operation is not valid for the object's storage class
<skipping many lines>
Finished full maintenance.
default (64KiB) - allocated 48 chunks freed 47 alive 1 max 12 free list high water mark: 11
ERROR: error rewriting contents in short packs: failed to rewrite 37 contents

I guess full maintenance would actually modify existing blobs in some way?
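The failing calls in the output are all GetBlob, i.e. reads that happen while rewriting contents from short packs. To check which blobs are involved, the GetBlob calls can be tallied per blob ID. Here is a rough Python sketch; the regular expression is my own guess at the message format based only on the lines shown in this thread, and `failed_blob_reads` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches "GetBlob(<blob id>,<offset>,<length>)" as it appears in the errors.
GET_BLOB = re.compile(r"GetBlob\(([^,()]+),(-?\d+),(-?\d+)\)")


def failed_blob_reads(log_text: str) -> Counter:
    """Count GetBlob calls mentioned in the log, grouped by blob ID."""
    return Counter(match.group(1) for match in GET_BLOB.finditer(log_text))


# Excerpts of two error lines from the maintenance output in this post.
sample = (
    "unable to complete GetBlob(p87bb2c02fd9b6f1ac925b9b3af2e0762-sdc1d5ce7d406f483108,1182885,2987) despite 10 retries\n"
    "unable to complete GetBlob(p87bb2c02fd9b6f1ac925b9b3af2e0762-sdc1d5ce7d406f483108,6559571,1762591) despite 10 retries\n"
)

for blob_id, count in failed_blob_reads(sample).items():
    print(blob_id[0], blob_id, count)
```

If every blob ID in the tally starts with p, the failures are limited to reads of data blobs stored in the Glacier class.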

Anyway, it’s good progress. Here are my questions:

  1. Besides the “p” and “q” blobs, I also noticed many “xn”, “n” and “_log” blobs, both in this example and in a new repo I recently created after 0.9. They are usually small and fewer in number, but still take up some storage. However, when I run blob list on an old repo I created long before, such blobs are extremely few (there were only 3 non-p, non-q blobs in this 2TB repo). Is there an internal doc listing the meaning of all these blob prefixes? Are some of them good candidates for a Glacier class as well? And why am I seeing a dramatic increase in the number of these blobs in recent repos?

  2. Glacier is basically write-only storage, but during backup the metadata will inevitably be read. Can we offer an option to store or cache the non-p blobs locally, so that we can completely eliminate the need for the S3 standard storage class?

The “Storage Tiers | Kopia” page basically answered my questions. So it looks like a solution for supporting AWS S3 Glacier would be:

  1. Add an option that lets the user choose which storage class to store p blobs in. Defaults to standard. Update the S3 documentation to suggest that users turn off full maintenance if they set this to a Glacier class.
  2. Add an option that lets the user choose which storage class to store all other types of blobs in. Defaults to standard. Update the S3 documentation to warn users not to set this to a Glacier class.
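As a rough sketch of how these two options and their documented caveats could fit together (in Python for brevity, not Kopia's actual code or option names): `StorageClassOptions`, its fields and the warning texts are all hypothetical, and I am assuming the two Glacier classes are the ones S3 names GLACIER and DEEP_ARCHIVE.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the two classes that reject a direct GET.
GLACIER_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


@dataclass
class StorageClassOptions:
    """Two hypothetical options, both defaulting to the standard class."""

    data_class: str = "STANDARD"   # storage class for p blobs
    other_class: str = "STANDARD"  # storage class for all other blobs

    def warnings(self) -> List[str]:
        """Return the caveats that the documentation would spell out."""
        messages = []
        if self.data_class in GLACIER_CLASSES:
            messages.append("p blobs are in a Glacier class: turn off full maintenance")
        if self.other_class in GLACIER_CLASSES:
            messages.append("non-p blobs must not be stored in a Glacier class")
        return messages


options = StorageClassOptions(data_class="DEEP_ARCHIVE")
for message in options.warnings():
    print("warning:", message)
```

Keeping both defaults at standard means existing repositories behave exactly as before unless the user opts in.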

Anything else? I can do a pull request for these two, if you want.