Cache size recommendation


One of my VMs has a small root filesystem, and the kopia cache takes almost 50% of the available space.
Do you have a cache size recommendation, especially if I want to shrink it?

You can define the cache sizes in the config or script from which you’re running kopia. Or you could put the cache at another location and exclude that folder from being snapshotted.
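
For the second option, here is a minimal sketch. It assumes your kopia version accepts `--cache-directory` on `kopia cache set` (worth confirming with `kopia cache set --help`), and the target path is a made-up example. The function only prints the command so it can be reviewed before running it.

```shell
# Print (not run) the command that would relocate the kopia cache.
# /mnt/data/kopia-cache is a hypothetical path on a larger disk.
relocate_cache_cmd() {
  dir=$1
  printf 'kopia cache set --cache-directory=%s\n' "$dir"
}

relocate_cache_cmd /mnt/data/kopia-cache
```

If the new location sits inside a directory you snapshot, remember to exclude it in that source's policy.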

Thanks for your answer. I’m in fact backing up another filesystem; my question was more something like “do I need a cache if I back up once a day? I don’t really care if it is efficient” or “will kopia manage with only one GB of cache if I shrink the default cache?”

Well… Kopia doesn’t have to use up the configured (or default) cache size if there isn’t enough data to fill it. The cache is mostly used not to cache actual blobs of data, but rather metadata, which speeds up snapshots and other operations. There is a command which will give you a rough estimate of how large your cache might get, but I’d have to look that up, since I don’t know it off the top of my head.

Are you using kopia server or KopiaUI?

So, try this on your kopia client and see what it spits out:

kopia cache info

To give you an example: I am running kopia on my Mac and my repo is about 700 GB in size. My cache folders account for this much space:

stephan.budach@stephan ~ % kopia cache info
/Users/stephan.budach/Library/Caches/kopia/b13930a83facb640/blob-list: 4 files 4.4 KB (duration 30s)
/Users/stephan.budach/Library/Caches/kopia/b13930a83facb640/contents: 75 files 25.8 MB (limit 5.2 GB)
/Users/stephan.budach/Library/Caches/kopia/b13930a83facb640/indexes: 23 files 55.6 MB
/Users/stephan.budach/Library/Caches/kopia/b13930a83facb640/metadata: 95 files 1.3 GB (limit 5.2 GB)
/Users/stephan.budach/Library/Caches/kopia/b13930a83facb640/own-writes: 0 files 0 B
To adjust cache sizes use 'kopia cache set'.
To clear caches use 'kopia cache clear'.

So as you can see, kopia is not even close to the max. allowed limit and I could just reduce the cache size by using

kopia cache set
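
For the 1 GB case asked about earlier, a call might look like the sketch below. It uses the `--metadata-cache-size-mb` flag that appears later in this thread plus `--content-cache-size-mb`; the 200/800 split is only an illustration, not a recommendation. The helper prints the command rather than running it.

```shell
# Build a 'kopia cache set' command line for the given limits (in MB).
# Prints the command instead of executing it, so it can be checked first.
cache_set_cmd() {
  content_mb=$1
  metadata_mb=$2
  printf 'kopia cache set --content-cache-size-mb=%d --metadata-cache-size-mb=%d\n' \
    "$content_mb" "$metadata_mb"
}

# Illustrative split: 200 MB for contents, 800 MB for metadata (about 1 GB total)
cache_set_cmd 200 800
```

Afterwards, `kopia cache info` should show the new limits next to the contents and metadata lines.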

Thanks! That seems to be exactly what I need. I use the kopia CLI (not the UI).

Wonder if one could have something that checks total disk size. For a <10G disk, defaulting to 5G seems like overkill, but for any other “normal” setup, 5G is probably a decent choice.

Wonder if one could have something that checks total disk size.

df -h


I meant that when you set the client up for the first time and it decides “5G is a good default size” it could check the disk size it runs on, and if the cache is larger than say 50% of the whole drive, then that size is silly and it should aim for something far smaller. Not that I should do it manually on each client.
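
Until something like that exists in kopia itself, the idea can be approximated in a setup script. This is only a sketch: the 5000 MB default and the 50% threshold come from the posts above, while the 10% fallback and both function names are my own assumptions.

```shell
# Suggest a cache limit (MB) for a disk of the given total size (MB).
# If the 5000 MB default exceeds 50% of the disk, fall back to 10% of the disk
# (the 10% figure is an arbitrary assumption).
suggest_cache_mb() {
  disk_mb=$1
  default_mb=5000
  if [ "$default_mb" -gt $((disk_mb / 2)) ]; then
    echo $((disk_mb / 10))
  else
    echo "$default_mb"
  fi
}

# Total size in MB of the filesystem holding a directory (POSIX df, 1 KB blocks).
disk_total_mb() {
  df -Pk "$1" | awk 'NR == 2 { print int($2 / 1024) }'
}

# Usage: suggest_cache_mb "$(disk_total_mb "$HOME")"
```

The result could then be passed to `kopia cache set` once per client during provisioning.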

There is a command in Kopia CLI which you can run to estimate how much cache you should provide. This is for metadata, which is by far the most critical. If you run

kopia blob stats --prefix=q

you get the total size of the metadata (q) blobs, which you should then multiply by 2x up to 5x, as posted here:

When I run this on my 1 TB repo, I am getting this…

[root@kopia repos]# kopia blob stats --prefix=q
Count: 100
Total: 1.7 GB
Average: 16.7 MB

    0 between 0 B and 10 B (total 0 B)
    0 between 10 B and 100 B (total 0 B)
    0 between 100 B and 1 KB (total 0 B)
    3 between 1 KB and 10 KB (total 25.4 KB)
    4 between 10 KB and 100 KB (total 67.7 KB)
    0 between 100 KB and 1 MB (total 0 B)
    6 between 1 MB and 10 MB (total 27.4 MB)
   87 between 10 MB and 100 MB (total 1.6 GB)

So… 1.6 x 5 is 8 GB, which is above the 5 GB default, but I don’t expect this repo to grow much larger, so I’d probably be okay with even 2 GB…
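
The 2x to 5x rule of thumb is easy to script. A minimal sketch (the function name is made up) that takes the reported Total in MB and prints both ends of the range:

```shell
# Print the 2x and 5x cache estimates (in MB) for a metadata total given in MB.
cache_range_mb() {
  total_mb=$1
  echo "$((total_mb * 2)) $((total_mb * 5))"
}

cache_range_mb 1700   # Total: 1.7 GB from the output above
```

For the 1.7 GB total above this prints `3400 8500`, i.e. somewhere between 3.4 GB and 8.5 GB.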


That’s what I get for a repo of ~3.7TB total size:

Got 10000 blobs...
Count: 16224
Total: 67.6 GB
Average: 4.2 MB

        0 between 0 B and 10 B (total 0 B)
        0 between 10 B and 100 B (total 0 B)
        0 between 100 B and 1 KB (total 0 B)
      866 between 1 KB and 10 KB (total 4 MB)
     5217 between 10 KB and 100 KB (total 240.9 MB)
     2191 between 100 KB and 1 MB (total 823.2 MB)
     6079 between 1 MB and 10 MB (total 29.6 GB)
     1871 between 10 MB and 100 MB (total 36.9 GB)

That’s quite huge!

Well… it may depend on your change ratio and structure of your data. Let me check that on my two other repos…

This is a 2.8 TB repo, which holds the backups of a mail cluster:

[root@kopia ~]#  kopia blob stats --prefix=q
Count: 3773
Total: 71.9 GB
Average: 19.1 MB

        0 between 0 B and 10 B (total 0 B)
        0 between 10 B and 100 B (total 0 B)
        0 between 100 B and 1 KB (total 0 B)
      101 between 1 KB and 10 KB (total 476.3 KB)
       15 between 10 KB and 100 KB (total 484 KB)
       23 between 100 KB and 1 MB (total 12 MB)
      182 between 1 MB and 10 MB (total 0.9 GB)
     3452 between 10 MB and 100 MB (total 71 GB)

So… 71 GB for 2.8 TB

And this one here…

[root@kopia kopia]#  kopia blob stats --prefix=q
Got 10000 blobs...
Got 20000 blobs...
Got 30000 blobs...
Got 40000 blobs...
Got 50000 blobs...
Count: 57871
Total: 145.9 GB
Average: 2.5 MB

        0 between 0 B and 10 B (total 0 B)
        0 between 10 B and 100 B (total 0 B)
        0 between 100 B and 1 KB (total 0 B)
    44496 between 1 KB and 10 KB (total 192.5 MB)
     3531 between 10 KB and 100 KB (total 128.3 MB)
     1205 between 100 KB and 1 MB (total 329.8 MB)
      132 between 1 MB and 10 MB (total 258 MB)
     8507 between 10 MB and 100 MB (total 145 GB)

This is a 56 TB repo of a file server that has been kopia-snapshotted for at least a year now. As you can see, it’s “only” 145 GB for 56 TB. So maybe the chunk size/blob size needs to be adjusted to the type of data and the file sizes that mostly end up in the repo.
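
To compare the repos posted so far, it helps to express the metadata total as a share of the repository size. A small sketch, assuming decimal units (1 TB = 1000 GB); the function name is made up:

```shell
# Metadata total as a percentage of repository size.
# Arguments: metadata total in GB, repository size in TB (decimal units).
meta_ratio() {
  awk -v m="$1" -v r="$2" 'BEGIN { printf "%.2f%%\n", m / (r * 1000) * 100 }'
}

meta_ratio 1.7 1      # 1 TB repo
meta_ratio 67.6 3.7   # ~3.7 TB repo
meta_ratio 71.9 2.8   # mail cluster
meta_ratio 145.9 56   # file server
```

That gives roughly 0.17%, 1.83%, 2.57% and 0.26%, so the ratio varies by about a factor of 15 between these repos, which fits the guess that it depends on the kind of data.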

Not that I should do it manually on each client.

echo "Calculating, please wait..."
# Size statistics of all metadata (q) blobs
rc=$(kopia blob stats --prefix=q)
# Convert the "Total:" value to MB (decimal units) and multiply it by 5
size_in_MB=$(echo "${rc}" | awk '/^Total:/ {
  f = ($3 == "TB") ? 1000000 : ($3 == "GB") ? 1000 : ($3 == "MB") ? 1 : ($3 == "KB") ? 0.001 : 0.000001
  print int($2 * 5 * f)
}')
[ -n "${size_in_MB}" ] && {
  printf "\n%s\n\n" "Required cache size: ${size_in_MB} MB"
  printf "Run following command to setup new cache policy:\n"
  printf "%s%d\n\n" "kopia cache set --metadata-cache-size-mb=" "${size_in_MB}"
} || {
  printf "\nErr: can't calculate kopia's cache size\n\n"
  exit 1
}
### uncomment for actual cache setting
#kopia cache set --metadata-cache-size-mb=${size_in_MB}

Based on the numbers folks are reporting here and elsewhere, it looks like real-world usage ends up producing more metadata cache than what I was anticipating. We will need to improve metadata formats to reduce the need to cache as much.

For maximum efficiency the cache needs to hold at least:

  • the complete most recent directory listing for each snapshot source (to facilitate incremental snapshots)
  • all manifests (snapshots, policies)

There are at least 3 issues today that make this less efficient than it could be:

  1. Metadata cache is based on whole blobs (q), which reduces the number of fetches from the store at the expense of storing more than is needed. We can try changing this (maybe with a flag) - it should be easy to try - the data/content cache already supports both full blobs and partial contents. It may turn out that the cost of sweeping the cache is greatly increased (due to having to store a huge number of small files), so perhaps some completely new cache format (not file-per-cache-item) will need to be developed.

  2. Directory objects are not currently compressed. This is hard to fix with v1 format but trivial with v2 format (which has been in use since v0.9), we should absolutely do it and it’s a cheap fix. Enable internal directory compression for `k` contents · Issue #1541 · kopia/kopia · GitHub

  3. Currently directories are always written in a non-incremental way. A large directory with a single change between snapshots is going to be stored (effectively) twice without leveraging any compression or differential storage that would be possible here.
    Sharded directory format for large directories with lots of files spread over time. · Issue #1542 · kopia/kopia · GitHub has one idea for improving it. This would require rolling out a new repository format, as old kopia clients won’t be able to read it.

My time these days is quite limited (having recently switched jobs) and 1 & 3 are quite large projects. I would really love it if somebody from the community could help implement those changes. I will be happy to guide new contributors making those changes, and I hope they become more regular Kopia contributors.

Please ping me on Slack if you’re interested.

I’m wondering if the kopia cache is needed for performance on a local filesystem repository server itself?

The kopia cache (mainly metadata) filled up the root filesystem of the VM running the repository server; the repository itself is stored as local files on another disk.

$ kopia cache info
/home/kopia/.cache/kopia/52c2f41b2feb9ade/blob-list: 229 files 4.5 KB (duration 30s)
/home/kopia/.cache/kopia/52c2f41b2feb9ade/contents: 7 files 10.8 MB (limit 5.2 GB, min sweep age 10m0s)
/home/kopia/.cache/kopia/52c2f41b2feb9ade/indexes: 1114 files 203.4 MB
/home/kopia/.cache/kopia/52c2f41b2feb9ade/metadata: 3714 files 4.8 GB (limit 5.2 GB, min sweep age 24h0m0s)
/home/kopia/.cache/kopia/52c2f41b2feb9ade/own-writes: 43 files 52 B

(local file repository is 994 GB)
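
To spot this before the disk fills up, the `kopia cache info` listing can be turned into a usage percentage per cache. The following is a rough sketch that assumes the line format shown above (no spaces in the cache path, decimal units); the function name is made up.

```shell
# Read 'kopia cache info' lines on stdin and print "<cache> <used>%" for
# every cache that reports a limit. Assumes the line format shown above.
cache_usage() {
  awk '
    function to_mb(v, u) {
      sub(/[,)]$/, "", u)            # strip trailing "," or ")" from the unit
      if (u == "B")  return v / 1000000
      if (u == "KB") return v / 1000
      if (u == "MB") return v
      if (u == "GB") return v * 1000
      if (u == "TB") return v * 1000000
      return -1                      # unknown unit
    }
    $6 == "(limit" {
      used = to_mb($4, $5)
      limit = to_mb($7, $8)
      if (used >= 0 && limit > 0) {
        n = split($1, parts, "/")    # last path component is the cache name
        name = parts[n]
        sub(/:$/, "", name)
        printf "%s %d%%\n", name, used * 100 / limit
      }
    }'
}

# Usage (assuming the listing is written to stdout):
#   kopia cache info | cache_usage
```

Fed the listing above, it reports the metadata cache at 92% of its limit and the contents cache at 0%.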
