Maximum usable size of the repository? Petabyte scale possible?

We are looking for backup software that lets us easily push various assorted pieces of data to a central repository, for an archiving / disaster-recovery scenario.

The data lives in different places: some of it is on local disks, but we also have a bunch of storage platforms (accessible over NFS), as well as things like Gluster, Ceph, etc. All of these already have some sort of replication / snapshotting / DR plan in place; the idea here is to archive important data to an offsite system, to have a ‘worst case scenario’ recovery option. In total we are talking about roughly 1PB of data.

At the moment we are doing all of this via a bunch of primitive rsync scripts pointed at a box with a 1PB btrfs volume mounted from a SAN, plus some light btrfs snapshotting (to enable ‘point-in-time’ recovery). The solution works, but it is clunky and not very maintainable, so we are tempted to use something better here: a good CLI, a client/server model, index searching, ‘smart’ incremental backups, the ability to monitor and manage state, etc.
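For context, the current approach boils down to something like the following sketch - all paths, sources, and the snapshot location here are placeholders, not our real layout:

#!/bin/bash
# Rough sketch of the existing rsync + btrfs setup (placeholder paths/sources).
set -euo pipefail

DEST=/mnt/archive                    # btrfs subvolume backed by the SAN volume
SNAPDIR=/mnt/archive/.snapshots      # where read-only snapshots are kept
STAMP=$(date +%Y%m%d-%H%M%S)

# Mirror each source into its own subdirectory; --delete keeps the copy exact.
rsync -aHAX --delete /srv/local-data/        "$DEST/local-data/"
rsync -aHAX --delete nfs-host:/export/data/  "$DEST/nfs-data/"

# A read-only btrfs snapshot provides the cheap ‘point-in-time’ copy.
btrfs subvolume snapshot -r "$DEST" "$SNAPDIR/$STAMP"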

We tried a bunch of options, like restic and borg - and while the interface and features are great and exactly what we are looking for, all of these chunking/deduplicating tools seem to choke rather easily once a certain amount of data/chunks has been pumped into the repository (because of the ever-increasing overhead of the dedupe checks against the hash index). For instance, restic gave up after around 15TB, with the ETA only ever increasing from there, so a 100TB backup job would never complete in this case…

Kopia, on the other hand, seems to be handling this better; we are running the same test job right now, with 40TB pumped so far - and no slowdown in sight:

35 hashing, 14279300 hashed (39.7 TB), 41026 cached (141.6 GB), uploaded 39.6 TB, estimated 94.2 TB (42.3%) 34h21m30s left

I am wondering if anyone has tried Kopia at this scale? Has it got any chance of working well (including dealing with future incremental backups and repo management / pruning / etc.)? Or should I just give up and go back to the simple and trusted rsync ‘solution’? I don’t particularly care about deduplication or encryption (the data is generally already compressed and unique - and we control the storage platform completely), so if there are any tweaks/config changes that could make Kopia perform and scale better for this specific use case, I would be happy to try them.
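For reference, the kind of tuning we had in mind is roughly the following - an untested sketch only: the repo path is a placeholder, and the splitter name and flag spellings should be verified against kopia benchmark splitter and the docs for your Kopia version:

# Use a large fixed splitter, since dedup is not a priority for us and
# fixed-size splitting is cheaper than content-defined chunking.
kopia repository create filesystem \
  --path /mnt/archive/kopia-repo \
  --object-splitter FIXED-8M

# Skip compression globally - the data is already compressed at the source.
kopia policy set --global --compression none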

I am running a Kopia repo which has grown to 73TB. The size itself does not seem to be an issue, but depending on the number of snapshots you might run into the issue I encountered recently, where the remote Kopia client is unable to list its snapshots due to an error invoking the gRPC API.

This actually means that I am currently not able to restore any data on the remote client itself, since it can’t get to the snapshot manifests. The local Kopia client on the repo server doesn’t have this issue and is still able to list all the snapshots, and thus also to retrieve data from them.

I did open an issue on GitHub for that and also tried to ping @jkowalski on Slack, because I deem this a major issue for Kopia.

Thanks for providing extra color @budy!

I managed to pump in around 100TB - so far things are still working well, and backups/restores are quick.
Running commands like content stats, though, now takes a few minutes; this might become very hard to use as the repo grows.

Count: 59584984
Total Bytes: 103.8 TB
Total Packed: 97.2 TB (compression 6.4%)
By Method:
  (uncompressed)         count: 50966841 size: 95.4 TB
  zstd-fastest           count: 8618143 size: 8.4 TB packed: 1.8 TB compression: 78.9%
Average: 1.7 MB
Histogram:

        0 between 0 B and 10 B (total 0 B)
    43097 between 10 B and 100 B (total 3.4 MB)
  8275038 between 100 B and 1 KB (total 3 GB)
  1884766 between 1 KB and 10 KB (total 6.3 GB)
  3259809 between 10 KB and 100 KB (total 192.3 GB)
 16457045 between 100 KB and 1 MB (total 7 TB)
 29665229 between 1 MB and 10 MB (total 90 TB)
        0 between 10 MB and 100 MB (total 0 B)

We do not expect to have a crazy number of snapshots (maybe a dozen new ones per day), so the bug you mention is less of an issue for us. I will keep testing and report back if I spot any issues / glitches.

Do you run a client/server setup, or are you running Kopia directly on the host you’re backing up from?

On my Kopia server, a content stats run takes approx. 1 minute:

[root@jvmhh-archiv kopia]# time kopia content stats
Count: 31484294
Total Bytes: 78.3 TB
Total Packed: 78.3 TB (compression 0.0%)
By Method:
  (uncompressed)         count: 30946136 size: 78.3 TB
  zstd-fastest           count: 538158 size: 1.5 GB packed: 491.7 MB compression: 67.5%
Average: 2.5 MB
Histogram:

        0 between 0 B and 10 B (total 0 B)
    18687 between 10 B and 100 B (total 1.3 MB)
  3090857 between 100 B and 1 KB (total 1.4 GB)
  2646966 between 1 KB and 10 KB (total 9.4 GB)
  2613183 between 10 KB and 100 KB (total 111.9 GB)
  2933460 between 100 KB and 1 MB (total 1.2 TB)
 20181140 between 1 MB and 10 MB (total 77 TB)
        1 between 10 MB and 100 MB (total 23.9 MB)

real	1m15,097s
user	1m29,295s
sys	0m8,554s

I’ve got a client/server setup, but only for pushing backups from remote clients; the stats command actually ran directly on the server. It takes around 3 minutes, but I also have over twice as many objects as you do - so not too dramatic right now, although it could become a scaling annoyance in the future:

Count: 76230783
Total Bytes: 109.9 TB
Total Packed: 102.1 TB (compression 7.1%)
By Method:
  (uncompressed)         count: 58921965 size: 98 TB
  zstd-fastest           count: 17308818 size: 12 TB packed: 4.2 TB compression: 65.1%
Average: 1.4 MB
Histogram:

        0 between 0 B and 10 B (total 0 B)
   581421 between 10 B and 100 B (total 46 MB)
 15056511 between 100 B and 1 KB (total 5.5 GB)
  5714141 between 1 KB and 10 KB (total 19.9 GB)
  6223891 between 10 KB and 100 KB (total 299.4 GB)
 17838153 between 100 KB and 1 MB (total 7.4 TB)
 30816666 between 1 MB and 10 MB (total 94.4 TB)
        0 between 10 MB and 100 MB (total 0 B)

real    3m2.118s
user    5m12.038s
sys     0m11.329s
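In case it helps for comparison, the client/server wiring on our side is essentially the stock setup, roughly along these lines (host name, port, and repo path are placeholders, and the flags should be double-checked against your Kopia version):

# On the repo server: open the repo and expose it to remote clients.
kopia repository connect filesystem --path /mnt/archive/kopia-repo
kopia server start --tls-generate-cert --address 0.0.0.0:51515
# (client user accounts and TLS fingerprint handling omitted for brevity)

# On each remote client: connect to the server, then snapshot as usual.
kopia repository connect server --url https://backup-server:51515
kopia snapshot create /data/important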

Yeah, that’s like my setup as well. Pushing snapshots from the remote client to the server is not an issue. My issue is that I am not able to get any data back via the remote client, since it fails to get to the manifest data. I can mitigate this by limiting the number of snapshots fetched for each source, but that only remedies the listing of snapshots - it doesn’t help with restoring, even if you know which snapshot to restore from. I can always restore on the Kopia server, but that’s far from optimal.
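For completeness, the workaround is simply doing the restore on the repo server itself, which can still read the manifests - something like this (the source path, snapshot ID, and restore target are placeholders):

# On the Kopia server:
kopia snapshot list /data/important            # pick the snapshot to restore
kopia snapshot restore <snapshot-id> /restore/important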