Maximum usable size of the repository? Petabyte scale possible?

We are looking for backup software that allows us to easily push various assorted pieces of data to a central repository for archiving / disaster-recovery scenarios.

The data lives in different places: some of it is on local disks, but we also have a bunch of storage platforms (accessible over NFS), as well as things like Gluster, Ceph, etc. All of these already have some sort of replication / snapshotting / DR plan in place; the idea here is to archive important data to an offsite system, to have a ‘worst case scenario’ recovery option. In total we are talking about 1PB of stuff or so.

At the moment we are doing all of this via a bunch of primitive rsync scripts targeting a box where a 1PB btrfs volume is mounted from a SAN, plus some light btrfs snapshotting (to enable ‘point-in-time’ recovery). The solution works, but it is clunky and not very maintainable, so we are tempted to use something better here: a tool with a good CLI, a client/server model, index searching, ‘smart’ incremental backups, the ability to monitor and manage state, etc.
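For context, our current approach boils down to roughly the following (a minimal sketch with made-up paths and retention, not our actual scripts):

```shell
#!/bin/sh
# Illustrative sketch only - hypothetical paths, not the real scripts.
SRC="nfs-host:/export/data"
DEST="/mnt/archive/data"            # on the 1PB btrfs volume from the SAN
SNAPDIR="/mnt/archive/.snapshots"

# Mirror the source, deleting files that vanished upstream
rsync -aH --delete "$SRC/" "$DEST/"

# Read-only btrfs snapshot for 'point-in-time' recovery
btrfs subvolume snapshot -r /mnt/archive "$SNAPDIR/$(date +%Y%m%d-%H%M)"
```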

We tried a bunch of options, like restic and Borg - and while the interface and features are great and exactly what we are looking for, it seems all of these chunking/deduplicating tools choke rather easily once a certain amount of data/chunks has been pumped into the repository (because of the ever-increasing overhead of doing the dedupe checks against the hash index). For instance, restic gave up after around 15TB, with the ETA only ever increasing from there, so a 100TB backup job would never complete in this case…

Kopia, on the other hand, seems to be handling this better; we are running the same test job right now, with 40TB pumped in so far - and no slowdown in sight:

35 hashing, 14279300 hashed (39.7 TB), 41026 cached (141.6 GB), uploaded 39.6 TB, estimated 94.2 TB (42.3%) 34h21m30s left

I am wondering if anyone has tried Kopia at this scale? Has it got any chance of working well (including dealing with future incremental backups and repo management / pruning / etc.)? Or should I just give up and go back to the simple and trusted rsync ‘solution’? I don’t particularly care about deduplication or encryption that much (the data is generally already compressed and unique - and we control the storage platform completely), so if there are any tweaks/config changes possible to make Kopia perform and scale better for this specific use case, I would be happy to try them.
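For the record, these are the tuning knobs I intend to try first. The commands are as I understand them from the Kopia docs; exact flag names may differ between versions, so treat this as a sketch:

```shell
# Benchmark hashing/encryption and splitter choices on this hardware
kopia benchmark crypto
kopia benchmark splitter

# Since the data is already compressed, skip compression globally
kopia policy set --global --compression=none
```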


I am running a Kopia repo which has grown to 73TB. The size itself seems not to be an issue, but depending on the number of snapshots, you might run into the issue I encountered recently, where the remote Kopia client is unable to list its snapshots due to an error invoking the gRPC API.

This actually means that I am currently not able to restore any data on the remote client itself, since it can’t get to the snapshot manifests. The local Kopia client on the repo server doesn’t have this issue and is still able to list all the snapshots and thus also to retrieve data from them.

I did open an issue on GitHub for that and also tried to ping @jkowalski on Slack, because I deem this a major issue for Kopia.


Thanks for providing extra color @budy!

I managed to pump in around 100TB - so far things are still working well, and backups/restores are quick.
Running a command like content stats, though, now takes a few minutes; this might become very hard to use as the repo grows.

Count: 59584984
Total Bytes: 103.8 TB
Total Packed: 97.2 TB (compression 6.4%)
By Method:
  (uncompressed)         count: 50966841 size: 95.4 TB
  zstd-fastest           count: 8618143 size: 8.4 TB packed: 1.8 TB compression: 78.9%
Average: 1.7 MB
Histogram:

        0 between 0 B and 10 B (total 0 B)
    43097 between 10 B and 100 B (total 3.4 MB)
  8275038 between 100 B and 1 KB (total 3 GB)
  1884766 between 1 KB and 10 KB (total 6.3 GB)
  3259809 between 10 KB and 100 KB (total 192.3 GB)
 16457045 between 100 KB and 1 MB (total 7 TB)
 29665229 between 1 MB and 10 MB (total 90 TB)
        0 between 10 MB and 100 MB (total 0 B)

We do not expect to have a crazy number of snapshots (maybe a dozen new ones per day), so the bug you mention is less of an issue for us. I will keep testing and report back if I spot any issues / glitches.

Do you run a client/server setup, or are you running Kopia on the host you’re backing up from?

On my Kopia server, a content stats run takes approx. 1 min:

[root@jvmhh-archiv kopia]# time kopia content stats
Count: 31484294
Total Bytes: 78.3 TB
Total Packed: 78.3 TB (compression 0.0%)
By Method:
  (uncompressed)         count: 30946136 size: 78.3 TB
  zstd-fastest           count: 538158 size: 1.5 GB packed: 491.7 MB compression: 67.5%
Average: 2.5 MB
Histogram:

        0 between 0 B and 10 B (total 0 B)
    18687 between 10 B and 100 B (total 1.3 MB)
  3090857 between 100 B and 1 KB (total 1.4 GB)
  2646966 between 1 KB and 10 KB (total 9.4 GB)
  2613183 between 10 KB and 100 KB (total 111.9 GB)
  2933460 between 100 KB and 1 MB (total 1.2 TB)
 20181140 between 1 MB and 10 MB (total 77 TB)
        1 between 10 MB and 100 MB (total 23.9 MB)

real	1m15,097s
user	1m29,295s
sys	0m8,554s

I have a client/server setup, but only for pushing backups from remote clients. The stats command actually ran directly on the server. It takes around 3 minutes, but I also have over twice as many objects as you do - so not too dramatic right now, although it could become a scaling annoyance in the future:

Count: 76230783
Total Bytes: 109.9 TB
Total Packed: 102.1 TB (compression 7.1%)
By Method:
  (uncompressed)         count: 58921965 size: 98 TB
  zstd-fastest           count: 17308818 size: 12 TB packed: 4.2 TB compression: 65.1%
Average: 1.4 MB
Histogram:

        0 between 0 B and 10 B (total 0 B)
   581421 between 10 B and 100 B (total 46 MB)
 15056511 between 100 B and 1 KB (total 5.5 GB)
  5714141 between 1 KB and 10 KB (total 19.9 GB)
  6223891 between 10 KB and 100 KB (total 299.4 GB)
 17838153 between 100 KB and 1 MB (total 7.4 TB)
 30816666 between 1 MB and 10 MB (total 94.4 TB)
        0 between 10 MB and 100 MB (total 0 B)

real    3m2.118s
user    5m12.038s
sys     0m11.329s

Yeah, that’s like my setup as well. Pushing snapshots from the remote client to the server is not an issue. My issue is that I am not able to get any data back via the remote client, since it fails to get to the manifest data. I can mitigate this by limiting the number of snapshots fetched for each source, but that only remedies the listing of snapshots - it doesn’t help with restoring, even if you know which snapshot to restore from. I can always restore on the Kopia server, but that’s just far from optimal.
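For anyone hitting the same wall: the server-side workaround is just the plain CLI restore, which still works there. The source path and snapshot ID below are placeholders:

```shell
# On the repository server itself: list snapshots for a source,
# then restore one by its ID (placeholder values shown)
kopia snapshot list /srv/shares/share01
kopia snapshot restore k1b2c3d4e5f6a7b8 /tmp/restore-target
```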

I pulled the trigger on it; we have ~257TB in the repo for now (after compression) and are pumping in more:

Count: 126900820
Total Bytes: 283.2 TB
Total Packed: 257.6 TB (compression 9.0%)
By Method:
  (uncompressed)         count: 86253380 size: 193.6 TB
  zstd-fastest           count: 40647440 size: 89.6 TB packed: 64.1 TB compression: 28.5%
Average: 2.2 MB
Histogram:

        0 between 0 B and 10 B (total 0 B)
   927062 between 10 B and 100 B (total 73.4 MB)
 18537306 between 100 B and 1 KB (total 6.9 GB)
  8198209 between 1 KB and 10 KB (total 29 GB)
  9680197 between 10 KB and 100 KB (total 437.5 GB)
 22635108 between 100 KB and 1 MB (total 9.6 TB)
 66922938 between 1 MB and 10 MB (total 247.6 TB)
        0 between 10 MB and 100 MB (total 0 B)

Things seem to be OK for now; everything works (well, besides this weird corruption thing; it is still unclear what caused it, but luckily it was easily fixable once you know what is going on).

We will see how it goes, but so far I am very impressed by the design and engineering behind Kopia - it is very elegant, feature-rich, efficient (multithreading everywhere, no hard locks or bottlenecks on repos/resources, with an ‘eventually consistent’ model) and seems to scale well.


“Look at the size of that thing…” :slight_smile:

My biggest Kopia archive is about 2TiB. So… wow. Glad to know that Kopia scales well (Duplicati and restic are more problematic in this area).

Hi,

I would like to know: what is the base system configuration for a KRS running maintenance on repositories in the TB range?

How many processor cores and how much memory are utilised?

Thanks,
Regards,
SD

That really depends on your setup and workload. A KRS handling a single KC doesn’t need much in terms of cores or memory. You should be attentive to the number of snapshots in your repo, though.

In fact, the number of snapshots seems to be more of a concern than the size of the repo, at the moment. When my 65TB repo passed 196k snapshots, the KRS even got flaky and would crash after some days of runtime. So, if you start out with KRS/KC, it’s probably wise to set your global snapshot policy to not store identical snapshots right from the start.
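A retention policy along these lines also helps keep the snapshot count bounded (the numbers here are only an example; tune them to your needs):

```shell
# Hypothetical retention numbers - adjust to your own requirements
kopia policy set --global \
  --keep-latest 10 --keep-hourly 0 --keep-daily 14 \
  --keep-weekly 8 --keep-monthly 12 --keep-annual 2
```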

Things get more involved as you up the number of KCs that connect to the KRS at the same time. E.g. my setup involves a Samba server with hundreds of shares, which I want as separate sources in the KRS. So I wrote a little script which will invoke up to 10 KCs (using --parallel 10) simultaneously to plow through those shares, to be able to finish my daily snapshots within one hour or so. In this case, all of the cores on my KRS server are utilized, but the system barely uses more than 32GB of RAM (out of 256).
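Conceptually, the script is nothing fancy - something like this (a simplified sketch with hypothetical paths, not my actual script):

```shell
#!/bin/sh
# Run up to 10 Kopia clients at once, one per Samba share
ls -d /srv/samba/shares/* | xargs -P 10 -I{} \
  kopia snapshot create "{}" --parallel 10
```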

Since the main work is done by the KC anyway, you should pay attention to that side as well. A fast CPU for hashing/encryption is key for performance, as is a fast local cache of the appropriate size.


@budy Sorry for the off-topic question, but I’m curious: do you use KopiaUI? If so, how well does it perform with big numbers of snapshots (like 3k)? I understand that with your huge number of snapshots it might be somewhat laggy, but what about when you only had 3k or 4k snapshots?

I’m more concerned about the restore process. With restic it is a bit complicated and ugly, given that on Windows it restores everything inside a sub-directory when you back up entire drives. But considering that Kopia has a UI and that you can actually browse the snapshots without much trouble (without long waits between operations, I mean; I use the rclone backend), I would think the restoring is smooth too?

Being able to browse the snapshots is what I like most about Kopia (and KopiaUI); the backup process I can handle with scripts, so no problem there.

From my experience, it is the overall number of snapshots in a repository that causes Kopia to slow down when it comes to navigating those snapshots in general.

My repo had grown to accommodate approx. 196k snapshots when Kopia server started to have serious issues. At that time I had approx. 440 sources with 400 snapshots each.

What I can say is: since I reduced the number of snapshots in the repo by deleting all duplicate snapshots, bringing it down to 20k+ snapshots, Kopia Web is as snappy as it was at the beginning. However, I don’t have a single source with 3k+ snapshots, so I cannot say that this could not be an issue - although I highly doubt it.


On the KC agent side, not the server side?

Great info, thx!

For running a regular snapshot/backup… yes. I expect that maintenance will be performed on the server side, so rewriting packs would run on the server hardware. Just for illustration: I ran up to 10 KC agents in parallel to get through my 400+ snapshots in 2 hrs at night. That was on a dual-CPU 8-core system. In the end, the most likely bottleneck will be storage performance… :wink:


Trying to avoid another thread, but will create if necessary.

I’m trying to get a handle on a general sizing guide for a repository server setup.

If you’re backing up x GB of data:
You should have x GB (x% of the backed-up data?) for the cache
You should have x GB of RAM
You should have x cores for every x clients
You should consider x for every x clients to avoid bottlenecks
You should monitor x for total snapshots in the repo and keep it to less than x
You should modify the x scheduler to optimize as your repo server grows

I’m trying to fill in all the x’s :slight_smile:

Of course this would be a very generalized guide that just gives broad recommendations. I plan on doing some PRs to improve the docs (if I can figure that out) once I can summarize all this and have a better grasp on everything.

I’m wondering how you can all pump so much data into Kopia and call it stable? Especially now that I read about corruption, I’m like, yeah…

I’ve evaluated Borg, then restic (6TB, 6.9M files), and now I’m trying Kopia…

From the reports I was very much looking forward to it, especially with regard to handling more data than the former two in a better way - but then I got stuck with the Kopia Repository Server.

It’s not just the bad documentation: after finally getting TLS to work with my externally created certificates, the server just won’t start anymore after creating the repository… ranging from mildly frustrating to totally off-putting.

This is all far from the experience I had with something like restic or Borg, even years ago - and not something I would like to see in a backup/restore solution that I will have to trust my data with.

Maybe that’s just bound to the Kopia Repository Server itself?

Is everyone here using MinIO S3, or even AWS? Is that data you really don’t need? How are you using Kopia?

@jit010101 to whom are you referring? I did not have any corruption issues with my 80+TB repo. We do suffer from some other issue, which seems to be related to RAM, where Kopia will just bugger out, but all of my maintenance tasks complete without issues.