Maximum usable size of the repository? Petabyte scale possible?

We are looking for backup software that lets us easily push various assorted pieces of data to a central repository for an archiving / disaster-recovery scenario.

The data lives in different places: some of it is on local disks, but we also have a bunch of storage platforms (accessible over NFS), as well as things like Gluster, Ceph, etc. All of these already have some sort of replication / snapshotting / DR plan in place; the idea here is to archive the important data to an offsite system, to have a ‘worst case scenario’ recovery option. In total we are talking about 1PB of stuff or so.

At the moment we are doing all of this via a bunch of primitive rsync scripts pushing to a box where a 1PB btrfs block volume is mounted from a SAN, plus some light btrfs snapshotting (to enable ‘point-in-time’ recovery). The solution works, but it is clunky and not very maintainable, so we are tempted to replace it with something better: a good CLI, a client/server model, index searching, ‘smart’ incremental backups, the ability to monitor and manage the state, etc.
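
For context, each of those scripts boils down to roughly the following; the paths, flags and snapshot layout here are illustrative rather than our actual setup:

    #!/bin/bash
    # Pull one source tree into a subvolume on the 1PB btrfs volume,
    # then take a read-only snapshot for 'point-in-time' recovery.
    SRC="nfs-host:/export/projects/"        # hypothetical source share
    DEST="/mnt/archive/data/projects/"      # subdir of the backup subvolume

    rsync -aHAX --delete --numeric-ids "$SRC" "$DEST"

    # Dated, read-only snapshot of the whole backup subvolume
    btrfs subvolume snapshot -r /mnt/archive/data \
        "/mnt/archive/.snapshots/data-$(date +%F)"

Multiply that by one script per source, plus some cron glue and ad-hoc cleanup of old snapshots, and you get the picture.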

We tried a bunch of options, like restic and borg, and while the interface and features are great and exactly what we are looking for, all of these chunking/deduplicating tools seem to choke rather easily once a certain amount of data/chunks has been pumped into the repository (because of the ever-increasing overhead of dedupe checks against the hash index). For instance, restic gave up at around 15TB, with the ETA only ever increasing from there, so a 100TB backup job would never complete in this case…

Kopia, on the other hand, seems to be handling this better; we are running the same test job right now, with 40TB pumped so far, and no slowdown in sight:

35 hashing, 14279300 hashed (39.7 TB), 41026 cached (141.6 GB), uploaded 39.6 TB, estimated 94.2 TB (42.3%) 34h21m30s left

I am wondering if anyone has tried kopia at this scale? Does it have any chance of working well at all (including dealing with future incremental backups and repo management / pruning / etc.)? Or should I just give up and go back to the simple and trusted rsync ‘solution’? I don’t particularly care much about deduplication or encryption (most of the data is already compressed and unique, and the storage platform is completely under our control), so if there are any tweaks/config changes possible to make kopia perform and scale better for this specific use case, I would be happy to try them.
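
For what it's worth, the kind of setup and tweaks I had in mind look roughly like this; the flag names and values are my assumptions from skimming the docs, not a config I have validated at this scale:

    # Repository on the same SAN-backed filesystem we use today
    kopia repository create filesystem --path /mnt/archive/kopia-repo

    # Most of the data is already compressed, so skip compression globally
    kopia policy set --global --compression=none

    # Keep retention modest to limit index growth
    kopia policy set --global --keep-latest=14 --keep-daily=30

    # One snapshot job per source tree
    kopia snapshot create /mnt/staging/projects

    # Periodic full maintenance to compact indexes and drop unreferenced data
    kopia maintenance run --full

If there are other knobs worth turning for a repository this size (splitter, hashing, pack sizes, etc.), pointers would be very welcome.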
