Question: Can two distinct local repositories sync-to the same remote repository?

Context

I’m currently in the process of re-designing my “backup architecture”, and I was hoping to get some clarification around this question before I proceed. I think I know the answer, but confirmation would be helpful!

Topology:

  • Let “DESKTOP” be my desktop computer.
  • Let “LAPTOP” be my laptop computer.
  • Let “LOCAL COPY” be a locally-networked GlusterFS triple-replicated cluster.
  • Let “OFFSITE COPY” be Backblaze B2 (or similar, but that’s what I’m currently using).

In short, here’s the desired use case (sketched with rough commands after the list):

  • DESKTOP runs frequent backups, backing up locally to its own filesystem.
  • LAPTOP runs frequent backups, backing up locally to its own filesystem.
  • BOTH DESKTOP and LAPTOP use sync-to to redundantly copy to the LOCAL COPY “remote” repo.
  • BOTH DESKTOP and LAPTOP use sync-to to redundantly copy to the OFFSITE COPY remote repo.
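
Concretely, I’m imagining each machine running something along these lines (purely a sketch; repo paths, bucket names, and credentials are placeholders, and I’d double-check the exact sync-to flags against the docs):

# Frequent: snapshot into the machine's own local repository
kopia snapshot create $HOME

# Then mirror that local repository to both "remote" destinations
kopia repository sync-to filesystem --path=/mnt/gluster/kopia-repo
kopia repository sync-to b2 --bucket=my-offsite-bucket --key-id=... --key=...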

Question

So, in a nutshell, what I’m asking in plain English is: can distinct repositories (in this case, DESKTOP and LAPTOP each have their own local repositories) sync-to the same “remote” repository/destination?

Additional Context

What I’m doing currently:

  • Both DESKTOP and LAPTOP run snapshots of $HOME using the same repo hosted on LOCAL COPY (network FS).
  • LOCAL COPY has sync-to invoked on it once nightly, pushing to B2 (OFFSITE COPY); the cron entry is roughly the one sketched below.
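
The nightly job is essentially just a cron entry along these lines, run on a machine connected to the LOCAL COPY repo (time, bucket name, and credentials are illustrative):

0 2 * * * kopia repository sync-to b2 --bucket=my-offsite-bucket --key-id=... --key=...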

Why I want to make changes:

  • Whereas I currently keep most of my personal data under $HOME, I want to start partitioning it out and getting more granular, for (at least) the following reasons:

    • Importance/priority: I care about some of the data more than other parts.
    • Backup frequency: to the above point, there are parts of my data that I want to back up more frequently (e.g. “hot” vs. “cold”).
    • Size: some of my data (e.g. GoPro video) is far larger and would be better suited for different media (e.g. NVMe vs spinning HDD).
    • Bandwidth: long story short, in the short term I’m capped at 5 Mbps upload (Comcast, grrr), which imposes significant restrictions on seeding backups over the WAN. A fresh backup of a ~500 GB working set takes 1-2 weeks to finish.
  • Why distinct repos for each of DESKTOP and LAPTOP?

    • Currently, my triple-replicated GlusterFS cluster (LOCAL COPY) is a single point of failure. If one of my nodes dies and it goes into read-only mode, I lose the ability to back up for some period of time… and more importantly, with supply chain shortages right now, I might not even be able to find suitable replacements for a reasonable price (using Odroid SBCs that are no longer available). I had a scare recently that really spooked me in this regard.
    • I want the ability to continue to back up while completely offline, namely for LAPTOP.
    • While LAPTOP can mount LOCAL COPY over WAN via Wireguard VPN, there may be situations where it can only back up to OFFSITE COPY (B2) for one reason or another (e.g. firewall issue). If/when that’s the case, I don’t want to halt my other sync-to backup(s).
  • When it comes to multiple devices, there’s a “core” set of data (~30 GB) that I want to sync between them, 80-90%+ of which will be redundant (the primary differences being dotfiles and installation-specific data).

    • That’s to say, it’s largely the same, but there are also some differences (which is where content-addressed storage comes into play nicely).
    • As for the remainder (~500-600GB presently), it doesn’t necessarily need to travel with me everywhere I go (at least “locally” – I can make it available via a network share or similar), nor does it need to be backed up as frequently.
    • However, currently, I’m backing up 100% of my data everywhere, every time, and I’m finding myself wanting more granularity/flexibility.

I sincerely appreciate those who’ve taken the time to read this far and/or chime in; hopefully this all makes sense! Happy to clarify anything that I may have done a poor job of explaining as far as my use case goes!

Thanks for the write-up. In general, sync-to can’t really be used to merge repositories, because the index format makes that impossible. There may be other ways, though:

I have a few observations:

  1. Relying on sync-to as a safety measure has an issue: you may be replicating data corruption (bitrot or bugs) present at the source, because sync-to just copies files and does not perform consistency checks as it does so.

  2. I would recommend against using fancy networked filesystems (like GlusterFS) because they sometimes don’t provide expected consistency guarantees (like read-after-write, list-after-write). Not sure if Gluster has those, but I would stay away from that as much as possible.

  3. I would similarly recommend using good filesystems with bitrot protection (zfs, btrfs) and performing regular scrubs of data to ensure it’s still ok (this is critical, otherwise bitrot protection does not really do its job).

  4. Try to make data recovery as trivial as possible; each layer of complexity can make it harder or slower to recover. I would consider eliminating or simplifying the VPN and GlusterFS layers if you can. Dealing with data loss when it happens is one thing, but not being able to get to the backup is another (when the VPN is suddenly misconfigured, or Gluster decides to develop a split brain, etc.).

Overall, instead of synchronizing, I would generally use several independent repositories and back up to them individually, possibly on different schedules, as much as the bandwidth allows.

  1. Each machine backs up all data to its local repository.
  2. Each machine backs up all data to the other machine’s repository (desktop->laptop, laptop->desktop), using something like SFTP, WebDAV, or perhaps even the Kopia repository server; not sure how practical that is.
  3. Each machine backs up only important data to the cloud while you’re internet-bandwidth constrained (see the sketch after this list).
  4. And most importantly, test your backups regularly.
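
For example, something roughly like this on each machine (paths, schedules, and what counts as “important” are up to you; a second config file lets the two repository connections coexist on one machine):

# One-time setup: a local repository plus a separate cloud repository
kopia repository create filesystem --path=/backup/kopia-local
kopia repository create b2 --bucket=my-important-bucket --key-id=... --key=... --config-file=$HOME/.config/kopia/b2.config

# Frequent: snapshot everything to the local repository
kopia snapshot create $HOME

# Less frequent, bandwidth permitting: snapshot only the important subset to the cloud
kopia snapshot create $HOME/important --config-file=$HOME/.config/kopia/b2.config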

Sincere thanks (yet again!) @jkowalski for your time, the speedy and detailed reply, and of course your work on this incredible project! I probably said this the last time you gave me a similar response in another thread, but Kopia is exactly what I’d been after for years!

  1. I would recommend against using fancy networked filesystems (like GlusterFS) because they sometimes don’t provide expected consistency guarantees (like read-after-write, list-after-write). Not sure if Gluster has those, but I would stay away from that as much as possible.

  2. I would similarly recommend using good filesystems with bitrot protection (zfs, btrfs) and performing regular scrubs of data to ensure it’s still ok (this is critical, otherwise bitrot protection does not really do its job).

I did see this in the Kopia documentation recently! Maybe I missed it when I first started using Kopia? In any case, unfortunately, I noticed it well after I had a process/solution in place and several months’ worth of Kopia backups using this strategy.

For what it’s worth, I do have bitrot detection and scrubbing (see link if interested) enabled on my Gluster cluster, but point taken! I mentioned above that I didn’t like this cluster being a single point of failure, and although I didn’t say it explicitly, another equally significant factor in my decision-making was recently discovering the documentation’s advice against networked filesystems (that, too, made me uneasy about having Gluster in my “critical path”).

It may also be worth mentioning that, originally, when I began using that networked filesystem as the “source of truth” for backing up my devices, it was largely due to space constraints: I didn’t have the disk capacity to keep both my “working set” and its corresponding backup on the same machine(s). I’ve since increased each device’s capacity, and, as you read above, I’m revisiting how I “partition” my overall working set and backups. For all of those reasons combined, I have more options on the table now, and I’m looking to remove Gluster from being a primary part of the equation (though I might still use it for added redundancy).

  1. Relying on sync-to as a safety measure has an issue: you may be replicating data corruption (bitrot or bugs) present at the source, because sync-to just copies files and does not perform consistency checks as it does so.

Just to clarify, are you saying that sync-to isn’t worth using in any capacity?

As an example, on DESKTOP and LAPTOP, my crontab entry reads as follows:

*/30 * * * * kopia snapshot create $HOME && kopia snapshot verify --verify-files-percent=1

As an additional related question, does --verify-files-percent help alleviate the issues you’ve raised with sync-to around verifying consistency?

Overall, instead of synchronizing, I would generally use several independent repositories and back up to them individually, possibly on different schedules, as much as the bandwidth allows.

re “individually”: as you probably guessed, my hope in “sharing” a destination repository across N distinct source repositories was to make use of content addressing to save space and eliminate redundancy. But per your comment about the index format, it sounds like that’s not possible!

Just to ensure I’m following your advice: it sounds like you’re recommending a unique “destination repository” for each local/source repository, even where the majority of the data is shared? So, graphically:

  • DESKTOP (A) snapshots to its own local filesystem, then to a remote copy (A).
  • LAPTOP (B) snapshots to its own local filesystem, then to a remote copy (B).
  • Stated explicitly, I’m assuming A != B (each is unique, despite having largely the same data); I’ve sketched below what I think that looks like.
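
i.e. something like this (bucket names purely illustrative):

# On DESKTOP: snapshot to local repository A, then mirror A to DESKTOP's own offsite copy
kopia snapshot create $HOME
kopia repository sync-to b2 --bucket=desktop-offsite --key-id=... --key=...

# On LAPTOP: same pattern, but with its own local repository B and its own offsite copy
kopia snapshot create $HOME
kopia repository sync-to b2 --bucket=laptop-offsite --key-id=... --key=...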

Thanks again! :slightly_smiling_face:

If DESKTOP and LAPTOP share a lot of files (so you can greatly benefit from deduplication), then I would set up a single, dedicated local repository behind kopia server, and create three distinct users on that server: one global admin and two clients (desktop & laptop). Both machines then push snapshots to a single repository that sits on a reliable local filesystem like ZFS, and you use sync-to from the repository server for the offsite backups.

This way you protect your files not only from bitrot but also from a more realistic threat like ransomware overwriting your backups: with a dedicated repository server, the global admin can restrict DESKTOP & LAPTOP to APPEND-only access, so even if the original files get encrypted by ransomware, it can’t destroy previous snapshots.
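
Roughly like this (only a sketch; the server address, paths, and user names are placeholders, and the exact user/ACL commands are worth double-checking against the Kopia repository server docs):

# On the repository server (ZFS pool mounted at /tank):
kopia repository create filesystem --path=/tank/kopia-repo

# Register one user per client machine (prompts for that client's password)
kopia server user add you@desktop
kopia server user add you@laptop

# Enable per-user ACLs, then restrict the clients to append-only access
# so existing snapshots can't be deleted or overwritten
kopia server acl enable
kopia server acl add --user=you@desktop --access=APPEND --target=type=snapshot,username=you,hostname=desktop
kopia server acl add --user=you@laptop --access=APPEND --target=type=snapshot,username=you,hostname=laptop

# Run the server (TLS cert/key paths are placeholders)
kopia server start --address=0.0.0.0:51515 --tls-cert-file=/etc/kopia/server.crt --tls-key-file=/etc/kopia/server.key

# On DESKTOP and LAPTOP: connect to the server instead of a filesystem path
# (add --server-cert-fingerprint=... if the certificate is self-signed)
kopia repository connect server --url=https://repo-server:51515

# On the server, nightly: mirror the whole repository offsite
kopia repository sync-to b2 --bucket=my-offsite-bucket --key-id=... --key=...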

If you can’t set up a reliable filesystem on the repository server (a dedicated computer), you can use the venerable mtree utility (or its Go port, go-mtree, on GitHub), which checks the integrity of directories and files; run it periodically against the repository and you can be reasonably confident the repository is OK. For especially paranoid cases, one can even use the par2 utility to protect the repository against damage at the filesystem level.
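
For example, with go-mtree’s gomtree binary (BSD mtree is similar; keywords, paths, and the par2 redundancy level are just illustrative):

# Record a manifest of the repository directory, including sha256 digests
# (re-create it after each backup run, since new snapshots add files)
gomtree -c -K sha256digest -p /tank/kopia-repo > kopia-repo.mtree

# Validate the on-disk repository against the recorded manifest
gomtree -p /tank/kopia-repo -f kopia-repo.mtree

# Optionally, create par2 recovery data (~10% redundancy here) for selected files,
# so small corruptions can be repaired rather than just detected (file names are placeholders)
par2 create -r10 recovery.par2 file1 file2
par2 verify recovery.par2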

To check the consistency of the offsite backups that you feed with sync-to from the repository server, you can periodically connect to that remote repository and run verification over it to make sure it isn’t damaged.
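
For example (bucket name and credentials are placeholders):

# Connect to the synced copy in B2...
kopia repository connect b2 --bucket=my-offsite-bucket --key-id=... --key=...

# ...and spot-check a sample of file contents against their expected hashes
kopia snapshot verify --verify-files-percent=1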