Maintenance in high-density backup scenarios - Improving maintenance time

We use Kopia to back up 35 disks for disaster recovery purposes, with a backup taken every 10 minutes. For this, we have developed an interface on top of the Kopia Go library that implements the backup/restore process.

In this situation, we need to keep the backup time as low as possible so that the backups complete within the 10-minute window. One of the main contributors to the backup time is the first load of the Kopia repository. Once the repository is opened, the instance is cached. However, the agents that perform the backups are ephemeral and may restart at any time, and after a restart they need to re-open the repository.

The main delay when opening the repository comes from loading the index blobs for the uncompacted epochs. In our situation, we take 35 backups every 10 minutes, which produces roughly 210 index blobs per hour, or ~5000 in a 24-hour window.

We have adjusted the minimum epoch duration from the default 24 hours down to 1 hour, meaning that, ideally, there are fewer than 500 active index blobs at any time (about 210 for the active epoch and 210 for the previous uncompacted epoch).

One issue with the maintenance setup is that a full maintenance run often takes a long time, in most cases more than 3 hours. This is a problem because waiting for full maintenance to complete delays the rotation of the active epoch to a new epoch, so the index count grows and opening the repository takes longer if an agent is reset. Once the delayed epoch is rotated, it may still remain uncompacted until maintenance runs again (even if that run happens on time relative to the minimum epoch duration).

Taking a look at the maintenance code shows that snapshot garbage collection is run before index maintenance. The snapshot garbage collection step takes up the majority of the maintenance runtime, with epoch management being a much faster operation to complete.

In an ideal situation, snapshot garbage collection could be done independently of index management. That way, the epoch could be rotated as soon as the minimum duration has expired, and an older epoch could be compacted without waiting for garbage collection. This would keep the number of index blobs below 1000, as recommended by Kopia.

I am interested to hear your thoughts on the situation and any potential solutions.