[DESIGN] Additional features for snapshot creation

I’ve been thinking about expanding the snapshot creation feature (both CLI and Snapshot).
There are a couple of important scenarios I’d like to factor in, ideally without breaking the CLI.

Snapshot of data piped through stdin (single file)

This was requested by several folks here and on GitHub. I’m thinking something like:

mysqldump .... | kopia snapshot create --stdin-file=/some/virtual/path

Kopia would upload output of mysqldump and record snapshot manifest as if it came from file /some/virtual/path.

Snapshot from data piped through stdin (directory in a TAR format)

This is similar to #1, but the input contains a TAR-formatted directory stream:

tar cf - something/ | kopia snapshot create --stdin-tar=/some/path

This would save the (directory) manifest as /some/path.

Problems with standard input

There is, however, a problem with stdin. When we run kopia snapshot create --all, Kopia can’t really snapshot /some/path in the example above because it was virtual to begin with; we would need to re-run the original command, but Kopia never had knowledge of it.

We can either:

a) put a special marker on stdin-sourced snapshots and ignore them when --all is passed.

b) remember the command that produced data to snapshot and run it ourselves

If kopia were to launch extra commands before/after snapshotting, it would open up a lot of interesting new cases, including:

  1. creating FS-specific consistent snapshots (e.g. ZFS) and unmounting them at the end of snapshots.
  2. launching external backup tools (mysqldump)
  3. custom notifications scripts

I have 2 questions:

  1. Is it a good idea to allow Kopia to launch external commands, or should we rely on outside scripting for that?

  2. How do we specify the commands to run? I’m thinking using policies, where you can attach scripts to each file/directory that run before/after the file/directory is snapshotted or to provide content:

The before script would be able to:

a) provide data (mysqldump or tar) to upload by writing it to stdout
b) provide alternative location of data (file or directory), so the script can take a snapshot and mount it somewhere and redirect Kopia to read from the mounted dir.
c) do something unrelated (send notifications, etc.)

There will be a corresponding after script that can clean up after the before script.

For example:

kopia policy set /sqldump --data-provider-command "mysqldump ..."
kopia snapshot create /sqldump
kopia policy set /some/dir \
   --before-command "zfs snapshot ... && zfs mount && echo REDIRECT /new/dir" \
   --after-command "zfs unmount ..."

kopia snapshot create /some/dir

There are obviously a ton more details to figure out, but I’d love to hear your reaction and any more suggestions.

That’s indeed a really good idea, as this is an issue I have been thinking about myself. There are always pros and cons to allowing any service to run some other external command, but I like the layout you suggested, especially the policies part…

I’ve been thinking about this feature some more and would like to define it more precisely. I think we can simplify the whole experience if we focus on a non-streaming case (at least initially, if there’s a strong case for streaming data from stdin we can extend this in the future).

We will allow users to provide 2 scripts that can be attached to each directory via policy: before and after, which will run before and after processing a particular directory respectively.

Each script must be a single file (perhaps a binary executable in the future) stored inside a repository or outside of it. If inside, the script is identified by its object ID/hash (e.g. 2ffb7ccdfdc68a3d28e3df3f9eda9795). If stored outside of the repository, it’s a file name and the user must ensure that the script will be available to each invocation. Each script will be executed using OS-specific shell (/bin/sh, cmd.exe, etc.).

To set the script for a directory we can use kopia policy set

For example:

kopia policy set /some/dir --before-script /path/to/file.sh
kopia policy set /some/dir --before-script-object 2ffb7ccdfdc68a3d28e3df3f9eda9795
kopia policy set /some/dir --after-script /path/to/file.sh
kopia policy set /some/dir --after-script-object 3ffb7ccdfdc68a3d28e3df3f9eda9795

‘Before’ Script

The ‘Before’ script parameters will be passed through environment variables:

  • KOPIA_SNAPSHOT_ID - unique snapshot ID (random 64-bit number that can be assumed to be locally unique)
  • KOPIA_SOURCE_PATH - path to a directory that’s being snapshotted
  • KOPIA_WORK_DIR - per-invocation directory that can be used by the script to store extra data, and will be passed to the after script as well.

The script must prepare the directory KOPIA_SOURCE_PATH for snapshotting. There are two options here:

  • in-place snapshot - where the script creates additional files in the KOPIA_SOURCE_PATH directory (for example by running commands to dump SQL database data, TAR some files, etc.) and tells Kopia to snapshot them

  • mounted snapshot - where the script creates a snapshot (such as ZFS snapshot) and mounts it at another local filesystem location redirecting Kopia to read from it.

In case of mounted snapshot, the script must signal to Kopia the directory to be processed by printing to STDOUT a line containing KOPIA_SNAPSHOT_PATH <local-path>

After running the script, Kopia will parse its stdout looking for this marker and will proceed to snapshot all the files in that directory recursively, as if they had been read from KOPIA_SOURCE_PATH. It will also recursively apply policies, etc., including possibly running additional scripts for child entries.
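To make the marker handling concrete, here is a minimal sketch of how the stdout of a ‘before’ script could be scanned. It assumes the space-separated form described above (KOPIA_SNAPSHOT_PATH <local-path>); the helper name parse_snapshot_path and the choice to use the last marker line are my own assumptions, not anything Kopia defines.

```shell
#!/bin/sh
# Hypothetical helper: read a 'before' script's stdout on stdin and print the
# path from the last "KOPIA_SNAPSHOT_PATH <local-path>" line, if any.
parse_snapshot_path() {
    sed -n 's/^KOPIA_SNAPSHOT_PATH[[:space:]][[:space:]]*//p' | tail -n 1
}

# Simulated 'before' output with unrelated lines around the marker.
before_output='creating snapshot...
KOPIA_SNAPSHOT_PATH /mnt/12345
done'

snapshot_path=$(printf '%s\n' "$before_output" | parse_snapshot_path)

# With no marker, fall back to the original source path (in-place snapshot).
echo "${snapshot_path:-${KOPIA_SOURCE_PATH:-}}"
```

Lines that do not start with the marker are ignored, so the script remains free to print progress messages.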

‘After’ Script

‘After’ script will be invoked after processing the directory (regardless of whether ‘before’ script was specified and/or succeeded) and will be passed the same parameters plus KOPIA_SNAPSHOT_PATH which will come from the output of the ‘before’ script:

  • KOPIA_SNAPSHOT_ID - unique snapshot ID (random 64-bit number that can be assumed to be locally unique)
  • KOPIA_SOURCE_PATH
  • KOPIA_WORK_DIR
  • KOPIA_SNAPSHOT_PATH
  • KOPIA_SNAPSHOT_START_TIME // timestamp when the snapshot was started (RFC3339)
  • KOPIA_SNAPSHOT_END_TIME // timestamp when the snapshot was completed (RFC3339)
  • KOPIA_SNAPSHOT_BYTES // size of snapshotted data in bytes
  • KOPIA_SNAPSHOT_UPLOADED // size of uploaded data in bytes
  • KOPIA_SNAPSHOT_ERRORS // number of errors
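The lifecycle could be emulated with a small wrapper, sketched below purely for illustration. It is not Kopia code: run_with_hooks is a hypothetical name, the process ID stands in for the random snapshot ID, the snapshot step is a placeholder echo, and redirect handling and the statistics variables are left out.

```shell
#!/bin/sh
# Hypothetical wrapper emulating the proposed hook lifecycle.
# Usage: run_with_hooks BEFORE_SCRIPT AFTER_SCRIPT SOURCE_PATH
run_with_hooks() {
    before=$1
    after=$2
    KOPIA_SOURCE_PATH=$3
    KOPIA_SNAPSHOT_ID=$$                  # stand-in for the random 64-bit ID
    KOPIA_WORK_DIR=$(mktemp -d)           # per-invocation scratch directory
    KOPIA_SNAPSHOT_PATH=$KOPIA_SOURCE_PATH  # redirect handling omitted here
    export KOPIA_SOURCE_PATH KOPIA_SNAPSHOT_ID KOPIA_WORK_DIR KOPIA_SNAPSHOT_PATH

    status=0
    /bin/sh "$before" || status=$?

    if [ "$status" -eq 0 ]; then
        # Placeholder for the actual snapshot step.
        echo "would snapshot $KOPIA_SNAPSHOT_PATH as $KOPIA_SOURCE_PATH"
    fi

    # 'after' runs regardless of whether 'before' succeeded.
    /bin/sh "$after" || true
    rm -rf -- "$KOPIA_WORK_DIR"
    return "$status"
}
```

Because the variables are exported before either script runs, both scripts see the same KOPIA_SNAPSHOT_ID and KOPIA_WORK_DIR.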

The ‘after’ script is supposed to remove any files/directories created by the ‘before’ script, unmount any snapshots, etc. It must be idempotent (it may be potentially invoked more than once and should not fail on subsequent invocation, e.g. if the source path does not exist).
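One possible way to get idempotent cleanup for in-place snapshots is to have the ‘before’ script record everything it creates in KOPIA_WORK_DIR and have the ‘after’ script delete exactly those entries. The sketch below assumes a list file named created.txt; that name and both helper functions are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers shared by a 'before'/'after' script pair.

# Called from 'before': remember a file that the script created.
record_created() {
    printf '%s\n' "$1" >> "$KOPIA_WORK_DIR/created.txt"
}

# Called from 'after': remove recorded files; safe to run more than once.
cleanup_created_files() {
    list="$KOPIA_WORK_DIR/created.txt"
    # Nothing recorded, or already cleaned up: succeed quietly.
    [ -f "$list" ] || return 0
    while IFS= read -r created; do
        rm -f -- "$created"    # -f: no error if the file is already gone
    done < "$list"
    rm -f -- "$list"
}

# When run as an 'after' script, Kopia would have set KOPIA_WORK_DIR.
if [ -n "${KOPIA_WORK_DIR:-}" ]; then
    cleanup_created_files
fi
```

A second invocation finds no list file and returns success, which matches the requirement that repeated runs must not fail.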

Examples:

SQL Snapshotting (in-place snapshot):

#!/bin/sh
set -e
# Quote the path so directories containing spaces work.
mysqldump SomeDatabase --result-file="$KOPIA_SOURCE_PATH/dump.sql"

ZFS Snapshot (mounted snapshot)

Before:

#!/bin/sh
set -e
zfs snapshot -r "Volumes@$KOPIA_SNAPSHOT_ID"
mkdir -p "/mnt/$KOPIA_SNAPSHOT_ID"
mount -t zfs "Volumes@$KOPIA_SNAPSHOT_ID" "/mnt/$KOPIA_SNAPSHOT_ID"
# Tell Kopia where to read from, using the marker format described above.
echo "KOPIA_SNAPSHOT_PATH /mnt/$KOPIA_SNAPSHOT_ID"

After:

#!/bin/sh
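# Idempotent: ignore failures if already unmounted or removed.
umount "/mnt/$KOPIA_SNAPSHOT_ID" 2>/dev/null || true
rmdir "/mnt/$KOPIA_SNAPSHOT_ID" 2>/dev/null || true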

Notification script (‘after’):

#!/bin/bash

echo "Snapshot of $KOPIA_SOURCE_PATH finished at $KOPIA_SNAPSHOT_END_TIME and uploaded $KOPIA_SNAPSHOT_UPLOADED bytes with $KOPIA_SNAPSHOT_ERRORS errors" | mail -s "Kopia snapshot report" root

Please let me know what you think and try to poke holes in this idea.


Very clean - I like this approach. At this time, I cannot see any apparent errors. That may change once this feature is available for testing and people start pounding on it, but I doubt it.

@jkowalski It appears that you are going along the lines of Pacman’s hooks. Arch Linux’s package manager has an almost identical feature where you specify “pre-” or “post-” transaction “hooks”, which are effectively paths to an executable/script that runs when a specific package is changed. I feel like your approach is very similar, and having used Arch/Pacman for years, I think it works reasonably well. You could perhaps borrow some of these ideas for Kopia.