Bzfs 1.13.0 – 1‑second (even sub‑second) ZFS replication across fleets


bzfs is a simple, reliable CLI for replicating ZFS snapshots (zfs send/receive) locally or over SSH. Its companion, bzfs_jobrunner, turns that into periodic snapshot/replication/pruning jobs across N source hosts and M destination hosts, driven by one versioned job config.

This release makes 1‑second replication frequency practical for small incrementals, and even sub‑second frequency possible in constrained setups (low RTT, few datasets, daemon mode).

v1.13.0 focuses on cutting per‑iteration latency — the enemy of high‑frequency replication at fleet scale:

- SSH reuse across datasets and on startup: fewer handshakes and fewer round-trips, which is where small incremental sends spend much of their time (see the sketch after this list).
- Earlier stream start: estimate "bytes to send" in parallel so the data path can open sooner instead of blocking on preflight.
- Smarter caching: faster snapshot list hashing and shorter cache paths to reduce repeated ZFS queries in tight loops.
- More resilient connects: retry the SSH control path briefly before failing to smooth over transient blips.
- Cleaner ops: normalized exit codes; suppress "Broken pipe" noise when a user kills a pipeline.
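bzfs manages its SSH sessions internally; the mechanism behind connection reuse is standard OpenSSH multiplexing. A rough illustration of that mechanism only, not of bzfs's internals (host, dataset, and socket path are placeholders):

```bash
# First connection pays the full handshake and leaves a control socket behind for 60 s.
ssh -o ControlMaster=auto \
    -o ControlPath="$HOME/.ssh/cm-%r@%h-%p" \
    -o ControlPersist=60 \
    user@host true

# Later commands multiplex over the live master: no key exchange or auth round-trips,
# just a new channel on the already-open TCP connection.
ssh -o ControlPath="$HOME/.ssh/cm-%r@%h-%p" user@host \
    zfs list -H -t snapshot -o name pool/src/ds
```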

Why this matters

- At 1s cadence, fixed costs (session setup, snapshot enumeration) dominate. Shaving RTTs and redundant `zfs list` calls yields bigger wins than raw throughput (a back-of-envelope sketch follows this list).
- For fleets, the tail matters: reducing per-job jitter and startup overhead improves end-to-end staleness when multiplied by N×M jobs.
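As a back-of-envelope illustration (the numbers below are made up for illustration, not measurements):

```bash
# Illustrative only: at a 1 s cadence, each avoided round-trip on a 20 ms link
# buys back 2% of the per-iteration budget before a single byte has moved.
rtt_ms=20            # assumed link round-trip time
roundtrips_saved=4   # e.g. handshake plus redundant 'zfs list' calls avoided
budget_ms=1000       # one iteration at 1 s cadence
saved_ms=$(( rtt_ms * roundtrips_saved ))
echo "saved ${saved_ms} ms of a ${budget_ms} ms budget ($(( 100 * saved_ms / budget_ms ))%)"
```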

1-second (and sub-second) replication

- Use daemon mode to avoid per-process startup costs; keep the process hot and loop at `--daemon-replication-frequency` (e.g., `1s`, or even `100ms` in constrained cases). The sketch after this list shows the per-iteration overhead this avoids.
- Reuse SSH connections (now the default) to avoid handshakes even for new processes.
- Keep per-dataset snapshot counts low and prune aggressively; fewer entries make `zfs list -t snapshot` faster.
- Limit scope to only the datasets that truly need the cadence (filters like `--exclude-dataset`, `--skip-parent`).
- In fleets, add small jitter to avoid thundering herds, and cap workers to match CPU, I/O, and link RTT.
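For contrast, this is roughly what daemon mode replaces: a shell loop that re-pays process startup, config parsing, and connection setup on every iteration (dataset and host names are placeholders):

```bash
# Anti-pattern at a 1 s cadence: every iteration forks a fresh process and
# pays interpreter startup and session setup before any data moves.
while true; do
  start=$(date +%s%N)
  bzfs user@host:pool/src/ds pool/backup/ds
  echo "iteration took $(( ( $(date +%s%N) - start ) / 1000000 )) ms"
  sleep 1
done
```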

How it works (in a nutshell)

- Incremental sends start from the latest common snapshot; bookmarks are supported for safety and reduced state (a simplified manual equivalent is sketched after this list).
- Persistent SSH sessions are reused across datasets/zpools and across runs to avoid handshake/exec overhead.
- Snapshot enumeration uses a cache to avoid re-scanning when nothing has changed.
- Job orchestration via bzfs_jobrunner: the same config file runs on all hosts; add jitter to avoid thundering herds; set worker counts/timeouts for scale.
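bzfs automates all of this, plus the safety checks. Purely for intuition, a hand-rolled equivalent of "incremental send from the latest common snapshot" to a remote destination might look like the following, assuming placeholder dataset/host names and timestamp-style snapshot names that sort chronologically:

```bash
SRC=pool/src/ds       # placeholder source dataset
DST=pool/backup/ds    # placeholder destination dataset
HOST=user@host        # placeholder destination host

# Snapshot names only, oldest to newest (assumes names sort chronologically).
src_snaps=$(zfs list -H -t snapshot -o name -s createtxg "$SRC" | sed 's|.*@||')
dst_snaps=$(ssh "$HOST" zfs list -H -t snapshot -o name -s createtxg "$DST" | sed 's|.*@||')

# Newest snapshot present on both sides, and the newest snapshot on the source.
common=$(comm -12 <(echo "$src_snaps" | sort) <(echo "$dst_snaps" | sort) | tail -n 1)
latest=$(echo "$src_snaps" | tail -n 1)

# Send only the delta between the common snapshot and the newest source snapshot.
zfs send -i "@$common" "$SRC@$latest" | ssh "$HOST" zfs receive "$DST"
```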

High-frequency tips

- Prune at a frequency proportional to snapshot creation to keep enumerations fast.
- Use daemon mode; split snapshot/replicate/prune into dedicated loops.
- Add small random start jitter across hosts to reduce cross-fleet contention (see the sketch after this list).
- Tune jobrunner `--workers` and per-worker timeouts for your I/O and RTT envelope.
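bzfs_jobrunner has its own jitter support (see its README); the idea is simply a small, stable per-host offset so a fleet's jobs do not all fire in the same instant. A minimal sketch with a hypothetical 5-second spread:

```bash
# Derive a stable per-host offset in [0, 5) seconds from the hostname, so it
# survives reboots but differs across the fleet; then start the periodic job.
max_jitter=5
offset=$(( $(hostname | cksum | cut -d' ' -f1) % max_jitter ))
sleep "$offset"
# ... launch the snapshot/replicate/prune loop for this host here
```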

Quick examples

- Local replicate: `bzfs pool/src/ds pool/backup/ds`
- Pull from a remote host: `bzfs user@host:pool/src/ds pool/backup/ds`
- Jobrunner (periodic): run the shared jobconfig in daemon mode for a 1s cadence: `... --replicate --daemon-replication-frequency 1s` (sub-second values such as `100ms` are possible in constrained setups). Use separate daemons for `--create-src-snapshots`, `--replicate`, and `--prune-`.

Links

- Code and docs: https://github.com/whoschek/bzfs
- README: quickstart, filters, safety flags, examples
- Jobrunner README: multi-host orchestration, jitter, daemon mode, frequencies
- 1.13.0 diff: https://github.com/whoschek/bzfs/compare/v1.12.0...v1.13.0

Notes

- Standard tooling only (ZFS/Unix and Python); no extra runtime deps.

I'd love performance feedback from folks running 1s or sub-second replication across multiple datasets/hosts: per-iteration wall time, number/size of incremental snapshots, dataset counts, and link RTTs all help contextualize results.

Happy to answer questions!