Galen Guan

From Sync to Backup: The Missing Half of AI Agent Data Safety

Last week I wrote about the AI Agent Skills Multi-Machine Sync Problem: how to keep your agent's skills, prompts, and context identical across a MacBook, a Linux workstation, and a cloud VM. The answer was a three-layer architecture: built-in skills via agent updates, third-party skills via manifest + bootstrap, and personal skills via git + external_dirs.

That post covered the sync dimension thoroughly. But there is a second dimension that sync alone cannot address: what happens when the machine itself goes away?

Sync ensures consistency. Backup ensures survival. They are two halves of the same goal, data safety, and you need both.

The Gap in Our Architecture

After the sync architecture was in place, I reviewed what was actually protected versus what was not:

| Data             | Sync Covered?              | Backup Covered?         | Risk                                                  |
|------------------|----------------------------|-------------------------|-------------------------------------------------------|
| Skills (Layer 1) | Yes (agent update)         | No                      | Lost if machine dies                                  |
| Skills (Layer 2) | Yes (manifest + reinstall) | No                      | Reinstallable, but config may drift                   |
| Skills (Layer 3) | Yes (git push)             | Partially (git history) | Repo survives, but local edits not yet pushed are lost |
| config.yaml      | No                         | No                      | Total loss                                            |
| state.db         | No                         | No                      | Total loss: all session history gone                  |
| memories/        | No                         | No                      | Total loss: all learned preferences gone              |
| .env (API keys)  | No                         | No                      | Total loss: manual re-issuance                        |
| scripts/         | No                         | No                      | Total loss                                            |

Hermes does create state snapshots before updates (~/.hermes/state-snapshots/), but these are thin config copies — not full data archives — and they only trigger on agent version changes, not on a regular cadence.

For skills, the sync architecture works like RAID 1 between machines: if one copy gets corrupted, another machine still has it. Runtime state has no such mirror. If your cloud VM gets terminated, or you accidentally run rm -rf ~/.hermes, that state is simply gone, because the local machine holds the only copy.

Enter: hermes-agent-core-backup

The community skill hermes-agent-core-backup by art-solutions takes a refreshingly direct approach to the backup problem:

  1. A single shell script that ZIPs all critical Hermes state
  2. A cron job that runs it nightly
  3. Git push to a private remote repository

The script backs up: config.yaml, state.db (the SQLite brain), memories/, skills/, scripts/, .env, and an optional data directory. It produces timestamped ZIP files like backup_yourname_2026_05_04.zip and commits them to a private git repo.
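To make the archive step concrete, here is a minimal sketch of how such a ZIP could be built. It is not the skill's actual script: HERMES_HOME, BACKUP_USER, archive_name, and create_archive are names I made up for illustration, and the optional data directory is left out.

```shell
#!/bin/sh
# Sketch only: HERMES_HOME, BACKUP_USER and both function names are assumptions.
set -eu

HERMES_HOME="${HERMES_HOME:-$HOME/.hermes}"
BACKUP_USER="${BACKUP_USER:-yourname}"

# Print the archive name for a date stamp (default: today),
# in the backup_<user>_<YYYY_MM_DD>.zip shape described above.
archive_name() {
  printf 'backup_%s_%s.zip\n' "$BACKUP_USER" "${1:-$(date +%Y_%m_%d)}"
}

# Zip every listed path that exists under HERMES_HOME into directory $1 and
# print the path of the new archive. The body runs in a subshell, so the cd
# and the positional parameters do not leak into the caller.
create_archive() (
  dest_dir="$(cd "$1" && pwd)"   # absolute path, because we cd below
  out="$dest_dir/$(archive_name)"
  cd "$HERMES_HOME"
  set --                         # collect the existing items in "$@"
  for item in config.yaml state.db memories skills scripts .env; do
    if [ -e "$item" ]; then set -- "$@" "$item"; fi
  done
  [ "$#" -gt 0 ] || { echo "nothing to back up in $HERMES_HOME" >&2; exit 1; }
  rm -f "$out"                   # zip would otherwise update an existing file
  zip -qr "$out" "$@"
  printf '%s\n' "$out"
)
```

Calling create_archive /path/to/backup-repo would leave one dated ZIP in that directory; committing and pushing it is a separate step.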

What It Gets Right

Covers the runtime state gap. Skills in git repos are version-controlled, but state.db, memories, and .env are not. This script grabs everything critical into one archive.

Zero-logic, zero-dependency. Just zip + git. No database dump commands, no API calls, no cloud SDK. If your server has git and zip, it works.

Private repo requirement up front. The README explicitly states that .env containing API keys goes into the archive, so the remote MUST be private. This is a security consideration many DIY backup scripts skip.

Restore is three commands. Clone, unzip, copy. No special tooling needed.
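As a rough illustration of those three steps, the sketch below clones a backup repo, unzips the newest archive, and copies it into place. The function names, the HERMES_HOME variable, and the assumption that all archives share one backup_<user>_ prefix are mine, not the skill's.

```shell
#!/bin/sh
# Sketch only: HERMES_HOME and both function names are assumptions.
set -eu

HERMES_HOME="${HERMES_HOME:-$HOME/.hermes}"

# Print the newest backup_*.zip in directory $1 (empty line if none).
# Globs expand in sorted order and the YYYY_MM_DD stamp sorts by date,
# so the last match is the most recent archive.
latest_archive() (
  cd "$1"
  latest=""
  for f in backup_*.zip; do
    if [ -e "$f" ]; then latest="$f"; fi
  done
  printf '%s\n' "$latest"
)

# Restore from the backup repo URL given as $1.
restore_latest() (
  work="$(mktemp -d)"
  git clone --depth 1 "$1" "$work/repo"                # 1. clone
  zip_name="$(latest_archive "$work/repo")"
  [ -n "$zip_name" ] || { echo "no archive found" >&2; exit 1; }
  unzip -q "$work/repo/$zip_name" -d "$work/state"     # 2. unzip
  mkdir -p "$HERMES_HOME"
  cp -R "$work/state/." "$HERMES_HOME/"                # 3. copy, dotfiles included
  rm -rf "$work"
)
```

On a fresh machine this would be invoked as restore_latest followed by the URL of your private backup repo, then the agent restarted.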

What It Misses

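No incremental strategy. Each backup is a full ZIP. After 30 days you have 30 ZIPs in git, each potentially 50-200 MB. Git is not designed for binary blob storage: already-compressed archives barely delta-compress against each other, so every commit adds close to the full archive size and the repo bloats fast.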

No cleanup automation. The script adds new ZIPs but never removes old ones. Over time, the backup repo grows unboundedly. You need manual git rm or a separate cleanup cron.

No verification. The script creates and pushes the ZIP, but never validates that the ZIP can be extracted, that state.db is not corrupted, or that the remote actually received the push.

No differential awareness. It does not check whether anything changed since last backup. Even if you made zero changes, it creates a new full ZIP every night.
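One way such a check could look, purely as a sketch: fingerprint the files, compare against the fingerprint stored by the previous run, and skip the backup when they match. The function names and the stamp file are hypothetical. cksum is used because it is in POSIX; it is a CRC rather than a cryptographic hash, so sha256sum would be a stronger drop-in where available.

```shell
#!/bin/sh
# Sketch only: both function names and the stamp-file idea are assumptions,
# not part of the community skill.
set -eu

# Print one digest covering the path and content of every file under the
# given files or directories. Missing paths are skipped.
state_fingerprint() (
  for path in "$@"; do
    if [ -e "$path" ]; then find "$path" -type f; fi
  done | LC_ALL=C sort | while IFS= read -r file; do
    cksum "$file"
  done | cksum
)

# Usage: changed_since_last STAMP_FILE PATH...
# Succeeds and records the new fingerprint when it differs from the one in
# STAMP_FILE; fails (exit 1) when nothing changed.
changed_since_last() (
  stamp="$1"
  shift
  current="$(state_fingerprint "$@")"
  previous=""
  if [ -f "$stamp" ]; then previous="$(cat "$stamp")"; fi
  if [ "$current" = "$previous" ]; then exit 1; fi
  printf '%s\n' "$current" > "$stamp"
)
```

A nightly script could wrap its zip step in "if changed_since_last ...; then ... fi". Keep the stamp file outside the fingerprinted paths, and expect state.db to register as changed after almost any session.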

Skills are redundantly backed up. Layer 3 skills are already in a git repo. Layer 2 skills are restorable from a manifest. Backing up all skills again in a ZIP duplicates what git already handles better (with history, diffs, and selective revert).

Our Evaluation: Sync vs. Backup Across Six Dimensions

| Dimension        | Sync (Our 3-Layer)                  | Backup (Core-Backup Skill)                   | Ideal Combo                              |
|------------------|-------------------------------------|----------------------------------------------|------------------------------------------|
| Goal             | Multi-machine consistency           | Disaster recovery + point-in-time restore    | Both                                     |
| Granularity      | Per-file, per-skill                 | Entire .hermes/ as one blob                  | Fine-grained for skills, coarse for state |
| History          | Git history (diffs, reverts)        | Full snapshots (no diff between ZIPs)        | Git for text, snapshots for binary       |
| Incremental      | Yes (git is inherently incremental) | No (full ZIP every time)                     | Yes for skills, snapshots for state      |
| Secrets handling | .gitignore excludes .env            | .env included (private repo required)        | Separate secrets manager                 |
| Storage cost     | Low (git objects are compressed)    | High (binary blobs in a repo without Git LFS) | Git for code, archive for non-code       |

The key insight: sync and backup have different optimal storage formats. Skills are text files — git gives them diff history, selective revert, and branch semantics. Runtime state (SQLite, .env) is binary or sensitive — these need encrypted snapshots, not line-by-line diffs.

Closing the Gap: Our Integrated Strategy

Rather than adopting the backup skill as-is, I am integrating its strongest ideas into our existing architecture while preserving what already works.

What We Adopt

  1. Daily cron backup for runtime state. We need this. The script's approach of cron + zip + git push is simple and correct for state.db, memories, config.yaml, and .env.

  2. Private repo requirement for secrets. Any archive containing .env or API keys must live in a private repo. Period.

  3. Explicit restore procedure. Our current setup has no documented restore path. We will add one.

What We Modify

  1. Exclude skills from backup ZIPs. Skills are already version-controlled in git (Layer 3) or restorable from manifest (Layer 2). Including them in the ZIP adds redundant bulk to git. The backup should only cover data that git does not already protect.

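  2. Add retention cleanup. Keep the last N backups (configurable, default 7) and auto-prune older ones before pushing. Git repos with unpruned binary blobs grow without bound. One caveat: removing a ZIP in a new commit only removes it from the working tree, and the blob stays in the repository's history, so reclaiming space also requires rewriting or periodically resetting that history.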

  3. Add backup verification. After creating the ZIP, extract it to a temp directory and verify that state.db opens cleanly, config.yaml parses, and memories/ is non-empty. A backup that cannot be restored is worse than no backup — it gives false confidence.

  4. Separate the backup repo from the skills repo. Our personal-skills repo (Layer 3) has 198 skills. Mixing daily 50MB+ ZIPs into a skill repo would wreck clone time and search performance. The backup repo should be its own private repository.
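The verification step from item 3 could be sketched roughly as follows. verify_archive is a hypothetical helper, it assumes the unzip and sqlite3 command-line tools are installed, and it only checks that config.yaml is present and non-empty, since real YAML parsing needs a parser that plain sh does not have.

```shell
#!/bin/sh
# Sketch only: verify_archive is a hypothetical helper. It assumes the
# unzip and sqlite3 command-line tools are installed.
set -eu

# Extract archive $1 into a temp directory and run three checks.
# Exits non-zero, with a message on stderr, at the first failed check.
verify_archive() (
  tmp="$(mktemp -d)"
  trap 'rm -rf "$tmp"' EXIT
  unzip -q "$1" -d "$tmp"

  # 1. state.db exists and passes SQLite's own consistency check.
  #    The existence test matters: sqlite3 would silently create a
  #    missing file and then report "ok" for the empty database.
  [ -f "$tmp/state.db" ] || { echo "state.db missing" >&2; exit 1; }
  result="$(sqlite3 "$tmp/state.db" 'PRAGMA integrity_check;' 2>/dev/null || true)"
  [ "$result" = "ok" ] || { echo "state.db failed integrity_check" >&2; exit 1; }

  # 2. config.yaml is present and non-empty (no real YAML parsing here).
  [ -s "$tmp/config.yaml" ] || { echo "config.yaml missing or empty" >&2; exit 1; }

  # 3. memories/ contains at least one file.
  [ -n "$(find "$tmp/memories" -type f 2>/dev/null | head -n 1)" ] \
    || { echo "memories/ is empty" >&2; exit 1; }
)
```

In a backup script the call would sit between the zip step and the git push, so that an archive failing any check never reaches the remote.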

What We Keep Unchanged

  • 3-layer sync for skills. This is still the right architecture for keeping skills consistent across machines. The backup does not replace it.

  • Manifest-based Layer 2 restoration. On a new machine, install third-party skills from manifest, not from an old ZIP. This always gets you the latest version.

  • external_dirs for personal skills. Git diff and selective revert per skill are more useful than "restore my entire ~/.hermes/skills/ from last Tuesday."

The Complete Picture

Data Safety = Sync + Backup

┌─────────────────────────────────────────────────┐
│                  SYNC (Cross-Machine)           │
│  Layer 1: Built-in        → hermes update       │
│  Layer 2: Third-party     → manifest + bootstrap │
│  Layer 3: Personal skills → git + external_dirs  │
├─────────────────────────────────────────────────┤
│              BACKUP (Disaster Recovery)         │
│  Runtime state (db, config, .env, memories)    │
│  → Nightly cron: zip + verify + git push       │
│  → Retention: keep last 7, auto-prune          │
│  → Private repo, separate from skills          │
│  → Restore: clone → unzip → copy → restart     │
└─────────────────────────────────────────────────┘

Both layers are necessary. Sync gives you consistency — any machine can be your primary. Backup gives you survivability — even if every machine fails, your agent's brain can be reconstructed.

Implementation Notes

The backup script should live at ~/.hermes/scripts/backup_state.sh and be triggered by cron at 01:00 daily. Key design decisions:

  • ZIP only runtime state, not skills. Skills have their own sync mechanism.
  • Verify before push. Extract to /tmp, validate SQLite integrity, check file count.
  • Prune old backups. Keep the 7 most recent ZIPs, delete the rest before commit.
  • Separate repo. git@github.com:USER/hermes-state-backup.git — private, not the skills repo.
  • Idempotent restore. The restore script should work on a fresh machine with nothing but git and unzip installed.
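As an illustration of the pruning decision, the sketch below keeps the newest archives in a directory and deletes the rest. prune_backups and KEEP are hypothetical names, and the approach assumes every archive shares one backup_<user>_ prefix so that filename order equals date order.

```shell
#!/bin/sh
# Sketch only: prune_backups and KEEP are hypothetical names.
set -eu

KEEP="${KEEP:-7}"

# Delete all but the $KEEP newest backup_*.zip files in directory $1.
# Relies on the YYYY_MM_DD stamp in the name: glob order is date order.
prune_backups() (
  cd "$1"
  set -- backup_*.zip
  [ -e "$1" ] || exit 0          # the glob matched nothing
  while [ "$#" -gt "$KEEP" ]; do
    rm -f -- "$1"                # oldest remaining archive
    shift
  done
)

# A crontab line for the 01:00 schedule might look like this
# (install it with `crontab -e`):
#   0 1 * * * "$HOME/.hermes/scripts/backup_state.sh"
```

Running prune_backups on the backup repo's working tree right before the commit keeps the checked-out directory at seven archives.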

Lessons for the Ecosystem

The hermes-agent-core-backup skill highlighted a blind spot that most agent ecosystems share: we optimize for day-to-day usage patterns but neglect the disaster scenario. Every agent framework I surveyed — Claude Code, Cursor, Copilot, Codex CLI, OpenCode, Aider — focuses on syncing configuration across machines, but none has a built-in story for "my server died and I lost everything."

This is understandable. Sync is something users feel daily (different behavior on different machines). Backup is invisible until catastrophe strikes. But the longer you use an AI agent, the more irreplaceable state it accumulates — learned preferences, long-term memories, custom workflows. Losing that is worse than losing any single project's source code, because source code lives in git, but agent state often does not.

The solution is not novel. It is the same pattern that has protected production databases for decades: regular snapshots, off-site storage, verified restores. The community skill applies this pattern correctly. Our contribution is recognizing that backup and sync are complementary, not competing — and designing the two systems to cover different data types with their optimal storage format.

Sync for what changes. Backup for what matters. Both for what you cannot afford to lose.