Galen Guan

Hermes Kanban: A Durable Task Board for Multi-Agent Collaboration — Architecture, Internals, and Competitive Landscape

AI agents are moving from solo operators to team collaborators. When you need two researchers running in parallel, an analyst synthesizing their output, then a writer drafting the brief — who coordinates whom? Who handles crash retries? What happens when a human wants to weigh in?

Hermes Kanban is the system that answers these questions. This post takes it apart from scratch: core concepts, state machine and concurrency model, dependency engine, human-in-the-loop mechanism, and how to integrate it into a third-party platform like shujietai. A six-way comparison with mainstream orchestration systems rounds out the picture.

Why a Board, Not Subagents?

Hermes already has delegate_task for subagent calls. Kanban solves a fundamentally different problem:

| Dimension | delegate_task | Kanban |
| --- | --- | --- |
| Shape | RPC call (fork → join) | Durable message queue + state machine |
| Crash recovery | None: failed is failed | Block → unblock → re-run; crash → reclaim |
| Human in the loop | Not supported | Comment / unblock at any point |
| Cross-agent audit | Lost on context compression | Durable rows in SQLite forever |
| Coordination | Hierarchical (caller → callee) | Peer: any profile reads/writes any task |

One-sentence distinction: delegate_task is a function call; Kanban is a work queue where every handoff is a row any profile (or human) can see and edit.

[Figure: Hermes Kanban Architecture Overview]

Core Concepts

Task

The minimal unit of work. A single SQLite row with title, body, assignee (profile name), status, optional tenant namespace, and idempotency key (dedup for retried automation).

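Seven statuses: triage, todo, ready, running, blocked, done, archived. They do not form a strict linear sequence: blocked is a side branch off running that returns to ready once the task is unblocked.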

Link

A task_links row recording a parent → child dependency. The dispatcher promotes todo → ready automatically when all parents reach done. No manual coordination — the engine just runs.
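The promotion rule can be sketched with an in-memory SQLite database. This is a minimal illustration, not the actual kanban_db code: the two-table schema and the column names parent_id and child_id are assumptions made for the example, and only the table name task_links and the status values come from the text.

```python
import sqlite3


def promote_ready(conn: sqlite3.Connection) -> int:
    """Promote todo tasks whose parents are all done; return how many changed.

    Note: in this sketch a todo task with no parents is promoted as well.
    """
    cur = conn.execute(
        """
        UPDATE tasks SET status = 'ready'
        WHERE status = 'todo'
          AND NOT EXISTS (
              SELECT 1
              FROM task_links l
              JOIN tasks p ON p.id = l.parent_id
              WHERE l.child_id = tasks.id AND p.status != 'done'
          )
        """
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema and data: t_c depends on t_a; t_d depends on t_a and t_b.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE task_links (parent_id TEXT NOT NULL, child_id TEXT NOT NULL);
    INSERT INTO tasks VALUES
        ('t_a', 'done'), ('t_b', 'running'), ('t_c', 'todo'), ('t_d', 'todo');
    INSERT INTO task_links VALUES
        ('t_a', 't_c'), ('t_a', 't_d'), ('t_b', 't_d');
    """
)

promoted = promote_ready(conn)
statuses = dict(conn.execute("SELECT id, status FROM tasks"))
```

Here t_c becomes ready because its only parent is done, while t_d stays in todo until t_b also finishes.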

Comment

The inter-agent protocol. Agents and humans append comments; when a worker is (re-)spawned it reads the full comment thread as part of its context.

Workspace

The directory a worker operates in. Three kinds:

  • scratch (default) — fresh tmp dir, GC'd when the task is archived
  • dir: — shared persistent directory (Obsidian vault, ops dir, per-account folder). Must be absolute.
  • worktree — a git worktree for coding tasks

Board

A standalone queue with its own SQLite DB, workspace directory, and dispatcher loop. One install can have many boards; they are absolutely isolated — a worker spawned for board atm10-server physically cannot see tasks on the default board.

Tenant

Soft namespace within a board. One specialist fleet can serve multiple businesses. Workers prefix memory writes with their tenant tag so context doesn't leak.

Dispatcher

A long-lived loop that, every N seconds (default 60), reclaims stale claims, reclaims tasks from crashed workers, promotes todo tasks whose parents are all done to ready, atomically claims ready tasks, and spawns the assigned profiles. Runs inside the gateway by default.

Circuit breaker: after ~5 consecutive spawn failures on the same task the dispatcher auto-blocks it with the last error — prevents thrashing on tasks whose profile doesn't exist or workspace won't mount.
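The circuit breaker amounts to a consecutive-failure counter per task. The sketch below is one way it could look; the TaskState class, its field names, and the exact threshold of 5 are assumptions for illustration (the text only says "~5"), not the dispatcher's real data model.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SPAWN_FAILURES = 5  # assumed exact threshold; the text says "~5"


@dataclass
class TaskState:
    """Hypothetical in-memory view of one task's spawn bookkeeping."""
    status: str = "ready"
    spawn_failures: int = 0
    block_reason: Optional[str] = None


def record_spawn_result(task: TaskState, error: Optional[str]) -> TaskState:
    """Update the task after a spawn attempt; error=None means the spawn worked."""
    if error is None:
        task.spawn_failures = 0  # a success resets the consecutive count
        task.status = "running"
        return task
    task.spawn_failures += 1
    if task.spawn_failures >= MAX_SPAWN_FAILURES:
        task.status = "blocked"      # stop retrying and surface the problem
        task.block_reason = error    # keep the last error for the human
    return task


# Usage: four failures leave the task retryable, the fifth blocks it.
task = TaskState()
for attempt in range(4):
    record_spawn_result(task, f"profile not found (attempt {attempt + 1})")
status_after_four = task.status
record_spawn_result(task, "profile not found (attempt 5)")
```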

[Figure: Kanban State Machine & Concurrent Claim Flow]

State Machine and Concurrency Model

Status Transitions

triage ──→ todo ──→ ready ──→ running ──→ done
  │          ↑         ↑         │    ↓
  └──────────┘         │      blocked ──→ (unblock) → ready
                       │         │
                  (all parents    └──→ each retry creates a new run
                   done)

The blocked → unblock → ready path is the key human-in-the-loop mechanism. A worker calls kanban_block(reason="..."), a human unblocks via dashboard or CLI, and the dispatcher respawns on the next tick.
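One reading of the diagram can be written down as a transition table. This is an interpretation for illustration, not the table kanban_db actually uses: the running → ready edge (reclaim after a crash) and the done → archived edge are assumptions, since neither is drawn explicitly.

```python
# Allowed status transitions as read off the diagram (an interpretation).
TRANSITIONS = {
    "triage": {"todo"},
    "todo": {"ready"},                        # promoted once all parents are done
    "ready": {"running"},                     # atomic claim by the dispatcher
    "running": {"done", "blocked", "ready"},  # "ready" = reclaim after a crash (assumed)
    "blocked": {"ready"},                     # human unblock
    "done": {"archived"},                     # assumed; not drawn in the diagram
    "archived": set(),
}


def can_transition(src: str, dst: str) -> bool:
    """Return True if the sketch allows moving a task from src to dst."""
    return dst in TRANSITIONS.get(src, set())


# The human-in-the-loop path: a worker blocks, a human unblocks, a new run finishes.
hitl_path = ["running", "blocked", "ready", "running", "done"]
hitl_path_is_valid = all(
    can_transition(a, b) for a, b in zip(hitl_path, hitl_path[1:])
)
```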

Atomic Claims with WAL Concurrency

SQLite runs in WAL mode + BEGIN IMMEDIATE write transactions + compare-and-swap (CAS) updates on tasks.status and tasks.claim_lock. SQLite serializes writers via its WAL lock — at most one claimer wins any given task. Losers observe zero affected rows and move on. No distributed locks, no retry loops.

Claim TTL defaults to 15 minutes. Workers that outlive this window should call kanban_heartbeat() periodically.
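The timing logic can be sketched as two small predicates. This assumes that a heartbeat refreshes the claim's timestamp, which the text implies but does not spell out; the function names and the safety factor of 3 are illustrative choices, not Hermes settings.

```python
CLAIM_TTL_SECONDS = 15 * 60  # default claim TTL from the text


def claim_expired(last_heartbeat: float, now: float,
                  ttl: float = CLAIM_TTL_SECONDS) -> bool:
    """Dispatcher side: a claim is stale once the TTL has fully elapsed."""
    return now - last_heartbeat > ttl


def heartbeat_due(last_heartbeat: float, now: float,
                  ttl: float = CLAIM_TTL_SECONDS,
                  safety_factor: int = 3) -> bool:
    """Worker side: heartbeat well before the TTL runs out.

    With the defaults this asks for a heartbeat every 5 minutes, leaving
    room for two missed beats before the claim could be reclaimed.
    """
    return now - last_heartbeat >= ttl / safety_factor
```

A long-running worker would check heartbeat_due between units of work and call kanban_heartbeat() when it returns True.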

Runs — One Row Per Attempt

A task is a logical unit of work; a run is one attempt to execute it. When the dispatcher claims a ready task, it creates a task_runs row and points tasks.current_run_id at it. When the attempt ends (completed, blocked, crashed, timed out, spawn-failed, reclaimed), the run closes with an outcome.

A task attempted three times has three task_runs rows. Full attempt history for postmortems — "the second reviewer attempt approved, the third merged."
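The task/run split can be illustrated with two small dataclasses. The class and method names are hypothetical; only task_runs, current_run_id, and the outcome values come from the text. Whether current_run_id is cleared when a run closes is not stated, so this sketch leaves it pointing at the latest run.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    """One attempt at a task (stands in for a task_runs row)."""
    run_id: int
    outcome: Optional[str] = None  # None while the attempt is still open


@dataclass
class Task:
    task_id: str
    runs: List[Run] = field(default_factory=list)
    current_run_id: Optional[int] = None

    def start_run(self) -> Run:
        """Open a new attempt and point current_run_id at it."""
        run = Run(run_id=len(self.runs) + 1)
        self.runs.append(run)
        self.current_run_id = run.run_id
        return run

    def close_run(self, outcome: str) -> None:
        """Close the current attempt with its outcome."""
        self.runs[self.current_run_id - 1].outcome = outcome


# Three attempts produce three rows of history for the same task.
task = Task("t_abcd")
for outcome in ("crashed", "blocked", "completed"):
    task.start_run()
    task.close_run(outcome)

history = [(run.run_id, run.outcome) for run in task.runs]
```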

Worker Lifecycle

Workers don't shell out to hermes kanban. The dispatcher sets HERMES_KANBAN_TASK=t_abcd in the child's env, which flips on a dedicated kanban toolset — seven tools that read and mutate the board directly via the Python kanban_db layer.

| Tool | Purpose |
| --- | --- |
| kanban_show | Read current task (title, body, parent handoffs, prior attempts, comments) |
| kanban_complete | Finish with structured summary + metadata handoff |
| kanban_block | Escalate for human input with a reason |
| kanban_heartbeat | Signal liveness during long operations |
| kanban_comment | Append a durable note to the task thread |
| kanban_create | (Orchestrators) fan out child tasks |
| kanban_link | (Orchestrators) add dependency edge after the fact |

A typical worker turn:

# 1. Read the task
kanban_show()
# 2. Do real work (terminal/file/code tools)...
kanban_heartbeat(note="4 of 8 files transformed")
# 3. Complete with structured handoff
kanban_complete(
    summary="migrated to token-bucket; 14 tests pass",
    metadata={"changed_files": ["limiter.py"], "tests_run": 14},
)

The structured handoff is the key innovation: summary is for humans and downstream agents, metadata is machine-readable JSON that downstream workers consume directly via kanban_show()'s worker_context field — no prose parsing required.
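To show why machine-readable metadata helps, here is how a downstream worker might use it. The shape of worker_context below (a "parents" list whose entries carry summary and metadata) is an assumption made for this example; the text only states that the field exists and carries the parents' handoffs.

```python
from typing import Any, Dict, List


def collect_changed_files(worker_context: Dict[str, Any]) -> List[str]:
    """Gather changed_files from every parent handoff, de-duplicated in order."""
    files: List[str] = []
    for parent in worker_context.get("parents", []):
        for path in parent.get("metadata", {}).get("changed_files", []):
            if path not in files:
                files.append(path)
    return files


# Hypothetical context, as a reviewer task with two finished parents might see it.
example_context = {
    "parents": [
        {
            "task_id": "t_abcd",
            "summary": "migrated to token-bucket; 14 tests pass",
            "metadata": {"changed_files": ["limiter.py"], "tests_run": 14},
        },
        {
            "task_id": "t_efgh",
            "summary": "documented the new limiter",
            "metadata": {"changed_files": ["README.md", "limiter.py"]},
        },
    ]
}

changed = collect_changed_files(example_context)
```

The reviewer gets a concrete file list without having to extract it from the prose summaries.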

Eight Collaboration Patterns

| Pattern | Shape | Example |
| --- | --- | --- |
| Fan-out | N siblings, same role | "research 5 angles in parallel" |
| Pipeline | role chain: scout → editor → writer | daily brief assembly |
| Voting / quorum | N siblings + 1 aggregator | 3 researchers → 1 reviewer picks |
| Long-running journal | same profile + shared dir + cron | Obsidian vault accumulation |
| Human-in-the-loop | worker blocks → human comments → unblock | ambiguous decisions |
| @mention | inline routing from prose | @reviewer look at this |
| Thread-scoped workspace | /kanban here in a thread | per-project gateway threads |
| Fleet farming | one profile, N subjects | 50 social accounts |

Three Surfaces, One DB

The same kanban_db layer backs three front doors:

  1. Agent tool calls: kanban_* tools inside worker processes
  2. CLI / slash commands: hermes kanban create ... or /kanban list in chat
  3. Dashboard GUI: drag-and-drop, inline create, bulk actions, WebSocket live updates

All three agree by construction — writes go through the same kanban_db code path, so they can never drift.

Integrating with Third-Party Platforms: The shujietai Case

ShuJieTai (数据台) is a FastAPI + Vue 3 AI agent orchestration platform with its own dispatch layer (dispatch_tasks + dispatch_events tables, DispatchWorkerPool, WebSocket streaming). How does Kanban integrate?

Integration Architecture

┌─────────────────────┐      REST API
│  shujietai frontend  │ ◀───────────────┐
│  Vue 3 + WebSocket    │                  │
└──────────┬───────────┘                   │
           │ POST /api/v1/dispatch         │
           ▼                                │
┌─────────────────────┐   HTTP client       │
│  shujietai backend   │ ──────────────── ▼
│  FastAPI dispatch     │      Hermes API Server (8642)
│  service              │ ──────────────── → kanban.db
└─────────────────────┘                    (via hermes CLI
                                             or API bridge)

Approach A: API Server Bridge (Recommended)

Hermes API Server runs on port 8642 inside the gateway, exposing an OpenAI-compatible interface. ShuJieTai's backend already calls this endpoint for LLM inference. Extending for kanban:

  1. Create tasks — shujietai calls hermes kanban create ... CLI or directly accesses kanban_db if co-hosted
  2. Listen for state — subscribe via hermes kanban notify-subscribe to push task events into shujietai's WebSocket channels
  3. Human input — map shujietai's "awaiting input" UI to kanban's blocked status; user reply triggers hermes kanban unblock
  4. Result handoff — worker's kanban_complete summary/metadata consumed by shujietai's dispatch event system
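Steps 1 and 3 can be sketched as a thin wrapper around the CLI. The flags --assignee, --parent, and --json and the id field of the JSON output are taken from the quickstart in this post; the positional task id for unblock is inferred from the /kanban unblock t_abcd slash command, and the function names are hypothetical.

```python
import json
import subprocess
from typing import List, Optional


def build_create_cmd(title: str, assignee: str,
                     parent: Optional[str] = None) -> List[str]:
    """Build the argv for `hermes kanban create` (list form, so no shell quoting)."""
    cmd = ["hermes", "kanban", "create", title, "--assignee", assignee, "--json"]
    if parent:
        cmd += ["--parent", parent]
    return cmd


def build_unblock_cmd(task_id: str) -> List[str]:
    """Build the argv for `hermes kanban unblock`; the positional id is assumed."""
    return ["hermes", "kanban", "unblock", task_id]


def create_task(title: str, assignee: str, parent: Optional[str] = None) -> str:
    """Run the CLI and return the new task id from its JSON output."""
    result = subprocess.run(
        build_create_cmd(title, assignee, parent),
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["id"]


# Building the command does not require hermes to be installed.
cmd = build_create_cmd("Implement auth API", "backend-dev", parent="t_abcd")
```

Passing the arguments as a list keeps user-supplied titles from being interpreted by a shell, which matters once the title comes from a web form.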

Approach B: Shared SQLite Direct Read

If shujietai and Hermes are co-hosted, read ~/.hermes/kanban.db directly (WAL mode allows concurrent reads). ShuJieTai's cockpit API already has build_runtime_state() — add a kanban panel polling task stats.
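A polling panel could read the stats with a read-only connection, so it can never write to the board by accident. The sketch assumes only that a tasks table with a status column exists (tasks.status is mentioned earlier in this post); the demo uses a throwaway database instead of the real ~/.hermes/kanban.db.

```python
import os
import sqlite3
import tempfile
from pathlib import Path
from typing import Dict


def task_counts(db_path: str) -> Dict[str, int]:
    """Return {status: count} using a read-only SQLite connection."""
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute("SELECT status, COUNT(*) FROM tasks GROUP BY status")
        return dict(rows)
    finally:
        conn.close()


# Demo against a temporary database with an assumed minimal schema.
demo_path = os.path.join(tempfile.mkdtemp(), "kanban.db")
writer = sqlite3.connect(demo_path)
writer.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT)")
writer.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [("t_1", "ready"), ("t_2", "ready"), ("t_3", "done")],
)
writer.commit()
writer.close()

counts = task_counts(demo_path)
```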

Note: shujietai's dispatch layer has its own state machine (queued → running → completed/failed/aborted) which doesn't map 1:1 to kanban's 7 statuses. Build a mapping layer rather than modifying kanban itself.
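Such a mapping layer might start as a lookup table plus an "awaiting input" flag. The specific assignments below are assumptions about how one could fold seven statuses into the dispatch states, not a mapping defined by either project; in particular, showing blocked as a running task that awaits input is a design choice.

```python
from typing import NamedTuple


class DispatchView(NamedTuple):
    """How a kanban task would be presented in the dispatch layer (hypothetical)."""
    state: str
    awaiting_input: bool


# Assumed mapping; failed/aborted have no direct kanban counterpart here.
_KANBAN_TO_DISPATCH = {
    "triage": "queued",
    "todo": "queued",
    "ready": "queued",
    "running": "running",
    "blocked": "running",    # still in flight, but waiting on a human
    "done": "completed",
    "archived": "completed",
}


def to_dispatch_view(kanban_status: str) -> DispatchView:
    """Translate a kanban status; raise ValueError on an unknown status."""
    try:
        state = _KANBAN_TO_DISPATCH[kanban_status]
    except KeyError:
        raise ValueError(f"unknown kanban status: {kanban_status!r}") from None
    return DispatchView(state, awaiting_input=(kanban_status == "blocked"))
```

Keeping the table in one place means a new kanban status fails loudly with ValueError instead of being silently shown in the wrong column.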

Key Gotchas

  • API Server binding guard: setting host: 0.0.0.0 without API_SERVER_KEY silently falls back to 127.0.0.1, so Docker containers must set API_SERVER_KEY correctly
  • API key mismatch: shujietai's HERMES_API_KEY must match the Hermes gateway's API_SERVER_KEY; otherwise requests fail with 401
  • Hot-reloading .env: docker compose restart does NOT re-read .env; use docker compose up -d --force-recreate instead
  • Session-dispatch gap: shujietai's dispatch_tasks and sessions tables don't auto-sync; after creating a dispatch task, explicitly call store.ingest() to create the session record

[Figure: Competitive Landscape & shujietai Integration]

Competitive Landscape

| Dimension | Hermes Kanban | CrewAI | LangGraph | Airflow | Temporal | Prefect |
| --- | --- | --- | --- | --- | --- | --- |
| Core model | Durable SQLite board + dispatcher | In-process agent orchestration | Graph state machine | DAG scheduler | Durable workflow engine | Dynamic workflows |
| Persistence | SQLite WAL (local) | In-memory, optional DB | Checkpoint stores (pluggable) | DB (MySQL/PG/SQLite) | Event sourcing (Cassandra/PG/SQLite) | DB (PG/SQLite) |
| Dependencies | parent→child link auto-promotion | Task dependency queue | Graph edges + conditional routing | DAG dependencies | await + signal | Dependencies + conditions |
| Human-in-loop | Native block/unblock + comments | Requires custom HumanInputTool | Requires custom interrupt | Minimal (external sensor) | Native signal + activity | Minimal |
| Crash recovery | Claim TTL + reclaim + run history | No native support | Checkpoint replay | Retry policies | Event sourcing full replay | Retry + cached results |
| Cross-agent coord | Peer: any profile reads/writes any task | Sequential/hierarchical flow | Graph nodes implicitly pass state | Task instance context | Workflow signal/child | Flow references |
| Integration model | CLI + REST + WebSocket + tool calls | Python SDK | Python SDK (LangChain) | Python SDK + REST API | Multi-language SDK + gRPC | Python SDK + REST |
| Multi-tenancy | Native tenant tag + board isolation | None | None | None native | Namespaces | None native |
| Live UI | Built-in dashboard plugin | None native | LangGraph Studio (paid) | Built-in web UI (Flower for Celery workers) | Web UI | Cloud UI / open-source |
| Pricing | Open source, free | Open source, free | Open source free / Studio paid | Open source, free | Open source free / Cloud paid | Open source free / Cloud paid |
| Best for | Multi-agent persistent collab + human-in-loop | Fast prototyping | Complex graph state flows | Scheduled batch processing | Long transactions / microservice orchestration | Lightweight data pipelines |

Detailed Analysis

CrewAI puts orchestration into Python objects — Agent, Task, Crew as code. Fastest to start, but state lives in memory. Process dies, state dies. Great for proofs of concept and one-shot pipelines.

LangGraph models agent flows as stateful graphs. Conditional edges enable complex branching logic, and checkpoint-based interrupt/resume is a real step above CrewAI. But its "graph as code" model couples orchestration to business logic — changing flow means changing code and redeploying. Studio is a paid product; free-tier debugging is limited.

Airflow is the king of scheduled batch processing. Declarative DAGs, reliable scheduler, huge ecosystem. But it has no native human-in-the-loop primitive — a task requiring human approval must be simulated with external sensors, which is fragile and awkward.

Temporal is the closest philosophical competitor to Hermes Kanban. Event sourcing, automatic crash replay, native support for long transactions and signal interrupts. Steep learning curve (workflow code has strict constraints), but for microservice orchestration and cross-day/cross-week transactions, it's the most robust choice.

Prefect sits between Airflow's weight and CrewAI's lightness — dynamic DAGs, Python-native decorators, decent dashboard. Human-in-the-loop and cross-agent audit trail are weak.

Hermes Kanban's differentiators come down to three things: (1) human-in-the-loop as a first-class citizen — block/unblock/comment isn't bolted on, it's the coordination primitive; (2) peer agent coordination — any profile can read/write any task, not just parent→child; (3) structured handoff — summary + metadata as JSON means downstream agents get a parseable handoff without scraping prose.

10-Minute Quickstart

# 1. Initialize
hermes kanban init

# 2. Make sure gateway is running (hosts the embedded dispatcher)
hermes gateway start

# 3. Create a research task
hermes kanban create "research AI agent orchestration patterns" \
    --assignee researcher --priority 2

# 4. Create a dependency chain
SCHEMA=$(hermes kanban create "Design auth schema" \
    --assignee backend-dev --json | jq -r .id)

hermes kanban create "Implement auth API" \
    --assignee backend-dev --parent "$SCHEMA"

# 5. Watch in real time
hermes kanban watch

# 6. Check stats
hermes kanban stats

# 7. Open the dashboard
hermes dashboard   # click the Kanban tab

From gateway chat

/kanban list
/kanban create "write launch post" --assignee writer --parent t_research
/kanban comment t_abcd "use the 2026 schema, not 2025"
/kanban unblock t_abcd

Creating a task from chat auto-subscribes you to it: you are notified when the task completes or blocks.

When to Use What

| Scenario | Recommendation |
| --- | --- |
| Quick parallel subtask, no persistence needed | delegate_task |
| Multi-role collaboration with human approval gates | Kanban |
| Scheduled batch processing, data pipelines | Airflow / Prefect |
| Long transactions across microservices | Temporal |
| Complex graph state flows, conditional branching | LangGraph |
| Fast prototype, one-shot pipeline | CrewAI |

Sources

Hermes Agent:

  • Repository: https://github.com/NousResearch/hermes-agent (MIT, ⭐7k+)
  • Documentation: https://hermes-agent.nousresearch.com/docs/user-guide/features/kanban
  • Key source files inspected:
    • hermes_cli/kanban_db.py — SQLite kanban core, 4000+ lines, state machine, claim, dispatch
    • hermes_cli/kanban.py — CLI subcommand entry point
    • hermes_cli/kanban_diagnostics.py — diagnostics and circuit breaker
    • tools/kanban_tools.py — Agent-side kanban_* tool definitions
    • plugins/kanban/dashboard/plugin_api.py — Dashboard REST/WS plugin
    • website/docs/user-guide/features/kanban.md — Official docs (776 lines)
    • website/docs/user-guide/features/kanban-tutorial.md — Official tutorial

shujietai (数据台):

  • Project path: /home/guancy/workspace/shujietai (internal project)
  • Key files:
    • backend/app/services/dispatch_service.py — dispatch state machine and CRUD
    • backend/app/services/dispatch_worker.py — async worker streaming AI responses
    • backend/app/api/routes_dispatch.py — REST endpoints
    • frontend/src/composables/useDispatchTask.js — frontend task lifecycle