Hermes Kanban: A Durable Task Board for Multi-Agent Collaboration — Architecture, Internals, and Competitive Landscape

AI agents are moving from solo operators to team collaborators. When you need two researchers running in parallel, an analyst synthesizing their output, then a writer drafting the brief — who coordinates whom? Who handles crash retries? What happens when a human wants to weigh in?

Hermes Kanban is the system that answers these questions. This post takes it apart from scratch: core concepts, state machine and concurrency model, dependency engine, human-in-the-loop mechanism, and how to integrate it into a third-party platform like shujietai. A six-way comparison with mainstream orchestration systems rounds out the picture.

Why a Board, Not Subagents?

Hermes already has delegate_task for subagent calls. Kanban solves a fundamentally different problem:

Dimension	delegate_task	Kanban
Shape	RPC call (fork → join)	Durable message queue + state machine
Crash recovery	None — failed is failed	Block → unblock → re-run; crash → reclaim
Human in the loop	Not supported	Comment / unblock at any point
Cross-agent audit	Lost on context compression	Durable rows in SQLite forever
Coordination	Hierarchical (caller → callee)	Peer — any profile reads/writes any task

One-sentence distinction: delegate_task is a function call; Kanban is a work queue where every handoff is a row any profile (or human) can see and edit.

[Figure: Hermes Kanban Architecture Overview]

Core Concepts

Task

The minimal unit of work. A single SQLite row with title, body, assignee (profile name), status, optional tenant namespace, and idempotency key (dedup for retried automation).
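
As a rough illustration of how an idempotency key can dedupe retried automation, here is a minimal sketch against a toy SQLite table. The column names, the `open_board` and `create_task` helpers, and the `t_` id format are assumptions made for this example, not the actual kanban_db schema or API.

```python
import sqlite3
import uuid


def open_board(path=":memory:"):
    """Open a toy board DB. This schema is a simplified assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tasks (
            id              TEXT PRIMARY KEY,
            title           TEXT NOT NULL,
            body            TEXT NOT NULL DEFAULT '',
            assignee        TEXT,
            status          TEXT NOT NULL DEFAULT 'todo',
            tenant          TEXT,
            idempotency_key TEXT UNIQUE  -- NULL may appear many times
        )
        """
    )
    return conn


def create_task(conn, title, assignee, idempotency_key=None):
    """Insert a task, or return the existing id when the key was seen before."""
    if idempotency_key is not None:
        row = conn.execute(
            "SELECT id FROM tasks WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        if row is not None:
            return row[0]  # retried automation: reuse the first task
    task_id = "t_" + uuid.uuid4().hex[:8]
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO tasks (id, title, assignee, idempotency_key) "
            "VALUES (?, ?, ?, ?)",
            (task_id, title, assignee, idempotency_key),
        )
    return task_id
```

Calling `create_task` twice with the same key returns the same id and leaves a single row, which is the behavior a retried cron job or webhook would rely on.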

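Seven statuses: triage, todo, ready, running, blocked, done, and archived. The usual forward path is triage → todo → ready → running → done → archived; blocked is a side state that a running task enters and leaves again via unblock.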

Link

A task_links row recording a parent → child dependency. The dispatcher promotes todo → ready automatically when all parents reach done. No manual coordination — the engine just runs.
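
The promotion rule can be sketched in a few lines of SQL. This is an illustrative version under assumed table and column names (`tasks.id`, `tasks.status`, `task_links.parent_id`, `task_links.child_id`); the real dispatcher may differ. Note that in this sketch a todo task with no parents is promoted as well, since the "all parents done" condition holds vacuously.

```python
import sqlite3


def make_db():
    """Create a toy schema; names are assumptions for this example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE task_links (parent_id TEXT NOT NULL, child_id TEXT NOT NULL);
        """
    )
    return conn


def promote_ready(conn):
    """Promote todo tasks whose parents are all done. Returns promoted ids."""
    rows = conn.execute(
        """
        SELECT c.id FROM tasks AS c
        WHERE c.status = 'todo'
          AND NOT EXISTS (
              SELECT 1 FROM task_links AS l
              JOIN tasks AS p ON p.id = l.parent_id
              WHERE l.child_id = c.id AND p.status <> 'done'
          )
        ORDER BY c.id
        """
    ).fetchall()
    promoted = [r[0] for r in rows]
    with conn:
        # Re-check status in the UPDATE so a concurrent change is not overwritten.
        conn.executemany(
            "UPDATE tasks SET status = 'ready' WHERE id = ? AND status = 'todo'",
            [(task_id,) for task_id in promoted],
        )
    return promoted
```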

Comment

The inter-agent protocol. Agents and humans append comments; when a worker is (re-)spawned it reads the full comment thread as part of its context.

Workspace

The directory a worker operates in. Three kinds:

scratch (default) — fresh tmp dir, GC'd when the task is archived
dir: — shared persistent directory (Obsidian vault, ops dir, per-account folder). Must be absolute.
worktree — a git worktree for coding tasks
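
To make the three kinds concrete, here is a small sketch of a workspace spec parser that enforces the absolute-path rule. The `dir:` prefix followed directly by a path, the `Workspace` dataclass, and `parse_workspace` itself are assumptions for illustration; Hermes may represent workspaces differently.

```python
import posixpath
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workspace:
    kind: str            # "scratch", "dir", or "worktree"
    path: Optional[str]  # only set when kind == "dir"


def parse_workspace(spec: str = "scratch") -> Workspace:
    """Parse a workspace spec string (hypothetical format)."""
    if spec in ("scratch", "worktree"):
        return Workspace(spec, None)
    if spec.startswith("dir:"):
        path = spec[len("dir:"):]
        # Shared directories must be absolute so every worker resolves
        # the same location regardless of its current directory.
        if not posixpath.isabs(path):
            raise ValueError(f"dir: workspace must be absolute, got {path!r}")
        return Workspace("dir", path)
    raise ValueError(f"unknown workspace kind: {spec!r}")
```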

Board

A standalone queue with its own SQLite DB, workspace directory, and dispatcher loop. One install can have many boards, and they are fully isolated from one another: a worker spawned for board atm10-server cannot see tasks on the default board.

Tenant

Soft namespace within a board. One specialist fleet can serve multiple businesses. Workers prefix memory writes with their tenant tag so context doesn't leak.

Dispatcher

A long-lived loop that, every N seconds (default 60), runs one tick: reclaim stale claims → reclaim crashed workers → promote todo tasks whose parents are all done to ready → atomically claim ready tasks → spawn the assigned profiles. It runs inside the gateway by default.

Circuit breaker: after ~5 consecutive spawn failures on the same task the dispatcher auto-blocks it with the last error — prevents thrashing on tasks whose profile doesn't exist or workspace won't mount.
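
The breaker logic amounts to a consecutive-failure counter. The sketch below is one plausible shape for it; the threshold constant, the `TaskState` fields, and `record_spawn_result` are hypothetical names, and the exact threshold in Hermes is only described as roughly five.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SPAWN_FAILURES = 5  # assumption: the text says "~5"


@dataclass
class TaskState:
    status: str = "ready"
    spawn_failures: int = 0
    block_reason: Optional[str] = None


def record_spawn_result(task: TaskState, ok: bool, error: str = "") -> TaskState:
    """Update a task after one spawn attempt."""
    if ok:
        task.spawn_failures = 0  # any success resets the streak
        task.status = "running"
        return task
    task.spawn_failures += 1
    if task.spawn_failures >= MAX_SPAWN_FAILURES:
        # Stop retrying and surface the last error to a human.
        task.status = "blocked"
        task.block_reason = error
    return task
```

Because the counter resets on success, only an unbroken run of failures trips the breaker, which matches the "consecutive" wording above.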

[Figure: Kanban State Machine & Concurrent Claim Flow]

State Machine and Concurrency Model

Status Transitions

triage ──→ todo ──→ ready ──→ running ──→ done
  │          ↑         ↑         │    ↓
  └──────────┘         │      blocked ──→ (unblock) → ready
                       │         │
                  (all parents    └──→ each retry creates a new run
                   done)

The blocked → unblock → ready path is the key human-in-the-loop mechanism. A worker calls kanban_block(reason="..."), a human unblocks via dashboard or CLI, and the dispatcher respawns on the next tick.

Atomic Claims with WAL Concurrency

SQLite runs in WAL mode + BEGIN IMMEDIATE write transactions + compare-and-swap (CAS) updates on tasks.status and tasks.claim_lock. SQLite serializes writers via its WAL lock — at most one claimer wins any given task. Losers observe zero affected rows and move on. No distributed locks, no retry loops.
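
A compare-and-swap claim of this kind can be sketched with Python's sqlite3 module. The table layout and the `open_db` and `try_claim` helpers are assumptions for this example; only the technique (BEGIN IMMEDIATE plus a conditional UPDATE, then checking the affected row count) comes from the description above.

```python
import sqlite3


def open_db(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN IMMEDIATE / COMMIT below are the only transaction control.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")  # has no effect on in-memory DBs
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, claim_lock TEXT)"
    )
    return conn


def try_claim(conn, task_id, worker_id):
    """Attempt to claim a ready task. Returns True only for the winner."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', claim_lock = ? "
            "WHERE id = ? AND status = 'ready' AND claim_lock IS NULL",
            (worker_id, task_id),
        )
        won = cur.rowcount == 1  # zero rows means someone else got there first
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return won
```

The WHERE clause is the compare step and the SET clause is the swap, so a second claimer simply sees zero affected rows.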

Claim TTL defaults to 15 minutes. Workers that outlive this window should call kanban_heartbeat() periodically.
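
The interplay of claim TTL, heartbeat, and reclaim might look like the following sketch. It assumes a `claim_expires` timestamp column and assumes that a reclaimed task returns to ready; both are guesses for illustration, as are the function names. Time is passed in explicitly so the behavior is easy to follow.

```python
import sqlite3

CLAIM_TTL_SECONDS = 15 * 60  # default TTL from the text


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT NOT NULL, "
        "claim_lock TEXT, claim_expires REAL)"
    )
    return conn


def claim(conn, task_id, worker_id, now):
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', claim_lock = ?, claim_expires = ? "
            "WHERE id = ? AND status = 'ready' AND claim_lock IS NULL",
            (worker_id, now + CLAIM_TTL_SECONDS, task_id),
        )
    return cur.rowcount == 1


def heartbeat(conn, task_id, worker_id, now):
    """Extend the claim; only the current holder can do so."""
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET claim_expires = ? "
            "WHERE id = ? AND claim_lock = ? AND status = 'running'",
            (now + CLAIM_TTL_SECONDS, task_id, worker_id),
        )
    return cur.rowcount == 1


def reclaim_stale(conn, now):
    """Release claims whose TTL has passed. Returns the number reclaimed."""
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET status = 'ready', claim_lock = NULL, claim_expires = NULL "
            "WHERE status = 'running' AND claim_expires < ?",
            (now,),
        )
    return cur.rowcount
```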

Runs — One Row Per Attempt

A task is a logical unit of work; a run is one attempt to execute it. When the dispatcher claims a ready task, it creates a task_runs row and points tasks.current_run_id at it. When the attempt ends (completed, blocked, crashed, timed out, spawn-failed, reclaimed), the run closes with an outcome.

A task attempted three times has three task_runs rows. Full attempt history for postmortems — "the second reviewer attempt approved, the third merged."
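
One way to picture the task/run split is the toy model below. The `task_runs` columns, the outcome spellings, and the choice to clear `current_run_id` when a run closes are assumptions for this sketch rather than the real schema.

```python
import sqlite3

# Outcome names follow the list in the text; exact spellings are assumed.
OUTCOMES = {"completed", "blocked", "crashed", "timed_out", "spawn_failed", "reclaimed"}


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tasks (id TEXT PRIMARY KEY, current_run_id INTEGER);
        CREATE TABLE task_runs (
            id      INTEGER PRIMARY KEY AUTOINCREMENT,
            task_id TEXT NOT NULL,
            outcome TEXT            -- NULL while the attempt is in flight
        );
        """
    )
    return conn


def start_run(conn, task_id):
    """Open a new attempt and point the task at it."""
    with conn:
        cur = conn.execute("INSERT INTO task_runs (task_id) VALUES (?)", (task_id,))
        run_id = cur.lastrowid
        conn.execute("UPDATE tasks SET current_run_id = ? WHERE id = ?", (run_id, task_id))
    return run_id


def close_run(conn, run_id, outcome):
    """Record how an attempt ended and detach it from the task."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    with conn:
        conn.execute("UPDATE task_runs SET outcome = ? WHERE id = ?", (outcome, run_id))
        conn.execute(
            "UPDATE tasks SET current_run_id = NULL WHERE current_run_id = ?", (run_id,)
        )


def history(conn, task_id):
    """Return the outcomes of every attempt, oldest first."""
    rows = conn.execute(
        "SELECT outcome FROM task_runs WHERE task_id = ? ORDER BY id", (task_id,)
    )
    return [r[0] for r in rows]
```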

Worker Lifecycle

Workers don't shell out to hermes kanban. The dispatcher sets HERMES_KANBAN_TASK=t_abcd in the child's env, which flips on a dedicated kanban toolset — seven tools that read and mutate the board directly via the Python kanban_db layer.

Tool	Purpose
`kanban_show`	Read current task (title, body, parent handoffs, prior attempts, comments)
`kanban_complete`	Finish with structured summary + metadata handoff
`kanban_block`	Escalate for human input with a reason
`kanban_heartbeat`	Signal liveness during long operations
`kanban_comment`	Append a durable note to the task thread
`kanban_create`	(Orchestrators) fan out child tasks
`kanban_link`	(Orchestrators) add dependency edge after the fact

A typical worker turn:

# 1. Read the task
kanban_show()
# 2. Do real work (terminal/file/code tools)...
kanban_heartbeat(note="4 of 8 files transformed")
# 3. Complete with structured handoff
kanban_complete(
    summary="migrated to token-bucket; 14 tests pass",
    metadata={"changed_files": ["limiter.py"], "tests_run": 14},
)

The structured handoff is the key innovation: summary is for humans and downstream agents, metadata is machine-readable JSON that downstream workers consume directly via kanban_show()'s worker_context field — no prose parsing required.
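
As a sketch of what "no prose parsing" means for a downstream worker, the helper below pulls `changed_files` out of parent handoffs. The shape of the context dict (a `parent_handoffs` list whose items carry `summary` and `metadata`) is an assumption for illustration; the source only says that metadata arrives as machine-readable JSON.

```python
import json


def changed_files_from_parents(worker_context):
    """Collect changed_files from every parent handoff's metadata."""
    files = []
    for handoff in worker_context.get("parent_handoffs", []):
        meta = handoff.get("metadata") or {}
        if isinstance(meta, str):
            # Tolerate metadata delivered as a JSON string.
            meta = json.loads(meta)
        files.extend(meta.get("changed_files", []))
    return sorted(set(files))
```

A reviewer profile could use the returned list to decide which files to open, without reading the free-text summary at all.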

Eight Collaboration Patterns

Pattern	Shape	Example
Fan-out	N siblings, same role	"research 5 angles in parallel"
Pipeline	role chain: scout → editor → writer	daily brief assembly
Voting / quorum	N siblings + 1 aggregator	3 researchers → 1 reviewer picks
Long-running journal	same profile + shared dir + cron	Obsidian vault accumulation
Human-in-the-loop	worker blocks → human comments → unblock	ambiguous decisions
@mention	inline routing from prose	`@reviewer look at this`
Thread-scoped workspace	`/kanban here` in a thread	per-project gateway threads
Fleet farming	one profile, N subjects	50 social accounts
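
Most of these patterns reduce to a particular shape of tasks and parent → child links. As an example, the voting / quorum row can be planned as N sibling tasks that all feed one aggregator. The `plan_quorum` helper and its dict layout are hypothetical; creating the actual tasks would go through kanban_create or the CLI.

```python
def plan_quorum(topic, n_researchers, researcher="researcher", aggregator="reviewer"):
    """Return (tasks, links) for N parallel researchers feeding one aggregator."""
    if n_researchers < 1:
        raise ValueError("need at least one researcher")
    siblings = [
        {"key": f"research-{i}", "title": f"{topic} (angle {i})", "assignee": researcher}
        for i in range(1, n_researchers + 1)
    ]
    aggregate = {
        "key": "aggregate",
        "title": f"pick the best result for: {topic}",
        "assignee": aggregator,
    }
    # Each link is (parent, child): the aggregator depends on every sibling,
    # so it only becomes ready once all of them are done.
    links = [(task["key"], aggregate["key"]) for task in siblings]
    return siblings + [aggregate], links
```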

Three Surfaces, One DB

The same kanban_db layer backs three front doors:

Agent tool calls — kanban_* tools inside worker processes
CLI / slash commands — hermes kanban create ... or /kanban list in chat
Dashboard GUI — drag-and-drop, inline create, bulk actions, WebSocket live updates

All three agree by construction — writes go through the same kanban_db code path, so they can never drift.

Integrating with Third-Party Platforms: The shujietai Case

ShuJieTai (数据台) is a FastAPI + Vue 3 AI agent orchestration platform with its own dispatch layer (dispatch_tasks + dispatch_events tables, DispatchWorkerPool, WebSocket streaming). How does Kanban integrate?

Integration Architecture

┌─────────────────────┐      REST API
│  shujietai frontend  │ ◀───────────────┐
│  Vue 3 + WebSocket    │                  │
└──────────┬───────────┘                   │
           │ POST /api/v1/dispatch         │
           ▼                                │
┌─────────────────────┐   HTTP client       │
│  shujietai backend   │ ──────────────── ▼
│  FastAPI dispatch     │      Hermes API Server (8642)
│  service              │ ──────────────── → kanban.db
└─────────────────────┘                    (via hermes CLI
                                             or API bridge)

Approach A: API Server Bridge (Recommended)

Hermes API Server runs on port 8642 inside the gateway and exposes an OpenAI-compatible interface. ShuJieTai's backend already calls this endpoint for LLM inference. To extend the integration to Kanban:

Create tasks — shujietai calls hermes kanban create ... CLI or directly accesses kanban_db if co-hosted
Listen for state — subscribe via hermes kanban notify-subscribe to push task events into shujietai's WebSocket channels
Human input — map shujietai's "awaiting input" UI to kanban's blocked status; user reply triggers hermes kanban unblock
Result handoff — worker's kanban_complete summary/metadata consumed by shujietai's dispatch event system

Approach B: Shared SQLite Direct Read

If shujietai and Hermes are co-hosted, read ~/.hermes/kanban.db directly (WAL mode allows concurrent reads). ShuJieTai's cockpit API already has build_runtime_state() — add a kanban panel polling task stats.

Note: shujietai's dispatch layer has its own state machine (queued → running → completed/failed/aborted) which doesn't map 1:1 to kanban's 7 statuses. Build a mapping layer rather than modifying kanban itself.

Key Gotchas

API Server binding guard: if host is set to 0.0.0.0 without API_SERVER_KEY, the server silently falls back to 127.0.0.1. Docker containers that must accept outside connections need API_SERVER_KEY set.
API key mismatch: shujietai's HERMES_API_KEY must match the Hermes gateway's API_SERVER_KEY; otherwise requests fail with 401.
Hot-reloading .env: docker compose restart does NOT re-read .env; use docker compose up -d --force-recreate instead.
Session-dispatch gap: shujietai's dispatch_tasks and sessions tables don't auto-sync; after creating a dispatch task, explicitly call store.ingest() to create the session record.
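
The first two gotchas are easy to catch before deployment with a small preflight check on the shujietai side. The `preflight` function below is a hypothetical helper that only encodes the two rules as described above; it does not call Hermes and makes no claim about how the gateway implements them.

```python
def preflight(bind_host, api_server_key, client_api_key):
    """Return a list of human-readable configuration problems (empty if none)."""
    problems = []
    if bind_host == "0.0.0.0" and not api_server_key:
        problems.append(
            "host is 0.0.0.0 but API_SERVER_KEY is unset: "
            "the server will fall back to 127.0.0.1"
        )
    if api_server_key and client_api_key != api_server_key:
        problems.append(
            "HERMES_API_KEY does not match API_SERVER_KEY: expect 401 responses"
        )
    return problems
```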

[Figure: Competitive Landscape & shujietai Integration]

Competitive Landscape

Dimension	Hermes Kanban	CrewAI	LangGraph	Airflow	Temporal	Prefect
Core model	Durable SQLite board + dispatcher	In-process agent orchestration	Graph state machine	DAG scheduler	Durable workflow engine	Dynamic workflows
Persistence	SQLite WAL (local)	In-memory, optional DB	Checkpoint stores (pluggable)	DB (MySQL/PG/SQLite)	Event sourcing (Cassandra/PG/SQLite)	DB (PG/SQLite)
Dependencies	parent→child link auto-promotion	Task dependency queue	Graph edges + conditional routing	DAG dependencies	await + signal	Dependencies + conditions
Human-in-loop	Native block/unblock + comments	Task-level human_input flag (console prompt)	Native interrupt/resume via checkpoints	Minimal (external sensor)	Native signal + activity	Minimal
Crash recovery	Claim TTL + reclaim + run history	No native support	Checkpoint replay	Retry policies	Event sourcing full replay	Retry + cached results
Cross-agent coord	Peer — any profile reads/writes any task	Sequential/hierarchical flow	Graph nodes implicitly pass state	Task instance context	Workflow signal/child	Flow references
Integration model	CLI + REST + WebSocket + tool calls	Python SDK	Python SDK (LangChain)	Python SDK + REST API	Multi-language SDK + gRPC	Python SDK + REST
Multi-tenancy	Native tenant tag + board isolation	None	None	None native	Namespaces	None native
Live UI	Built-in dashboard plugin	None native	LangGraph Studio (paid)	Built-in web UI	Web UI	Cloud UI / open-source
Pricing	Open source, free	Open source, free	Open source free / Studio paid	Open source, free	Open source free / Cloud paid	Open source free / Cloud paid
Best for	Multi-agent persistent collab + human-in-loop	Fast prototyping	Complex graph state flows	Scheduled batch processing	Long transactions / microservice orchestration	Lightweight data pipelines

Detailed Analysis

CrewAI puts orchestration into Python objects — Agent, Task, Crew as code. Fastest to start, but state lives in memory. Process dies, state dies. Great for proofs of concept and one-shot pipelines.

LangGraph models agent flows as stateful graphs. Conditional edges enable complex branching logic, and checkpoint-based interrupt/resume is a real step above CrewAI. But its "graph as code" model couples orchestration to business logic — changing flow means changing code and redeploying. Studio is a paid product; free-tier debugging is limited.

Airflow is the king of scheduled batch processing. Declarative DAGs, reliable scheduler, huge ecosystem. But it has no native human-in-the-loop primitive — a task requiring human approval must be simulated with external sensors, which is fragile and awkward.

Temporal is the closest philosophical competitor to Hermes Kanban. Event sourcing, automatic crash replay, native support for long transactions and signal interrupts. Steep learning curve (workflow code has strict constraints), but for microservice orchestration and cross-day/cross-week transactions, it's the most robust choice.

Prefect sits between Airflow's weight and CrewAI's lightness — dynamic DAGs, Python-native decorators, decent dashboard. Human-in-the-loop and cross-agent audit trail are weak.

Hermes Kanban's differentiators come down to three things: (1) human-in-the-loop as a first-class citizen — block/unblock/comment isn't bolted on, it's the coordination primitive; (2) peer agent coordination — any profile can read/write any task, not just parent→child; (3) structured handoff — summary + metadata as JSON means downstream agents get a parseable handoff without scraping prose.

10-Minute Quickstart

# 1. Initialize
hermes kanban init

# 2. Make sure gateway is running (hosts the embedded dispatcher)
hermes gateway start

# 3. Create a research task
hermes kanban create "research AI agent orchestration patterns" \
    --assignee researcher --priority 2

# 4. Create a dependency chain
SCHEMA=$(hermes kanban create "Design auth schema" \
    --assignee backend-dev --json | jq -r .id)

hermes kanban create "Implement auth API" \
    --assignee backend-dev --parent $SCHEMA

# 5. Watch in real time
hermes kanban watch

# 6. Check stats
hermes kanban stats

# 7. Open the dashboard
hermes dashboard   # click the Kanban tab

From gateway chat

/kanban list
/kanban create "write launch post" --assignee writer --parent t_research
/kanban comment t_abcd "use the 2026 schema, not 2025"
/kanban unblock t_abcd

Creating a task auto-subscribes you, so you are notified when it completes or blocks.

When to Use What

Scenario	Recommendation
Quick parallel subtask, no persistence needed	`delegate_task`
Multi-role collaboration with human approval gates	Kanban
Scheduled batch processing, data pipelines	Airflow / Prefect
Long transactions across microservices	Temporal
Complex graph state flows, conditional branching	LangGraph
Fast prototype, one-shot pipeline	CrewAI

Sources

Hermes Agent:

Repository: https://github.com/NousResearch/hermes-agent (MIT, ⭐7k+)
Documentation: https://hermes-agent.nousresearch.com/docs/user-guide/features/kanban
Key source files inspected:
- hermes_cli/kanban_db.py — SQLite kanban core, 4000+ lines, state machine, claim, dispatch
- hermes_cli/kanban.py — CLI subcommand entry point
- hermes_cli/kanban_diagnostics.py — diagnostics and circuit breaker
- tools/kanban_tools.py — Agent-side kanban_* tool definitions
- plugins/kanban/dashboard/plugin_api.py — Dashboard REST/WS plugin
- website/docs/user-guide/features/kanban.md — Official docs (776 lines)
- website/docs/user-guide/features/kanban-tutorial.md — Official tutorial

CrewAI:

Repository: https://github.com/crewAIInc/crewAI (MIT, ⭐30k+)
Documentation: https://docs.crewai.com/

LangGraph:

Repository: https://github.com/langchain-ai/langgraph (MIT, ⭐10k+)
Documentation: https://langchain-ai.github.io/langgraph/

Apache Airflow:

Repository: https://github.com/apache/airflow (Apache-2.0, ⭐40k+)
Documentation: https://airflow.apache.org/docs/

Temporal:

Repository: https://github.com/temporalio/temporal (MIT, ⭐12k+)
Documentation: https://docs.temporal.io/

Prefect:

Repository: https://github.com/PrefectHQ/prefect (Apache-2.0, ⭐18k+)
Documentation: https://docs.prefect.io/

shujietai (数据台):

Project path: /home/guancy/workspace/shujietai (internal project)
Key files:
- backend/app/services/dispatch_service.py — dispatch state machine and CRUD
- backend/app/services/dispatch_worker.py — async worker streaming AI responses
- backend/app/api/routes_dispatch.py — REST endpoints
- frontend/src/composables/useDispatchTask.js — frontend task lifecycle