Hermes Kanban: A Durable Task Board for Multi-Agent Collaboration — Architecture, Internals, and Competitive Landscape
AI agents are moving from solo operators to team collaborators. When you need two researchers running in parallel, an analyst synthesizing their output, then a writer drafting the brief — who coordinates whom? Who handles crash retries? What happens when a human wants to weigh in?
Hermes Kanban is the system that answers these questions. This post takes it apart from scratch: core concepts, state machine and concurrency model, dependency engine, human-in-the-loop mechanism, and how to integrate it into a third-party platform like shujietai. A six-way comparison with mainstream orchestration systems rounds out the picture.
Why a Board, Not Subagents?
Hermes already has delegate_task for subagent calls. Kanban solves a fundamentally different problem:
| Dimension | delegate_task | Kanban |
|---|---|---|
| Shape | RPC call (fork → join) | Durable message queue + state machine |
| Crash recovery | None — failed is failed | Block → unblock → re-run; crash → reclaim |
| Human in the loop | Not supported | Comment / unblock at any point |
| Cross-agent audit | Lost on context compression | Durable rows in SQLite forever |
| Coordination | Hierarchical (caller → callee) | Peer — any profile reads/writes any task |
One-sentence distinction: delegate_task is a function call; Kanban is a work queue where every handoff is a row any profile (or human) can see and edit.
Core Concepts
Task
The minimal unit of work. A single SQLite row with title, body, assignee (profile name), status, optional tenant namespace, and idempotency key (dedup for retried automation).
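Seven statuses: triage, todo, ready, running, blocked, done, and archived. The happy path runs triage → todo → ready → running → done; blocked is a detour out of running rather than a stage every task passes through.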
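To make the idempotency key concrete, here is a minimal sketch using Python's sqlite3 module. The table layout, column names, and the create_task helper are assumptions made for illustration; the real schema in kanban_db.py may look different. It shows how a unique key lets a retried automation submit the same task twice without creating a duplicate row.

```python
import sqlite3

STATUSES = ("triage", "todo", "ready", "running", "blocked", "done", "archived")

# Hypothetical schema for illustration only; not the actual Hermes schema.
SCHEMA = """
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL DEFAULT '',
    assignee TEXT,
    status TEXT NOT NULL DEFAULT 'todo',
    tenant TEXT,
    idempotency_key TEXT UNIQUE
)
"""


def create_task(conn, title, assignee, idempotency_key=None):
    """Insert a task, or return the existing id when the key was seen before."""
    if idempotency_key is not None:
        row = conn.execute(
            "SELECT id FROM tasks WHERE idempotency_key = ?", (idempotency_key,)
        ).fetchone()
        if row:
            return row[0]
    cur = conn.execute(
        "INSERT INTO tasks (title, assignee, idempotency_key) VALUES (?, ?, ?)",
        (title, assignee, idempotency_key),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = create_task(conn, "nightly report", "analyst", idempotency_key="report-2026-01-01")
retry = create_task(conn, "nightly report", "analyst", idempotency_key="report-2026-01-01")
task_count = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
```

Both calls return the same id, and the table still holds a single row.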
Link
A task_links row recording a parent → child dependency. The dispatcher promotes todo → ready automatically when all parents reach done. No manual coordination — the engine just runs.
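The promotion rule can be sketched as a single SQL update. This is an illustration under assumptions: the column names parent_id and child_id and the promote_ready helper are hypothetical, and the sketch also treats a todo task with no parents as promotable, which the post does not state explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal tables; the real task_links columns may differ.
conn.executescript("""
CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE task_links (parent_id TEXT NOT NULL, child_id TEXT NOT NULL);
""")


def promote_ready(conn):
    """Move todo tasks to ready when none of their parents is still unfinished."""
    cur = conn.execute("""
        UPDATE tasks SET status = 'ready'
        WHERE status = 'todo'
          AND NOT EXISTS (
              SELECT 1 FROM task_links l
              JOIN tasks p ON p.id = l.parent_id
              WHERE l.child_id = tasks.id AND p.status != 'done'
          )
    """)
    conn.commit()
    return cur.rowcount


conn.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [("research_a", "done"), ("research_b", "running"), ("synthesis", "todo")],
)
conn.executemany(
    "INSERT INTO task_links VALUES (?, ?)",
    [("research_a", "synthesis"), ("research_b", "synthesis")],
)

promoted_before = promote_ready(conn)  # research_b is still running
conn.execute("UPDATE tasks SET status = 'done' WHERE id = 'research_b'")
promoted_after = promote_ready(conn)   # both parents are done now
```

The first call promotes nothing because one parent is still running; the second promotes the synthesis task.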
Comment
The inter-agent protocol. Agents and humans append comments; when a worker is (re-)spawned it reads the full comment thread as part of its context.
Workspace
The directory a worker operates in. Three kinds:
- scratch (default): fresh tmp dir, GC'd when the task is archived
- dir: (takes a path, which must be absolute): shared persistent directory (Obsidian vault, ops dir, per-account folder)
- worktree: a git worktree for coding tasks
Board
A standalone queue with its own SQLite DB, workspace directory, and dispatcher loop. One install can host many boards, and they are fully isolated: a worker spawned for board atm10-server cannot see tasks on the default board.
Tenant
Soft namespace within a board. One specialist fleet can serve multiple businesses. Workers prefix memory writes with their tenant tag so context doesn't leak.
Dispatcher
A long-lived loop that, every N seconds (default 60), reclaims stale claims, reclaims tasks from crashed workers, promotes todo tasks whose parents are all done to ready, atomically claims ready tasks, and spawns the assigned profiles. It runs inside the gateway by default.
Circuit breaker: after ~5 consecutive spawn failures on the same task the dispatcher auto-blocks it with the last error — prevents thrashing on tasks whose profile doesn't exist or workspace won't mount.
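The circuit breaker idea fits in a few lines. The sketch below is not the Hermes implementation: the SpawnBreaker class and its method names are hypothetical, and the threshold of 5 is taken from the approximate figure above.

```python
from collections import defaultdict

MAX_SPAWN_FAILURES = 5  # the post says "~5"; the exact threshold is an assumption


class SpawnBreaker:
    """Counts consecutive spawn failures per task and reports when to auto-block."""

    def __init__(self, threshold=MAX_SPAWN_FAILURES):
        self.threshold = threshold
        self.failures = defaultdict(int)

    def record_success(self, task_id):
        # A successful spawn resets the consecutive-failure streak.
        self.failures.pop(task_id, None)

    def record_failure(self, task_id, error):
        """Return a block reason once the threshold is reached, else None."""
        self.failures[task_id] += 1
        count = self.failures[task_id]
        if count >= self.threshold:
            return f"auto-blocked after {count} spawn failures: {error}"
        return None


breaker = SpawnBreaker()
reasons = [
    breaker.record_failure("t_abcd", "profile 'ghost' not found") for _ in range(5)
]
```

The first four failures return None; the fifth returns a reason string that a dispatcher could attach when it moves the task to blocked.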
State Machine and Concurrency Model
Status Transitions
triage ──→ todo ──→ ready ──→ running ──→ done
todo ──→ ready: automatic, once all parents are done
running ──→ blocked ──→ (unblock) ──→ ready
each retry creates a new run
The blocked → unblock → ready path is the key human-in-the-loop mechanism. A worker calls kanban_block(reason="..."), a human unblocks via dashboard or CLI, and the dispatcher respawns on the next tick.
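One way to reason about these transitions is an explicit table of allowed moves. The sketch below is an illustration, not the Hermes state machine: the ALLOWED mapping and the transition helper are hypothetical, and two edges (running back to ready on reclaim, done to archived) are assumptions inferred from the surrounding text.

```python
ALLOWED = {
    "triage": {"todo"},
    "todo": {"ready"},
    "ready": {"running"},
    "running": {"done", "blocked", "ready"},  # "ready" models a reclaim (assumed)
    "blocked": {"ready"},
    "done": {"archived"},                      # assumed
    "archived": set(),
}


def transition(current, target):
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


# The human-in-the-loop path: running -> blocked -> ready -> running
state = "running"
for nxt in ("blocked", "ready", "running"):
    state = transition(state, nxt)
```

Any move outside the table, such as todo straight to done, raises ValueError.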
Atomic Claims with WAL Concurrency
SQLite runs in WAL mode + BEGIN IMMEDIATE write transactions + compare-and-swap (CAS) updates on tasks.status and tasks.claim_lock. SQLite serializes writers via its WAL lock — at most one claimer wins any given task. Losers observe zero affected rows and move on. No distributed locks, no retry loops.
Claim TTL defaults to 15 minutes. Workers that outlive this window should call kanban_heartbeat() periodically.
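The claim itself can be sketched with the standard sqlite3 module. This is a minimal illustration under assumptions: the try_claim helper and the claim_expires column are hypothetical, and only tasks.status and tasks.claim_lock come from the description above.

```python
import os
import sqlite3
import tempfile
import time

CLAIM_TTL_SECONDS = 15 * 60


def open_db(path):
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are opened and closed explicitly below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def try_claim(conn, task_id, worker_id, now=None):
    """Compare-and-swap: only succeeds if the task is still ready and unclaimed."""
    now = time.time() if now is None else now
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', claim_lock = ?, claim_expires = ? "
            "WHERE id = ? AND status = 'ready' AND claim_lock IS NULL",
            (worker_id, now + CLAIM_TTL_SECONDS, task_id),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return cur.rowcount == 1  # zero affected rows means another claimer won


path = os.path.join(tempfile.mkdtemp(), "kanban-demo.db")
a, b = open_db(path), open_db(path)
a.execute(
    "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, "
    "claim_lock TEXT, claim_expires REAL)"
)
a.execute("INSERT INTO tasks VALUES ('t_abcd', 'ready', NULL, NULL)")

won_a = try_claim(a, "t_abcd", "worker-a")
won_b = try_claim(b, "t_abcd", "worker-b")
owner = a.execute("SELECT claim_lock FROM tasks WHERE id = 't_abcd'").fetchone()[0]
```

The second claimer sees zero affected rows and simply moves on, which is why no retry loop is needed.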
Runs — One Row Per Attempt
A task is a logical unit of work; a run is one attempt to execute it. When the dispatcher claims a ready task, it creates a task_runs row and points tasks.current_run_id at it. When the attempt ends (completed, blocked, crashed, timed out, spawn-failed, reclaimed), the run closes with an outcome.
A task attempted three times has three task_runs rows. Full attempt history for postmortems — "the second reviewer attempt approved, the third merged."
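The task/run split can be illustrated with two small tables. The helper names start_run and end_run and the exact columns are assumptions for this sketch; only task_runs, tasks.current_run_id, and the notion of an outcome come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (id TEXT PRIMARY KEY, current_run_id INTEGER);
CREATE TABLE task_runs (
    id INTEGER PRIMARY KEY,
    task_id TEXT NOT NULL,
    outcome TEXT            -- NULL while the attempt is still open
);
INSERT INTO tasks (id) VALUES ('t_abcd');
""")


def start_run(conn, task_id):
    """Open a new attempt and point the task at it."""
    cur = conn.execute("INSERT INTO task_runs (task_id) VALUES (?)", (task_id,))
    conn.execute(
        "UPDATE tasks SET current_run_id = ? WHERE id = ?", (cur.lastrowid, task_id)
    )
    conn.commit()
    return cur.lastrowid


def end_run(conn, task_id, outcome):
    """Close the current attempt with an outcome and detach it from the task."""
    conn.execute(
        "UPDATE task_runs SET outcome = ? "
        "WHERE id = (SELECT current_run_id FROM tasks WHERE id = ?)",
        (outcome, task_id),
    )
    conn.execute("UPDATE tasks SET current_run_id = NULL WHERE id = ?", (task_id,))
    conn.commit()


for outcome in ("crashed", "blocked", "completed"):
    start_run(conn, "t_abcd")
    end_run(conn, "t_abcd", outcome)

history = [
    row[0]
    for row in conn.execute(
        "SELECT outcome FROM task_runs WHERE task_id = 't_abcd' ORDER BY id"
    )
]
```

After three attempts the task has three closed runs, in order, which is the history a postmortem would read.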
Worker Lifecycle
Workers don't shell out to hermes kanban. The dispatcher sets HERMES_KANBAN_TASK=t_abcd in the child's env, which flips on a dedicated kanban toolset — seven tools that read and mutate the board directly via the Python kanban_db layer.
| Tool | Purpose |
|---|---|
| kanban_show | Read current task (title, body, parent handoffs, prior attempts, comments) |
| kanban_complete | Finish with structured summary + metadata handoff |
| kanban_block | Escalate for human input with a reason |
| kanban_heartbeat | Signal liveness during long operations |
| kanban_comment | Append a durable note to the task thread |
| kanban_create | (Orchestrators) fan out child tasks |
| kanban_link | (Orchestrators) add dependency edge after the fact |
A typical worker turn:
# 1. Read the task
kanban_show()
# 2. Do real work (terminal/file/code tools)...
kanban_heartbeat(note="4 of 8 files transformed")
# 3. Complete with structured handoff
kanban_complete(
    summary="migrated to token-bucket; 14 tests pass",
    metadata={"changed_files": ["limiter.py"], "tests_run": 14},
)
The structured handoff is the key innovation: summary is for humans and downstream agents, metadata is machine-readable JSON that downstream workers consume directly via kanban_show()'s worker_context field — no prose parsing required.
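On the consuming side, a downstream worker might read the metadata roughly like this. The shape of worker_context is not specified in this post, so the parent_handoffs key and the files_to_review helper below are purely hypothetical; the point is only that structured metadata removes the need to parse the summary text.

```python
import json

# Hypothetical shape of what a downstream worker might receive; the real
# worker_context layout is not specified in this post.
worker_context = {
    "parent_handoffs": [
        {
            "summary": "migrated to token-bucket; 14 tests pass",
            "metadata": json.dumps(
                {"changed_files": ["limiter.py"], "tests_run": 14}
            ),
        }
    ]
}


def files_to_review(context):
    """Collect changed files from every parent handoff without reading prose."""
    files = []
    for handoff in context.get("parent_handoffs", []):
        meta = handoff.get("metadata") or "{}"
        if isinstance(meta, str):  # tolerate metadata stored as a JSON string
            meta = json.loads(meta)
        files.extend(meta.get("changed_files", []))
    return files


review_list = files_to_review(worker_context)
```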
Eight Collaboration Patterns
| Pattern | Shape | Example |
|---|---|---|
| Fan-out | N siblings, same role | "research 5 angles in parallel" |
| Pipeline | role chain: scout → editor → writer | daily brief assembly |
| Voting / quorum | N siblings + 1 aggregator | 3 researchers → 1 reviewer picks |
| Long-running journal | same profile + shared dir + cron | Obsidian vault accumulation |
| Human-in-the-loop | worker blocks → human comments → unblock | ambiguous decisions |
| @mention | inline routing from prose | @reviewer look at this |
| Thread-scoped workspace | /kanban here in a thread | per-project gateway threads |
| Fleet farming | one profile, N subjects | 50 social accounts |
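Most of these patterns reduce to a set of tasks plus parent → child links. As an illustration, the voting / quorum shape can be planned as plain data before anything is created on a board. The fan_out_with_aggregator helper and the task ids are hypothetical; an orchestrator would turn this plan into kanban_create and kanban_link calls.

```python
def fan_out_with_aggregator(topic, angles, worker, aggregator):
    """Plan N sibling tasks plus one aggregator that depends on all of them."""
    siblings = [
        {"id": f"research_{i}", "title": f"{topic}: {angle}", "assignee": worker}
        for i, angle in enumerate(angles, start=1)
    ]
    final = {
        "id": "aggregate",
        "title": f"Pick the best take on {topic}",
        "assignee": aggregator,
    }
    # One parent -> child link per sibling, so the aggregator only becomes
    # ready after every sibling is done.
    links = [(task["id"], final["id"]) for task in siblings]
    return siblings + [final], links


tasks, links = fan_out_with_aggregator(
    "rate limiting",
    ["token bucket", "leaky bucket", "sliding window"],
    worker="researcher",
    aggregator="reviewer",
)
```

Dropping the aggregator gives plain fan-out; chaining the links instead of converging them gives a pipeline.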
Three Surfaces, One DB
The same kanban_db layer backs three front doors:
- Agent tool calls: kanban_* tools inside worker processes
- CLI / slash commands: hermes kanban create ... or /kanban list in chat
- Dashboard GUI: drag-and-drop, inline create, bulk actions, WebSocket live updates
All three agree by construction — writes go through the same kanban_db code path, so they can never drift.
Integrating with Third-Party Platforms: The shujietai Case
ShuJieTai (数据台) is a FastAPI + Vue 3 AI agent orchestration platform with its own dispatch layer (dispatch_tasks + dispatch_events tables, DispatchWorkerPool, WebSocket streaming). How does Kanban integrate?
Integration Architecture
┌──────────────────────┐      REST API
│  shujietai frontend  │ ◀───────────────┐
│  Vue 3 + WebSocket   │                 │
└──────────┬───────────┘                 │
           │ POST /api/v1/dispatch       │
           ▼                             │
┌──────────────────────┐   HTTP client   │
│  shujietai backend   │ ─────────────── ▼
│  FastAPI dispatch    │    Hermes API Server (8642)
│  service             │ ─────────────── → kanban.db
└──────────────────────┘                   (via hermes CLI
                                            or API bridge)
Approach A: API Server Bridge (Recommended)
Hermes API Server runs on port 8642 inside the gateway, exposing an OpenAI-compatible interface. ShuJieTai's backend already calls this endpoint for LLM inference. Extending for kanban:
- Create tasks: shujietai calls the hermes kanban create ... CLI, or accesses kanban_db directly if co-hosted
- Listen for state: subscribe via hermes kanban notify-subscribe to push task events into shujietai's WebSocket channels
- Human input: map shujietai's "awaiting input" UI to kanban's blocked status; a user reply triggers hermes kanban unblock
- Result handoff: the worker's kanban_complete summary/metadata is consumed by shujietai's dispatch event system
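The "create tasks" step could be wrapped on the shujietai side roughly as follows. This is a sketch under assumptions: it uses only the flags that appear in the quickstart of this post (--assignee, --parent, --json) and assumes the JSON output carries an id field, as the jq example there suggests. The create_argv and create_task helpers are hypothetical names, and the run parameter exists so the wrapper can be exercised without a Hermes install.

```python
import json
import subprocess


def create_argv(title, assignee, parent=None):
    """Build the CLI invocation as an argument list (no shell quoting needed)."""
    argv = ["hermes", "kanban", "create", title, "--assignee", assignee, "--json"]
    if parent:
        argv += ["--parent", parent]
    return argv


def create_task(title, assignee, parent=None, run=subprocess.run):
    """Create a task through the CLI and return its id.

    Assumes the command prints a JSON object with an "id" field.
    """
    result = run(
        create_argv(title, assignee, parent),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)["id"]


argv = create_argv("write launch post", "writer", parent="t_research")
```

Passing an argument list instead of a shell string avoids quoting problems when task titles contain spaces or quotes.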
Approach B: Shared SQLite Direct Read
If shujietai and Hermes are co-hosted, read ~/.hermes/kanban.db directly (WAL mode allows concurrent reads). ShuJieTai's cockpit API already has build_runtime_state() — add a kanban panel polling task stats.
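A polling helper for such a panel might look like the sketch below. It assumes a tasks table with a status column, which matches the tasks.status reference earlier in this post; the kanban_stats name is hypothetical. The database is opened read-only through a SQLite URI so the panel cannot write to the board by accident.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def kanban_stats(db_path):
    """Return {status: count} from a kanban DB opened in read-only mode."""
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT status, COUNT(*) FROM tasks GROUP BY status"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


# Demo against a throwaway database instead of ~/.hermes/kanban.db.
demo_path = os.path.join(tempfile.mkdtemp(), "kanban-demo.db")
writer = sqlite3.connect(demo_path)
writer.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT)")
writer.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [("t_1", "ready"), ("t_2", "running"), ("t_3", "running")],
)
writer.commit()
writer.close()

stats = kanban_stats(demo_path)
```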
Note: shujietai's dispatch layer has its own state machine (queued → running → completed/failed/aborted) which doesn't map 1:1 to kanban's 7 statuses. Build a mapping layer rather than modifying kanban itself.
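Such a mapping layer can start as a lookup table. The specific choices below are assumptions for illustration, in particular showing blocked as a running dispatch task with an awaiting-input flag; failed and aborted have no direct kanban counterpart in this sketch and would need their own rules.

```python
# Hypothetical mapping from kanban statuses to shujietai dispatch states.
KANBAN_TO_DISPATCH = {
    "triage": "queued",
    "todo": "queued",
    "ready": "queued",
    "running": "running",
    "blocked": "running",    # surfaced in the UI as "awaiting input"
    "done": "completed",
    "archived": "completed",
}


def to_dispatch_view(kanban_status):
    """Translate a kanban status into the fields a dispatch UI would render."""
    if kanban_status not in KANBAN_TO_DISPATCH:
        raise ValueError(f"unknown kanban status: {kanban_status}")
    return {
        "state": KANBAN_TO_DISPATCH[kanban_status],
        "awaiting_input": kanban_status == "blocked",
    }


blocked_view = to_dispatch_view("blocked")
```

Keeping the translation in one place means kanban's status set can stay untouched while the dispatch UI evolves.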
Key Gotchas
- API Server binding guard: setting host: 0.0.0.0 without API_SERVER_KEY silently falls back to 127.0.0.1. Docker containers must set API_SERVER_KEY correctly.
- API key mismatch: shujietai's HERMES_API_KEY must match the Hermes gateway's API_SERVER_KEY; otherwise requests fail with 401.
- Hot-reloading .env: docker compose restart does NOT re-read .env; use docker compose up -d --force-recreate instead.
- Session-dispatch gap: shujietai's dispatch_tasks and sessions tables don't auto-sync; after creating a dispatch task, explicitly call store.ingest() to create the session record.
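The first two gotchas are easy to catch with a pre-flight check before deployment. The sketch below only models the behavior described above; effective_bind_host and preflight are hypothetical helpers, not part of Hermes or shujietai.

```python
def effective_bind_host(configured_host, api_server_key):
    """Model the fallback described above: no key means loopback only."""
    if configured_host == "0.0.0.0" and not api_server_key:
        return "127.0.0.1"
    return configured_host


def preflight(gateway_env, shujietai_env, configured_host="0.0.0.0"):
    """Return a list of human-readable configuration problems."""
    problems = []
    server_key = gateway_env.get("API_SERVER_KEY", "")
    client_key = shujietai_env.get("HERMES_API_KEY", "")
    if effective_bind_host(configured_host, server_key) != configured_host:
        problems.append("API_SERVER_KEY is unset, so the server will bind to 127.0.0.1")
    if server_key != client_key:
        problems.append("HERMES_API_KEY does not match API_SERVER_KEY (expect 401)")
    return problems


issues = preflight({"API_SERVER_KEY": ""}, {"HERMES_API_KEY": "secret"})
```

With a missing server key and a non-empty client key, both problems are reported; matching non-empty keys produce an empty list.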
Competitive Landscape
| Dimension | Hermes Kanban | CrewAI | LangGraph | Airflow | Temporal | Prefect |
|---|---|---|---|---|---|---|
| Core model | Durable SQLite board + dispatcher | In-process agent orchestration | Graph state machine | DAG scheduler | Durable workflow engine | Dynamic workflows |
| Persistence | SQLite WAL (local) | In-memory, optional DB | Checkpoint stores (pluggable) | DB (MySQL/PG/SQLite) | Event sourcing (Cassandra/PG/SQLite) | DB (PG/SQLite) |
| Dependencies | parent→child link auto-promotion | Task dependency queue | Graph edges + conditional routing | DAG dependencies | await + signal | Dependencies + conditions |
| Human-in-loop | Native block/unblock + comments | Requires custom HumanInputTool | Requires custom interrupt | Minimal (external sensor) | Native signal + activity | Minimal |
| Crash recovery | Claim TTL + reclaim + run history | No native support | Checkpoint replay | Retry policies | Event sourcing full replay | Retry + cached results |
| Cross-agent coord | Peer — any profile reads/writes any task | Sequential/hierarchical flow | Graph nodes implicitly pass state | Task instance context | Workflow signal/child | Flow references |
| Integration model | CLI + REST + WebSocket + tool calls | Python SDK | Python SDK (LangChain) | Python SDK + REST API | Multi-language SDK + gRPC | Python SDK + REST |
| Multi-tenancy | Native tenant tag + board isolation | None | None | None native | Namespaces | None native |
| Live UI | Built-in dashboard plugin | None native | LangGraph Studio (paid) | Built-in web UI | Web UI | Cloud UI / open-source |
| Pricing | Open source, free | Open source, free | Open source free / Studio paid | Open source, free | Open source free / Cloud paid | Open source free / Cloud paid |
| Best for | Multi-agent persistent collab + human-in-loop | Fast prototyping | Complex graph state flows | Scheduled batch processing | Long transactions / microservice orchestration | Lightweight data pipelines |
Detailed Analysis
CrewAI puts orchestration into Python objects — Agent, Task, Crew as code. Fastest to start, but state lives in memory. Process dies, state dies. Great for proofs of concept and one-shot pipelines.
LangGraph models agent flows as stateful graphs. Conditional edges enable complex branching logic, and checkpoint-based interrupt/resume is a real step above CrewAI. But its "graph as code" model couples orchestration to business logic — changing flow means changing code and redeploying. Studio is a paid product; free-tier debugging is limited.
Airflow is the king of scheduled batch processing. Declarative DAGs, reliable scheduler, huge ecosystem. But it has no native human-in-the-loop primitive — a task requiring human approval must be simulated with external sensors, which is fragile and awkward.
Temporal is the closest philosophical competitor to Hermes Kanban. Event sourcing, automatic crash replay, native support for long transactions and signal interrupts. Steep learning curve (workflow code has strict constraints), but for microservice orchestration and cross-day/cross-week transactions, it's the most robust choice.
Prefect sits between Airflow's weight and CrewAI's lightness — dynamic DAGs, Python-native decorators, decent dashboard. Human-in-the-loop and cross-agent audit trail are weak.
Hermes Kanban's differentiators come down to three things: (1) human-in-the-loop as a first-class citizen — block/unblock/comment isn't bolted on, it's the coordination primitive; (2) peer agent coordination — any profile can read/write any task, not just parent→child; (3) structured handoff — summary + metadata as JSON means downstream agents get a parseable handoff without scraping prose.
10-Minute Quickstart
# 1. Initialize
hermes kanban init
# 2. Make sure gateway is running (hosts the embedded dispatcher)
hermes gateway start
# 3. Create a research task
hermes kanban create "research AI agent orchestration patterns" \
  --assignee researcher --priority 2
# 4. Create a dependency chain (capture the parent id, then reference it)
SCHEMA=$(hermes kanban create "Design auth schema" \
  --assignee backend-dev --json | jq -r .id)
hermes kanban create "Implement auth API" \
  --assignee backend-dev --parent "$SCHEMA"
# 5. Watch in real time
hermes kanban watch
# 6. Check stats
hermes kanban stats
# 7. Open the dashboard
hermes dashboard # click the Kanban tab
From gateway chat
/kanban list
/kanban create "write launch post" --assignee writer --parent t_research
/kanban comment t_abcd "use the 2026 schema, not 2025"
/kanban unblock t_abcd
Creating a task auto-subscribes you to it: you are notified when the task completes or blocks.
When to Use What
| Scenario | Recommendation |
|---|---|
| Quick parallel subtask, no persistence needed | delegate_task |
| Multi-role collaboration with human approval gates | Kanban |
| Scheduled batch processing, data pipelines | Airflow / Prefect |
| Long transactions across microservices | Temporal |
| Complex graph state flows, conditional branching | LangGraph |
| Fast prototype, one-shot pipeline | CrewAI |
Sources
Hermes Agent:
- Repository: https://github.com/NousResearch/hermes-agent (MIT, ⭐7k+)
- Documentation: https://hermes-agent.nousresearch.com/docs/user-guide/features/kanban
- Key source files inspected:
  - hermes_cli/kanban_db.py: SQLite kanban core, 4000+ lines, state machine, claim, dispatch
  - hermes_cli/kanban.py: CLI subcommand entry point
  - hermes_cli/kanban_diagnostics.py: diagnostics and circuit breaker
  - tools/kanban_tools.py: Agent-side kanban_* tool definitions
  - plugins/kanban/dashboard/plugin_api.py: Dashboard REST/WS plugin
  - website/docs/user-guide/features/kanban.md: Official docs (776 lines)
  - website/docs/user-guide/features/kanban-tutorial.md: Official tutorial
CrewAI:
- Repository: https://github.com/crewAIInc/crewAI (MIT, ⭐30k+)
- Documentation: https://docs.crewai.com/
LangGraph:
- Repository: https://github.com/langchain-ai/langgraph (MIT, ⭐10k+)
- Documentation: https://langchain-ai.github.io/langgraph/
Apache Airflow:
- Repository: https://github.com/apache/airflow (Apache-2.0, ⭐40k+)
- Documentation: https://airflow.apache.org/docs/
Temporal:
- Repository: https://github.com/temporalio/temporal (MIT, ⭐12k+)
- Documentation: https://docs.temporal.io/
Prefect:
- Repository: https://github.com/PrefectHQ/prefect (Apache-2.0, ⭐18k+)
- Documentation: https://docs.prefect.io/
shujietai (数据台):
- Project path: /home/guancy/workspace/shujietai (internal project)
- Key files:
  - backend/app/services/dispatch_service.py: dispatch state machine and CRUD
  - backend/app/services/dispatch_worker.py: async worker streaming AI responses
  - backend/app/api/routes_dispatch.py: REST endpoints
  - frontend/src/composables/useDispatchTask.js: frontend task lifecycle