Galen Guan

Skill Creator Ecosystem: From 180K Installs to Hermes Skill Graph 2.0

AI agents are only as capable as the knowledge they carry. Skill systems — modular, self-contained packages that extend agent capabilities — have become the defining differentiator between a generic chatbot and a specialized assistant. But not all skill systems are created equal.

A survey of eight skill categories across the ecosystem surfaced three skill creation frameworks as the heavyweights: Anthropic's Skill Creator (182.8K installs), OpenClaw/Codex's Skill Creator, and our own Hermes Skill Graph 2.0. Each represents a distinct philosophy about how agent knowledge should be packaged, discovered, and composed.

Skills Ecosystem Survey Scorecard

Architecture Philosophy

Anthropic Skill Creator: The 500-Line Blueprint

Anthropic's approach is pragmatic and production-tested. At 182.8K installs, it's the dominant standard. The philosophy is clear: every skill is a folder with a required SKILL.md file and optional bundled resources (scripts, references, assets). The description frontmatter field is the primary triggering mechanism — it's the advertising copy that convinces Claude to consult the skill.

The framework emphasizes three key patterns:

  • Progressive Disclosure: metadata always loaded (~100 words), body on trigger (<500 lines), resources on demand
  • Domain Organization: skills supporting multiple frameworks organize by variant (AWS, GCP, Azure each get their own reference file)
  • Evaluation-Driven Iteration: a formal loop of drafting, running test cases with baselines, quantitative benchmarking, and description optimization
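
The progressive disclosure pattern above can be sketched as a loader with three stages. This is a minimal illustration, not Anthropic's implementation: the `Skill` class, its method names, the `name` key, and the naive `key: value` frontmatter parsing are assumptions made for this example. Only the SKILL.md file, the description field, and the three loading levels come from the framework as described.

```python
from pathlib import Path
import tempfile


class Skill:
    """Hypothetical loader illustrating three disclosure levels."""

    def __init__(self, folder: Path):
        self.folder = folder
        front, _ = self._split()
        # Level 1: only the frontmatter metadata is kept in memory.
        self.metadata = {}
        for line in front.strip().splitlines():
            key, _, value = line.partition(":")
            self.metadata[key.strip()] = value.strip()

    def _split(self) -> tuple[str, str]:
        # Assumes the file starts with frontmatter between two '---' lines.
        text = (self.folder / "SKILL.md").read_text(encoding="utf-8")
        _, front, body = text.split("---", 2)
        return front, body

    def summary(self) -> str:
        # Level 1: the short text that is always available for triggering.
        return f"{self.metadata['name']}: {self.metadata['description']}"

    def body(self) -> str:
        # Level 2: the SKILL.md body is read only when the skill triggers.
        return self._split()[1].strip()

    def resource(self, relative_path: str) -> str:
        # Level 3: bundled resources are read only when explicitly requested.
        return (self.folder / relative_path).read_text(encoding="utf-8")


def demo() -> tuple[str, int, str]:
    """Build a throwaway skill folder and walk through the three levels."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp) / "cloud-deploy"
        (folder / "references").mkdir(parents=True)
        (folder / "SKILL.md").write_text(
            "---\nname: cloud-deploy\n"
            "description: Use when deploying services to a cloud provider.\n"
            "---\n# Cloud deploy\nPick the provider, then read its reference file.\n",
            encoding="utf-8",
        )
        (folder / "references" / "aws.md").write_text("AWS notes", encoding="utf-8")
        skill = Skill(folder)
        return (
            skill.summary(),
            len(skill.body().splitlines()),
            skill.resource("references/aws.md"),
        )
```

The per-provider reference file (`references/aws.md` here) mirrors the domain organization pattern: the body stays short and points to a variant file that is read only when that variant is needed.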

The strengths are obvious. The evaluation methodology is rigorous — parallel subagent runs with/without the skill, grading by specialized grader agents, and a browser-based review viewer. The description optimization pipeline (20 eval queries, train/test split, iterative refinement) is genuinely sophisticated.

But the weaknesses are equally clear. The evaluation infrastructure is heavy — it requires subagents, Python scripts, HTML viewers, and multi-iteration benchmarking. It's designed for Claude Code, with claude.ai getting a stripped-down version that skips benchmarking and blind comparison. The 500-line limit pushes for lean skills, but the evaluation overhead means teams often skip the optimization phase entirely.

Most critically from a composability standpoint: Anthropic skills are flat. The requires_skills and feeds_into edges that define a graph are absent. Each skill is an island — the framework provides no mechanism for declaring that "this skill depends on that skill" or "this skill produces output consumed by that skill." In practice, this means every skill carries redundant instructions because it can't assume any other skill will be loaded.

OpenClaw/Codex Skill Creator: The Ecosystem Builder

OpenClaw's steipete/clawdis@skill-creator (and its twin in the Codex ecosystem) takes a different approach. Where Anthropic focuses on evaluation rigor, OpenClaw focuses on ecosystem scale.

The OpenClaw skill library is massive — 50+ skills covering email, smart home, social media, music, notes, and productivity. The frontmatter includes rich metadata: emoji for visual identification, os constraints for platform awareness, requires.bins for dependency validation, and install objects with auto-install instructions for package managers.

The strength is discoverability and operational readiness. Each skill knows what it needs and can tell the platform how to get it. The metadata.openclaw structure enables automated compatibility checking and one-click installation.
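
As a rough sketch of what automated compatibility checking could look like, the function below validates a skill's declared platform and binaries before it is offered to the agent. The dictionary shape, the OS strings, and the function name are assumptions for illustration; the source only establishes that the frontmatter carries emoji, os, requires.bins, and install fields. `shutil.which` and `sys.platform` are standard-library calls.

```python
import shutil
import sys


def check_compatibility(meta: dict, platform: str = sys.platform, which=shutil.which) -> list[str]:
    """Return a list of problems; an empty list means the skill can run here.

    `meta` is a hypothetical, already-parsed metadata dictionary.
    `which` is injectable so the check can be exercised without real binaries.
    """
    problems = []
    allowed_os = meta.get("os", [])
    # An empty or missing os list is treated as "no platform constraint".
    if allowed_os and platform not in allowed_os:
        problems.append(f"unsupported os: {platform}")
    for binary in meta.get("requires", {}).get("bins", []):
        if which(binary) is None:
            problems.append(f"missing binary: {binary}")
    return problems


# Hypothetical metadata for a tmux skill.
TMUX_META = {
    "emoji": "🧵",
    "os": ["darwin", "linux"],
    "requires": {"bins": ["tmux"]},
}
```

A platform could call `check_compatibility(TMUX_META)` at discovery time and fall back to the skill's install instructions when the result reports a missing binary.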

The weakness? Composition is virtually non-existent. Like Anthropic, OpenClaw skills are standalone modules. There's no tier system, no dependency graph, and no concept of skill composition. A tmux skill can't declare that it depends on a session-management skill — it must redundantly include those instructions or hope the agent knows them already.

Codex's version adds init_skill.py and package_skill.py scripts for scaffolding and distribution, and the philosophy of "Codex is already very smart" pushes for extreme conciseness. But the architectural weakness remains: these are tools for creating individual skills, not for building skill systems.

Hermes Skill Graph 2.0: Composition as First Principle

Hermes takes the synthesis route. Instead of choosing between evaluation rigor and ecosystem scale, it asks: what if skills composed like software?

The four-tier system (atoms, molecules, compounds, runbooks) is the architectural innovation. An atom is a single-purpose capability that never calls other skills. A molecule chains 2-10 atoms with explicit instructions. A compound orchestrates multiple molecules with human checkpoints between phases. A runbook captures one-off recovery recipes.

The key difference is the dependency graph. Every molecule declares its requires_skills and feeds_into edges. This transforms a flat collection of files into a navigable graph where loading one compound automatically pulls in 10-50 atoms of supporting context. The result: 5 compounds can produce 500 atomic work units with the same cognitive load as driving 5 atoms directly.
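
To make the graph idea concrete, here is a minimal sketch of resolving what a single entry point pulls in. It assumes skills are held in an in-memory dictionary; the `SkillNode` class, the function names, and the example skill names are hypothetical, while the tier names and the requires_skills and feeds_into fields follow the description above.

```python
from dataclasses import dataclass, field


@dataclass
class SkillNode:
    name: str
    tier: str  # "atom", "molecule", "compound", or "runbook"
    requires_skills: list[str] = field(default_factory=list)
    feeds_into: list[str] = field(default_factory=list)  # recorded, not used for loading


def validate(graph: dict[str, SkillNode]) -> None:
    """Enforce the tier rule that atoms never depend on other skills."""
    for node in graph.values():
        if node.tier == "atom" and node.requires_skills:
            raise ValueError(f"atom {node.name} must not require other skills")
        for dep in node.requires_skills:
            if dep not in graph:
                raise ValueError(f"{node.name} requires unknown skill {dep}")


def load_order(graph: dict[str, SkillNode], entry: str) -> list[str]:
    """Return every skill reachable from `entry`, dependencies first."""
    validate(graph)
    order: list[str] = []
    done: set[str] = set()
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        visiting.add(name)
        for dep in graph[name].requires_skills:
            visit(dep)
        visiting.discard(name)
        done.add(name)
        order.append(name)

    visit(entry)
    return order


# Hypothetical graph: one compound, two molecules, three atoms.
GRAPH = {
    "publish": SkillNode("publish", "compound", ["write", "ship"]),
    "write": SkillNode("write", "molecule", ["outline", "lint"], feeds_into=["ship"]),
    "ship": SkillNode("ship", "molecule", ["lint", "commit"]),
    "outline": SkillNode("outline", "atom"),
    "lint": SkillNode("lint", "atom"),
    "commit": SkillNode("commit", "atom"),
}
```

Calling `load_order(GRAPH, "publish")` returns `["outline", "lint", "write", "commit", "ship", "publish"]`: one entry point loads all six skills, and the shared `lint` atom is loaded once rather than duplicated in each molecule.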

The adaptation methodology is also distinctive. When integrating third-party skills, Hermes enforces a fixed order: a cost model check first (before any source review), then a security audit, tier mapping, rewritten frontmatter, and zero-modification copying of upstream scripts. This is hard-won wisdom: the Skywork project invested significant effort integrating 7 skills, only to remove them all because the $19.99/month cost made them unusable.
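
The ordering matters because the cheapest check should be able to stop the most expensive work. A minimal sketch of such a gate sequence follows, assuming each candidate skill is described by a plain dictionary. The field names (`monthly_cost_usd`, `budget_usd`, `audit_findings`, `tier`) and the check functions are hypothetical placeholders, not Hermes APIs, and the frontmatter rewrite and script copy steps are omitted because they are actions rather than pass/fail gates.

```python
from typing import Callable

Candidate = dict
Check = Callable[[Candidate], bool]


def cost_model_ok(candidate: Candidate) -> bool:
    # Reject anything whose recurring cost exceeds the stated budget.
    return candidate.get("monthly_cost_usd", 0.0) <= candidate.get("budget_usd", 0.0)


def security_audit_ok(candidate: Candidate) -> bool:
    # Placeholder: passes only when no audit findings were recorded.
    return not candidate.get("audit_findings")


def tier_mapped(candidate: Candidate) -> bool:
    return candidate.get("tier") in {"atom", "molecule", "compound", "runbook"}


# The cost model check runs first, before any source review.
PIPELINE: list[tuple[str, Check]] = [
    ("cost model", cost_model_ok),
    ("security audit", security_audit_ok),
    ("tier mapping", tier_mapped),
]


def evaluate(candidate: Candidate) -> tuple[bool, list[str]]:
    """Run the gates in order and stop at the first failure.

    Returns (accepted, names of the gates that were actually run).
    """
    ran: list[str] = []
    for name, check in PIPELINE:
        ran.append(name)
        if not check(candidate):
            return False, ran
    return True, ran


# Hypothetical candidate: a paid skill evaluated against a zero budget.
PAID_SKILL = {"monthly_cost_usd": 19.99, "budget_usd": 0.0, "tier": "atom"}
```

With these assumptions, `evaluate(PAID_SKILL)` returns `(False, ["cost model"])`: the candidate is rejected before any audit or tier mapping effort is spent on it.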

The weakness of this approach is upfront investment. Building a solid atom library requires discipline — every atom must be rock-solid before any molecule can be reliable, and every molecule must explicitly declare its dependencies. The worktree isolation requirement (all git operations happen in isolated branches) adds ceremony. But the payoff is leverage: once the base is solid, driving compounds produces exponential returns.

Where Hermes Differs Fundamentally

| Dimension | Anthropic | OpenClaw/Codex | Hermes 2.0 |
|---|---|---|---|
| Composition | Flat: no dependency graph | Flat: no dependency graph | Tiered graph with requires_skills and feeds_into |
| Evaluation | Rigorous: parallel runs, baselines, benchmarking | Lightweight: validation scripts, packaging | Integrated: code review, security audit, forensics |
| Third-party integration | Not addressed | Not addressed | Explicit methodology with cost-model-first filtering |
| Trigger mechanism | Description optimization pipeline | Description field + agent inference | Description + AGENTS.md triggers + skill routing |
| Scaling strategy | More skills = more options | More skills = more surface area | Better composition = exponential leverage |
| Target audience | Claude Code / claude.ai | Codex CLI / OpenClaw | Hermes Agent (terminal-first) |

The comparison reveals a deeper pattern: the skill ecosystem is maturing from "more skills" to "better composition." Anthropic's 182.8K installs prove the demand. OpenClaw's 50+ skills prove the surface area. Hermes proves that composition transforms both: 50 well-composed skills produce far more leverage than 500 standalone ones.

The Leverage Principle in Practice

This is not theoretical. In our local skill library, the blog publishing pipeline demonstrates the difference concretely:

guancyxx-blog-playbook (compound — 1 entry point)
├── blog-content-authoring (molecule — orchestrates 3 atoms)
├── blog-quality-gate (molecule — pre-commit + post-publish audit)
├── blog-diagram-insert (atom — excalidraw generation)
├── nextjs-seo-optimization (molecule — audit + optimization)
├── blog-build-deploy (molecule — docker rebuild + SSH deploy)
└── git-safe-commit-push (atom — safe git workflow)

Result: 1 compound → 4 molecules → 8+ atoms → 30+ distinct operations

Loading guancyxx-blog-playbook gives the agent access to the entire publishing pipeline. The alternative — loading each atom individually — would require 8 separate invocations and miss the decision logic that routes between "quick post," "quality post," and "full pipeline" paths.

Recommendations

If you're building a skill ecosystem or choosing a framework:

  1. For quick iteration and evaluation rigor: Anthropic's framework is battle-tested. Use it if you need formal benchmarking and have the infrastructure for parallel subagent runs.

  2. For ecosystem breadth and discoverability: OpenClaw's metadata-rich format enables dependency validation, automated compatibility checking, and one-click installation. Use it if you value operational readiness over composition.

  3. For composable, scaling systems: Hermes Skill Graph 2.0 is the right choice when you want leverage — where loading one compound unlocks dozens of atomic operations. The upfront investment in tier discipline pays off exponentially as the skill library grows.

The future of agent skills is not more skills — it's better composition. A skill that knows how to chain with other skills is worth ten standalone ones. That's the lesson the Hermes adaptation of quality-playbook demonstrates: sometimes the best skill isn't the one with the most installs, but the one that integrates most cleanly into a graph that already exists.