Image Generation Agent Skills: Code-Driven Images First, Pixel Images Only When Necessary
Image generation is among the most token-hungry tasks an AI agent performs. In a blog scenario, a single 1024×1024 pixel image may consume thousands of tokens, while a well-designed SVG architecture diagram needs only a few hundred characters. Based on an audit of our 9 existing visual skills and a competitive analysis, we propose a cost optimization strategy: code-driven images first, pixel images only when necessary.
Image Generation Skills on GitHub
GitHub image-generation agent skills mainly appear in framework tool layers:
- LangChain Tools: dall-e-tool, stable-diffusion-tool, wrapping API calls
- AutoGPT/AgentGPT plugins: image-gen plugin, calling DALL-E/SD APIs
- CrewAI Tools: DallETool, StableDiffusionTool
- Semantic Kernel Plugins: ImageGenPlugin (DALL-E 3)
Common trait: these are essentially wrappers around paid APIs, which our policy rules out (paid API = auto SKIP).
Our Existing Visual Skill Capability Matrix
| Skill | Output Type | Token Cost | Use Case |
|---|---|---|---|
| architecture-diagram | Dark-themed SVG | Very low | System/cloud/infra diagrams |
| baoyu-infographic | Infographic | Low | Data visualization/comparison |
| baoyu-comic | Knowledge comic | Medium | Tutorial/biography/story |
| p5js | Generative art/interactive | Very low | Creative/art/shaders |
| pixel-art | Pixel art | Low | Retro style/games |
| claude-design | HTML prototype | Low | Landing pages/UI design |
| sketch | HTML mockup | Low | Quick prototype comparison |
| excalidraw | Hand-drawn diagram | Low | Whiteboard/flowcharts |
| comfyui | Pixel image (local SD/Flux) | Medium-High | Photorealistic/stylized |
Key finding: We have 9 visual skills, 7 of which output code-driven images (SVG/HTML/Canvas) with extremely low token cost. The only pixel image generator is comfyui (local ComfyUI), free to run but GPU-intensive.
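To make the low cost of code-driven output concrete, here is a minimal sketch that builds a two-node dark-themed diagram as an SVG string and prints its character count. It is not the output of the architecture-diagram skill; the helper names (`box`, `tiny_diagram`), colors, and layout are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def box(x, y, label, w=120, h=40):
    """Return SVG markup for one labelled rectangle (hypothetical helper)."""
    cx, cy = x + w // 2, y + h // 2 + 5
    return (
        f'<rect x="{x}" y="{y}" width="{w}" height="{h}" rx="6" '
        f'fill="#1e293b" stroke="#38bdf8"/>'
        f'<text x="{cx}" y="{cy}" fill="#e2e8f0" font-size="14" '
        f'text-anchor="middle">{label}</text>'
    )


def tiny_diagram():
    """Build a two-node diagram (Client -> Server) on a dark background."""
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 340 80">',
        '<rect width="340" height="80" fill="#0f172a"/>',
        box(20, 20, "Client"),
        '<line x1="140" y1="40" x2="200" y2="40" stroke="#38bdf8" stroke-width="2"/>',
        box(200, 20, "Server"),
        "</svg>",
    ]
    return "".join(parts)


svg = tiny_diagram()
ET.fromstring(svg)  # raises ParseError if the markup is not well-formed XML
print(len(svg), "characters")
```

The whole image is plain text that the agent emits directly, so its cost scales with the number of shapes rather than with the pixel dimensions.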
Competitive Comparison
| Dimension | Ours | DALL-E API | SD Local | Midjourney | Flux | Recraft |
|---|---|---|---|---|---|---|
| Text-to-image | comfyui | Yes | Yes | Yes | Yes | Yes |
| Paid API | No (policy) | $0.04-0.12/img | Free (local) | $10+/mo | Free (local) | Free/Paid |
| Photorealistic | Missing | Strong | Strong | Strong | Very strong | Strong |
| Vector/SVG | baoyu-infographic | No | No | No | No | Yes |
| Architecture diagram | architecture-diagram | No | No | No | No | No |
| Code-driven | p5js/flowforge | No | No | No | No | No |
| Offline/privacy | comfyui | No | Yes | No | Yes | No |
What are we missing? Two things: photorealistic text-to-image (comfyui with local models can fill this gap, but it requires large models to be pre-installed) and a unified text-to-image interface (the current skills operate independently).
Blog Image Cost Optimization Strategy
Based on token consumption analysis and user feedback, blog images should follow these principles:
Principle 1: Code-Driven Images First
SVG/HTML/Canvas images cost 1-2 orders of magnitude fewer tokens than pixel images. Use architecture-diagram (SVG) for architecture, baoyu-infographic for data comparison, excalidraw for flowcharts. Only consider pixel images when code-driven images can't express the concept.
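The preference order above can be sketched as a small routing function. The skill names come from this post; the dictionary keys, the `pick_skill` function, and its `code_can_express` flag are hypothetical names introduced only for this example.

```python
# Mapping taken from the paragraph above; the keys are hypothetical labels.
CODE_DRIVEN_SKILLS = {
    "architecture": "architecture-diagram",
    "data_comparison": "baoyu-infographic",
    "flowchart": "excalidraw",
}
PIXEL_FALLBACK = "comfyui"


def pick_skill(kind: str, code_can_express: bool = True) -> str:
    """Prefer a code-driven skill; use pixels only when code cannot express the concept."""
    if code_can_express:
        if kind in CODE_DRIVEN_SKILLS:
            return CODE_DRIVEN_SKILLS[kind]
        # Unknown kinds are rejected so the caller has to decide explicitly.
        raise ValueError(f"no code-driven skill registered for {kind!r}")
    return PIXEL_FALLBACK


print(pick_skill("architecture"))
print(pick_skill("portrait", code_can_express=False))
```

Raising on an unknown kind, instead of silently falling back to comfyui, keeps the pixel path an explicit decision.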
Principle 2: Only When Necessary for Reading Quality
Not every article needs images. Purely textual technical analysis can be zero-image. Images should serve one of these purposes:
- Explain spatial relationships hard to describe in text (architecture diagrams)
- Show data comparison trends (infographics)
- Enhance memory anchors (visualization of key concepts)
- Demonstrate actual UI/visual effects (screenshots)
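One way to enforce this rule is to require every proposed image to name the purpose it serves and to drop the ones that cannot. The sketch below assumes a hypothetical `Purpose` enum mirroring the four bullets and a hypothetical `plan_images` filter; neither exists in the current skills.

```python
from enum import Enum


class Purpose(Enum):
    """The four purposes listed above (enum and member names are assumptions)."""
    SPATIAL = "explain spatial relationships"
    DATA_TREND = "show data comparison trends"
    MEMORY_ANCHOR = "anchor a key concept"
    UI_DEMO = "demonstrate actual UI or visual effects"


def plan_images(candidates):
    """Keep only (description, purpose) pairs tagged with a recognised purpose."""
    return [(desc, p) for desc, p in candidates if isinstance(p, Purpose)]


proposed = [
    ("system topology", Purpose.SPATIAL),
    ("decorative banner", None),  # serves no listed purpose, so it is dropped
]
kept = plan_images(proposed)
print([desc for desc, _ in kept])
```

A purely textual analysis would pass an empty list and end up with zero images, which matches the principle.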
Principle 3: Minimize Pixel Images
When pixel images are unavoidable:
- Limit resolution to 512×512 or 768×512; do not generate at 1024 pixels or above
- Use WebP instead of PNG, which can cut file size by 50% or more
- Run ComfyUI in an economy configuration: 15-20 sampling steps and a reduced CFG scale
- Generate once per theme and reference the result across posts
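The resolution and step limits above could be enforced before a request reaches ComfyUI. This is a minimal sketch under stated assumptions: `clamp_resolution` and `clamp_steps` are hypothetical helpers, not part of the comfyui skill, and no CFG value is set because the post does not give one.

```python
ALLOWED_SIZES = [(512, 512), (768, 512)]  # the two sizes permitted above
STEP_RANGE = (15, 20)                     # economy step window from the list above


def clamp_resolution(width: int, height: int) -> tuple[int, int]:
    """Map a requested size to the allowed size with the closest aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    return min(ALLOWED_SIZES, key=lambda s: abs(s[0] / s[1] - ratio))


def clamp_steps(steps: int) -> int:
    """Force the sampling step count into the economy window."""
    low, high = STEP_RANGE
    return max(low, min(high, steps))


print(clamp_resolution(1024, 1024), clamp_steps(50))
print(clamp_resolution(1920, 1080), clamp_steps(18))
```

Choosing by aspect ratio means a square request maps to 512×512 and a widescreen request maps to 768×512, so no caller can reach 1024 pixels by accident.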
Principle 4: Cache and Reuse
Set up an image cache library. Generate each theme or concept once and reference it across blog posts, rather than regenerating similar images.
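A cache of this kind can be as simple as a directory keyed by a hash of the normalised theme. The sketch below is one possible shape, not an existing component: `cache_key`, `get_or_generate`, and the `generate` callback are hypothetical, and the fake generator stands in for a real skill call.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(theme: str) -> str:
    """Normalise case and whitespace so near-identical themes share one key."""
    normalised = " ".join(theme.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:16]


def get_or_generate(theme: str, generate, cache_dir: Path, ext: str = "svg") -> Path:
    """Return the cached file for a theme, calling generate() only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(theme)}.{ext}"
    if not path.exists():
        path.write_bytes(generate(theme))
    return path


calls = []


def fake_generate(theme: str) -> bytes:
    """Stand-in for a real skill invocation; records each call."""
    calls.append(theme)
    return b"<svg/>"


with tempfile.TemporaryDirectory() as tmp:
    first = get_or_generate("Agent  Memory", fake_generate, Path(tmp))
    second = get_or_generate("agent memory", fake_generate, Path(tmp))
    cached_bytes = first.read_bytes()

print(first == second, len(calls))
```

Exact-match hashing only catches themes that differ in case or spacing; deciding that two differently worded themes are "similar" would need a separate matching step.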
Conclusion
GitHub image-generation skills are essentially paid API wrappers that we don't need to replicate. The core gaps are photorealistic text-to-image and a unified interface, but neither is essential for the blog scenario. Adhering to "code-driven images first, pixel images minimized, only when necessary for reading quality" significantly reduces token cost without sacrificing content quality.
Sources:
- ComfyUI: https://github.com/comfyanonymous/ComfyUI (GPL-3.0, 70k+ stars)
- DALL-E API: https://platform.openai.com/docs/guides/images (Paid)
- Stable Diffusion: https://github.com/Stability-AI/stablediffusion (code MIT, model weights CreativeML Open RAIL++-M, 38k+ stars)
- Flux: https://github.com/black-forest-labs/flux (Apache-2.0)
- Recraft: https://www.recraft.ai/ (Free+Paid)
- architecture-diagram: Hermes Agent built-in skill
- baoyu-infographic: Hermes Agent built-in skill
- comfyui: Hermes Agent built-in skill