Image Generation Agent Skills: Code-Driven Images First, Pixel Images Only When Necessary

Image generation is one of the most token-hungry things an AI agent does. In a blog scenario, a single 1024×1024 pixel image may consume thousands of tokens, while a well-designed SVG architecture diagram needs only a few hundred characters of markup. Based on an audit of our 9 existing visual skills and a competitive analysis, we propose a cost optimization strategy: code-driven images first, pixel images only when necessary.

Image Generation Strategy

Image Generation Skills on GitHub

On GitHub, image-generation agent skills mostly live in the tool layers of agent frameworks:

LangChain Tools: dall-e-tool, stable-diffusion-tool, wrapping API calls
AutoGPT/AgentGPT plugins: image-gen plugin, calling DALL-E/SD APIs
CrewAI Tools: DallETool, StableDiffusionTool
Semantic Kernel Plugins: ImageGenPlugin (DALL-E 3)

Common trait: these are essentially wrappers around paid APIs, so our policy rules them out (paid API = auto SKIP).

Our Existing Visual Skill Capability Matrix

Skill	Output Type	Token Cost	Use Case
architecture-diagram	Dark-themed SVG	Very low	System/cloud/infra diagrams
baoyu-infographic	Infographic	Low	Data visualization/comparison
baoyu-comic	Knowledge comic	Medium	Tutorial/biography/story
p5js	Generative art/interactive	Very low	Creative/art/shaders
pixel-art	Pixel art	Low	Retro style/games
claude-design	HTML prototype	Low	Landing pages/UI design
sketch	HTML mockup	Low	Quick prototype comparison
excalidraw	Hand-drawn diagram	Low	Whiteboard/flowcharts
comfyui	Pixel image (local SD/Flux)	Medium-High	Photorealistic/stylized

Key finding: of our 9 visual skills, 7 output code-driven images (SVG/HTML/Canvas) at very low token cost. The only pixel image generator is comfyui (local ComfyUI), which is free to run but GPU-intensive.
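To make the cost of a code-driven image concrete, here is a minimal sketch that builds a two-box, dark-themed SVG diagram and prints how many characters of markup it takes. The function names (`box`, `two_box_diagram`), the colors, and the layout are illustrative assumptions; this is not the actual output of the architecture-diagram skill.

```python
from xml.sax.saxutils import escape


def box(x: int, y: int, label: str) -> str:
    """Return one labelled rectangle as SVG markup (hypothetical helper)."""
    return (
        f'<rect x="{x}" y="{y}" width="120" height="40" rx="6" '
        f'fill="#1e293b" stroke="#38bdf8"/>'
        f'<text x="{x + 60}" y="{y + 25}" fill="#e2e8f0" '
        f'font-size="12" text-anchor="middle">{escape(label)}</text>'
    )


def two_box_diagram(left: str, right: str) -> str:
    """Build a tiny diagram: two boxes joined by a connector line."""
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 320 80">'
        '<rect width="320" height="80" fill="#0f172a"/>'
        + box(20, 20, left)
        + '<line x1="140" y1="40" x2="180" y2="40" stroke="#38bdf8"/>'
        + box(180, 20, right)
        + "</svg>"
    )


svg = two_box_diagram("Agent", "Skill")
# Print the markup size so the cost can be measured rather than assumed.
print(len(svg), "characters")
```

The whole image is a short string, so its size can be checked directly before deciding whether a pixel image is worth the extra cost.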

Competitive Comparison

Dimension	Ours	DALL-E API	SD Local	Midjourney	Flux	Recraft
Text-to-image	comfyui	Yes	Yes	Yes	Yes	Yes
Paid API	No (policy)	$0.04-0.12/img	Free (local)	$10+/mo	Free (local)	Free/Paid
Photorealistic	Missing	Strong	Strong	Strong	Very strong	Strong
Vector/SVG	baoyu-infographic	No	No	No	No	Yes
Architecture diagram	architecture-diagram	No	No	No	No	No
Code-driven	p5js/flowforge	No	No	No	No	No
Offline/privacy	comfyui	No	Yes	No	Yes	No

What are we missing? Two things: photorealistic text-to-image out of the box (comfyui can fill this gap with local models, but it requires large models to be installed in advance), and a unified text-to-image interface (the current skills operate independently).

Blog Image Cost Optimization Strategy

Based on token consumption analysis and user feedback, blog images should follow these principles:

Principle 1: Code-Driven Images First

SVG/HTML/Canvas images cost 1-2 orders of magnitude fewer tokens than pixel images. Use architecture-diagram (SVG) for architecture, baoyu-infographic for data comparison, excalidraw for flowcharts. Only consider pixel images when code-driven images can't express the concept.
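The selection rule in Principle 1 can be sketched as a small lookup with a pixel fallback. The skill names come from the capability matrix above; the `Need` enum and the `pick_skill` function are hypothetical names introduced only for illustration, not an existing interface.

```python
from enum import Enum


class Need(Enum):
    """What the image has to communicate (hypothetical categories)."""
    ARCHITECTURE = "architecture"
    DATA_COMPARISON = "data_comparison"
    FLOWCHART = "flowchart"
    PHOTOREALISTIC = "photorealistic"


# Code-driven skills are tried first because they are the cheapest in tokens.
CODE_DRIVEN = {
    Need.ARCHITECTURE: "architecture-diagram",
    Need.DATA_COMPARISON: "baoyu-infographic",
    Need.FLOWCHART: "excalidraw",
}

# Pixel generation is the last resort, used only when no code-driven skill fits.
PIXEL_FALLBACK = "comfyui"


def pick_skill(need: Need) -> str:
    """Return the cheapest skill that can express the given need."""
    return CODE_DRIVEN.get(need, PIXEL_FALLBACK)
```

For example, `pick_skill(Need.FLOWCHART)` returns `"excalidraw"`, while `pick_skill(Need.PHOTOREALISTIC)` falls through to `"comfyui"`.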

Principle 2: Only When Necessary for Reading Quality

Not every article needs images. Purely textual technical analysis can be zero-image. Images should serve one of these purposes:

Explain spatial relationships hard to describe in text (architecture diagrams)
Show data comparison trends (infographics)
Enhance memory anchors (visualization of key concepts)
Demonstrate actual UI/visual effects (screenshots)

Principle 3: Minimize Pixel Images

When pixel images are unavoidable:

Limit resolution to 512×512 or 768×512; do not generate at 1024 or above
Use WebP instead of PNG for a file-size reduction of 50% or more
ComfyUI economy mode: 15-20 steps, reduced CFG
Generate once per theme and reference the result across posts
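The limits above could be enforced before a job ever reaches ComfyUI. The following is a minimal sketch under stated assumptions: `PixelJob` and `economy_job` are hypothetical names, the 1.25 aspect-ratio cutoff for choosing the landscape preset is an arbitrary choice, and no CFG value is set because the text only says "reduced" without giving a number.

```python
from dataclasses import dataclass

# The only two sizes the policy allows.
SQUARE = (512, 512)
LANDSCAPE = (768, 512)

MIN_STEPS, MAX_STEPS = 15, 20  # economy-mode step range from the policy


@dataclass(frozen=True)
class PixelJob:
    width: int
    height: int
    steps: int
    fmt: str


def economy_job(width: int, height: int, steps: int = MAX_STEPS) -> PixelJob:
    """Clamp a requested pixel job to the economy-mode limits."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Assumption: requests clearly wider than square get the landscape preset;
    # everything else (including portrait) falls back to 512x512.
    size = LANDSCAPE if width / height > 1.25 else SQUARE
    clamped_steps = min(max(steps, MIN_STEPS), MAX_STEPS)
    # Output is always WebP; CFG is deliberately left to the caller.
    return PixelJob(size[0], size[1], clamped_steps, "webp")
```

A request such as `economy_job(1920, 1080, steps=50)` is reduced to a 768×512 WebP job at 20 steps, so an oversized request is downgraded rather than rejected.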

Principle 4: Cache and Reuse

Set up an image cache library. Generate each theme or concept once and reference it across blog posts, rather than regenerating similar images.
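One way to sketch such a cache is a directory of generated files plus a JSON index keyed by a hash of the normalized theme. Everything here is an assumption made for illustration: the `ImageCache` class, the `index.json` file name, and the normalization rule (lowercase, collapsed whitespace) are not part of any existing skill.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class ImageCache:
    """File-backed cache: one generated image per theme (hypothetical design)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.index_path = self.root / "index.json"
        if self.index_path.exists():
            self.index = json.loads(self.index_path.read_text(encoding="utf-8"))
        else:
            self.index = {}

    @staticmethod
    def key(theme: str) -> str:
        """Hash the theme after lowercasing and collapsing whitespace."""
        normalized = " ".join(theme.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

    def get_or_create(
        self, theme: str, suffix: str, generate: Callable[[], bytes]
    ) -> Path:
        """Return the cached file for a theme, generating it only on a miss."""
        k = self.key(theme)
        if k in self.index:
            cached = self.root / self.index[k]
            if cached.exists():
                return cached
        path = self.root / f"{k}{suffix}"
        path.write_bytes(generate())
        self.index[k] = path.name
        self.index_path.write_text(json.dumps(self.index, indent=2), encoding="utf-8")
        return path
```

A post would call `cache.get_or_create("agent architecture", ".svg", render_fn)`, where `render_fn` stands for whatever skill produces the bytes. A second post asking for the same theme gets the existing path back without triggering generation.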

Conclusion

GitHub image-generation skills are essentially paid API wrappers that we don't need to replicate. The core gaps are photorealistic text-to-image and a unified interface, but neither is essential for the blog scenario. Adhering to "code-driven images first, pixel images minimized, only when necessary for reading quality" significantly reduces token cost without sacrificing content quality.

Sources:

ComfyUI: https://github.com/comfyanonymous/ComfyUI (GPL-3.0, 70k+ stars)
DALL-E API: https://platform.openai.com/docs/guides/images (Paid)
Stable Diffusion: https://github.com/Stability-AI/stablediffusion (CreativeML, 38k+ stars)
Flux: https://github.com/black-forest-labs/flux (Apache-2.0)
Recraft: https://www.recraft.ai/ (Free+Paid)
architecture-diagram: Hermes Agent built-in skill
baoyu-infographic: Hermes Agent built-in skill
comfyui: Hermes Agent built-in skill