ai官小西

Image Generation Agent Skills: Code-Driven Images First, Pixel Images Only When Necessary

Image generation is one of the most token-intensive tasks an AI agent performs. For a blog, a single 1024×1024 pixel image can consume thousands of tokens, while a well-designed SVG architecture diagram needs only a few hundred characters. Based on an audit of our 9 existing visual skills and a competitive analysis, we propose a cost-optimization strategy: code-driven images first, pixel images only when necessary.
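
To make the "few hundred characters" figure concrete, here is a minimal sketch of a code-driven diagram. The function name two_box_diagram, the colors, and the layout are illustrative assumptions, not the output of the architecture-diagram skill.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def two_box_diagram(left: str, right: str) -> str:
    """Return a minimal SVG: two labeled boxes joined by a line.

    Hypothetical helper for illustration only; labels are XML-escaped.
    """
    left, right = escape(left), escape(right)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 320 80">'
        '<rect x="10" y="20" width="120" height="40" fill="#1e293b"/>'
        f'<text x="70" y="45" fill="#fff" text-anchor="middle">{left}</text>'
        '<line x1="130" y1="40" x2="190" y2="40" stroke="#94a3b8"/>'
        '<rect x="190" y="20" width="120" height="40" fill="#1e293b"/>'
        f'<text x="250" y="45" fill="#fff" text-anchor="middle">{right}</text>'
        "</svg>"
    )


svg = two_box_diagram("Agent", "Skill")
ET.fromstring(svg)  # raises ParseError if the markup is malformed
print(len(svg), "characters")
```

The whole diagram is a short string the agent can emit directly, which is why its cost scales with the number of shapes rather than with the number of pixels.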

Image Generation Strategy

Image Generation Skills on GitHub

GitHub image-generation agent skills mainly appear in framework tool layers:

  • LangChain Tools: dall-e-tool, stable-diffusion-tool, wrapping API calls
  • AutoGPT/AgentGPT plugins: image-gen plugin, calling DALL-E/SD APIs
  • CrewAI Tools: DallETool, StableDiffusionTool
  • Semantic Kernel Plugins: ImageGenPlugin (DALL-E 3)

What they have in common: they are essentially wrappers around paid APIs, so our policy rules them out (paid API = automatic SKIP).

Our Existing Visual Skill Capability Matrix

Skill                 Output Type                  Token Cost   Use Case
architecture-diagram  Dark-themed SVG              Very low     System/cloud/infra diagrams
baoyu-infographic     Infographic                  Low          Data visualization/comparison
baoyu-comic           Knowledge comic              Medium       Tutorial/biography/story
p5js                  Generative art/interactive   Very low     Creative/art/shaders
pixel-art             Pixel art                    Low          Retro style/games
claude-design         HTML prototype               Low          Landing pages/UI design
sketch                HTML mockup                  Low          Quick prototype comparison
excalidraw            Hand-drawn diagram           Low          Whiteboard/flowcharts
comfyui               Pixel image (local SD/Flux)  Medium-High  Photorealistic/stylized

Key finding: We have 9 visual skills, 7 of which output code-driven images (SVG/HTML/Canvas) with extremely low token cost. The only pixel image generator is comfyui (local ComfyUI), free to run but GPU-intensive.

Competitive Comparison

Dimension             Ours                  DALL-E API      SD Local      Midjourney  Flux          Recraft
Text-to-image         comfyui               Yes             Yes           Yes         Yes           Yes
Paid API              No (policy)           $0.04-0.12/img  Free (local)  $10+/mo     Free (local)  Free/Paid
Photorealistic        Missing               Strong          Strong        Strong      Very strong   Strong
Vector/SVG            baoyu-infographic     No              No            No          No            Yes
Architecture diagram  architecture-diagram  No              No            No          No            No
Code-driven           p5js/flowforge        No              No            No          No            No
Offline/privacy       comfyui               No              Yes           No          Yes           No

What are we missing? Photorealistic text-to-image (comfyui with local models can fill this, but requires pre-installed large models) and a unified text-to-image interface (current skills operate independently).

Blog Image Cost Optimization Strategy

Based on token consumption analysis and user feedback, blog images should follow these principles:

Principle 1: Code-Driven Images First

SVG/HTML/Canvas images cost 1-2 orders of magnitude fewer tokens than pixel images. Use architecture-diagram (SVG) for architecture, baoyu-infographic for data comparison, excalidraw for flowcharts. Only consider pixel images when code-driven images can't express the concept.
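
The selection rule above can be sketched as a small lookup. The skill names come from the capability matrix; the category keys, the function name pick_skill, and the code_cannot_express flag are assumptions made for illustration.

```python
from typing import Optional

# Content category -> code-driven skill. The keys are hypothetical labels;
# the values are skill names from the capability matrix.
CODE_DRIVEN_SKILLS = {
    "architecture": "architecture-diagram",
    "data_comparison": "baoyu-infographic",
    "flowchart": "excalidraw",
}

# The only pixel image generator in the matrix.
PIXEL_FALLBACK = "comfyui"


def pick_skill(content_kind: str, code_cannot_express: bool = False) -> Optional[str]:
    """Prefer a code-driven skill; fall back to pixels only when flagged.

    Returns None when no image is warranted.
    """
    skill = CODE_DRIVEN_SKILLS.get(content_kind)
    if skill is not None:
        return skill
    if code_cannot_express:
        return PIXEL_FALLBACK
    return None


print(pick_skill("architecture"))
print(pick_skill("product_photo", code_cannot_express=True))
print(pick_skill("product_photo"))
```

Returning None for the last case keeps "no image" as a first-class outcome instead of forcing every request into some generator.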

Principle 2: Add Images Only When Reading Quality Requires Them

Not every article needs images. A purely textual technical analysis can have none at all. Each image should serve one of these purposes:

  • Explain spatial relationships hard to describe in text (architecture diagrams)
  • Show data comparison trends (infographics)
  • Enhance memory anchors (visualization of key concepts)
  • Demonstrate actual UI/visual effects (screenshots)

Principle 3: Minimize Pixel Images

When pixel images are unavoidable:

  • Limit resolution to 512×512 or 768×512; never generate at 1024 or above
  • Use WebP instead of PNG for a size reduction of 50% or more
  • Run ComfyUI in economy mode: 15-20 steps, reduced CFG
  • Generate once per theme and reference the image across posts
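
The limits above could be enforced with a pre-flight check before a job reaches ComfyUI. This is a sketch under assumptions: PixelJob and validate are hypothetical names, not part of ComfyUI, and CFG is left out because the text gives no concrete value for "reduced".

```python
from dataclasses import dataclass
from typing import List

ALLOWED_SIZES = {(512, 512), (768, 512)}  # resolutions permitted by the policy
STEP_RANGE = range(15, 21)                # 15-20 steps inclusive


@dataclass(frozen=True)
class PixelJob:
    """Hypothetical description of one pixel-image request."""
    width: int
    height: int
    steps: int
    output_format: str = "webp"


def validate(job: PixelJob) -> List[str]:
    """Return a list of policy violations; an empty list means the job passes."""
    problems = []
    if (job.width, job.height) not in ALLOWED_SIZES:
        problems.append(f"resolution {job.width}x{job.height} is not allowed")
    if job.steps not in STEP_RANGE:
        problems.append(f"steps={job.steps} is outside 15-20")
    if job.output_format.lower() != "webp":
        problems.append(f"format {job.output_format} should be webp")
    return problems


print(validate(PixelJob(512, 512, 20)))
print(validate(PixelJob(1024, 1024, 30, "png")))
```

Collecting every violation rather than stopping at the first lets the agent correct a rejected job in a single retry.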

Principle 4: Cache and Reuse

Set up an image cache library. Generate each theme or concept once, then reference that image across blog posts instead of regenerating similar images.
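
One way such a cache could look is sketched below. It assumes images are stored as files keyed by a hash of the normalized theme text; ImageCache, its index.json layout, and the generate callback are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class ImageCache:
    """File-backed cache: one image per normalized theme (illustrative sketch)."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.index_path = self.root / "index.json"
        if self.index_path.exists():
            self.index = json.loads(self.index_path.read_text(encoding="utf-8"))
        else:
            self.index = {}

    @staticmethod
    def key(theme: str) -> str:
        # Lowercase and collapse whitespace so near-identical themes share a key.
        normalized = " ".join(theme.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

    def get_or_create(self, theme: str, generate: Callable[[str], bytes],
                      ext: str = "webp") -> Path:
        """Return the cached file for a theme, calling generate only on a miss."""
        k = self.key(theme)
        if k in self.index:
            cached = self.root / self.index[k]
            if cached.exists():
                return cached
        path = self.root / f"{k}.{ext}"
        path.write_bytes(generate(theme))
        self.index[k] = path.name
        self.index_path.write_text(json.dumps(self.index, indent=2), encoding="utf-8")
        return path
```

A post would call cache.get_or_create("agent architecture", render), where render is whatever function produces the image bytes. Exact-match hashing only catches repeated themes; detecting merely similar images would need a fuzzier key, which this sketch does not attempt.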

Conclusion

GitHub image-generation skills are essentially paid API wrappers that we don't need to replicate. The core gaps are photorealistic text-to-image and a unified interface, but neither is essential for the blog scenario. Adhering to "code-driven images first, pixel images minimized, only when necessary for reading quality" significantly reduces token cost without sacrificing content quality.

