Galen Guan

Graphify: Turn Any Folder into a Queryable Knowledge Graph

If you've ever stared at a folder full of research papers, code files, screenshots, and notes and wished you could just ask it questions — graphify might be exactly what you need. With 37.6k GitHub stars and 4.2k forks, it's one of the fastest-growing developer tools of 2026.

What Graphify Does

Graphify turns any folder of files into a queryable knowledge graph. The concept comes from Andrej Karpathy's /raw folder workflow: drop anything into a directory — papers, tweets, screenshots, code, notes — and get a structured knowledge graph that reveals connections you didn't know were there.

The pipeline is a clean, fixed sequence of stages:

detect() → extract() → build_graph() → cluster() → analyze() → report() → export()

Each stage is a single function in its own module. They communicate through plain Python dicts and NetworkX graphs — no shared state, no side effects outside the output directory.
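
To make the "plain dicts in, plain dicts out" idea concrete, here is a minimal sketch of how such stages could be chained. The two stage bodies and the run_pipeline helper are hypothetical stand-ins written for this article, not graphify's actual functions or signatures.

```python
from typing import Any, Callable

# Hypothetical stage type: takes a plain dict, returns a new plain dict.
Stage = Callable[[dict[str, Any]], dict[str, Any]]


def detect(state: dict[str, Any]) -> dict[str, Any]:
    # Illustration only: pretend two files were found under the root folder.
    return {**state, "files": ["a.py", "notes.md"]}


def extract(state: dict[str, Any]) -> dict[str, Any]:
    # Illustration only: emit one node per detected file and no edges.
    nodes = [{"id": name} for name in state["files"]]
    return {**state, "nodes": nodes, "edges": []}


def run_pipeline(stages: list[Stage], state: dict[str, Any]) -> dict[str, Any]:
    # Each stage receives the previous dict and returns a fresh one,
    # so no stage depends on shared mutable state.
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline([detect, extract], {"root": "./raw"})
print(sorted(result))  # ['edges', 'files', 'nodes', 'root']
```

Because every stage only sees the dict it is handed, any stage can be tested or swapped out on its own.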

Three Things Graphify Does That LLMs Alone Cannot

  1. Persistent graph — Relationships are stored in graph.json and survive across sessions. Ask questions weeks later without re-reading everything.
  2. Honest audit trail — Every edge is tagged EXTRACTED, INFERRED, or AMBIGUOUS. You always know what was found vs. guessed.
  3. Cross-document surprise — Community detection (Leiden algorithm) finds connections between concepts in different files that you'd never think to ask about directly.

Multimodal Input Support

Graphify is fully multimodal. It doesn't just handle code:

Type | Extensions | Extraction Method
Code | .py .ts .js .go .rs .java .c .cpp .rb .cs .kt .scala .php .swift .lua .zig .ps1 .ex .mm .jl .v | AST via tree-sitter + call-graph pass
Docs | .md .txt .rst .html .mdx | Concepts + relationships via LLM
Papers | .pdf | Citation mining + concept extraction
Images | .png .jpg .webp .gif | LLM vision (screenshots, diagrams, any language)
Video/Audio | .mp4 .mp3 .wav .mkv | Whisper transcription → treat as docs

The tree-sitter support covers 23 languages out of the box. This is a significant advantage over tools that rely purely on LLM extraction — for code, AST extraction is deterministic, free (no token cost), and perfectly accurate for structural relationships like imports and call graphs.

The Extraction Pipeline in Detail

Structural Extraction (AST)

For code files, graphify uses tree-sitter to parse the AST and extract:

  • Nodes: functions, classes, methods, modules
  • Edges: imports_from, calls, implements, uses

This pass is deterministic and costs zero tokens. A second call-graph pass adds INFERRED edges for indirect call relationships.
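
As a rough illustration of what a structural pass produces, the sketch below uses Python's built-in ast module as a stand-in for tree-sitter (a deliberate swap, so the example runs without extra dependencies). The extract_structure function, the sample source, and the tuple layout of the edges are assumptions made for this example, not graphify's implementation.

```python
import ast

SOURCE = '''
import os
from json import loads

def helper(x):
    return loads(x)

def main():
    return helper(os.getcwd())
'''


def extract_structure(source: str, module: str = "example"):
    """Return (nodes, edges) for one module using the stdlib parser."""
    tree = ast.parse(source)
    nodes, edges = [module], []
    for item in tree.body:
        if isinstance(item, ast.Import):
            for alias in item.names:
                edges.append((module, "imports_from", alias.name))
        elif isinstance(item, ast.ImportFrom):
            # item.module is None for "from . import x"; fall back to ".".
            edges.append((module, "imports_from", item.module or "."))
        elif isinstance(item, (ast.FunctionDef, ast.ClassDef)):
            nodes.append(item.name)
            # Record direct calls to plain names inside this definition.
            # Attribute calls such as os.getcwd() are skipped in this sketch.
            for sub in ast.walk(item):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    edges.append((item.name, "calls", sub.func.id))
    return nodes, edges


nodes, edges = extract_structure(SOURCE)
print(nodes)  # ['example', 'helper', 'main']
for edge in edges:
    print(edge)
```

The same input always yields the same nodes and edges, which is the property the article attributes to the AST pass.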

Semantic Extraction (LLM)

For docs, papers, and images, graphify dispatches parallel subagents to extract:

  • Nodes: named concepts, entities, citations
  • Edges: cites, conceptually_related_to, semantically_similar_to, rationale_for
  • Hyperedges: groups of 3+ nodes participating in a shared concept

There's also an opt-in Kimi K2.6 backend (enabled via MOONSHOT_API_KEY) that is reported to extract 3-6x richer relations than Claude at roughly a third of the cost per token.

Confidence Labels: The Honesty System

This is what sets graphify apart from naive RAG systems:

Label | Meaning
EXTRACTED | Relationship is explicitly stated in the source (import statement, direct call, citation)
INFERRED | Reasonable deduction (shared data structure, implied dependency, co-occurrence)
AMBIGUOUS | Uncertain; flagged for human review

Every edge also carries a confidence_score between 0.1 and 1.0. The report never hides uncertainty behind symbols — raw numbers are always shown.
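
Here is a small sketch of how the labels and scores could be used downstream. The edge dictionaries are made-up sample data, and the field names other than confidence_score (source, target, relation, confidence) are assumptions about the schema rather than the documented graph.json format.

```python
# Made-up sample edges; the field layout is an assumption for illustration.
edges = [
    {"source": "auth.py", "target": "requests", "relation": "imports_from",
     "confidence": "EXTRACTED", "confidence_score": 1.0},
    {"source": "auth.py", "target": "session.py", "relation": "uses",
     "confidence": "INFERRED", "confidence_score": 0.8},
    {"source": "notes.md", "target": "auth.py", "relation": "rationale_for",
     "confidence": "INFERRED", "confidence_score": 0.4},
    {"source": "paper.pdf", "target": "notes.md", "relation": "cites",
     "confidence": "AMBIGUOUS", "confidence_score": 0.3},
]


def trusted_edges(edges, min_score=0.7):
    """Keep stated facts plus inferences at or above a score threshold."""
    return [
        e for e in edges
        if e["confidence"] == "EXTRACTED"
        or (e["confidence"] == "INFERRED" and e["confidence_score"] >= min_score)
    ]


def needs_review(edges):
    """Collect the edges flagged for a human to check."""
    return [e for e in edges if e["confidence"] == "AMBIGUOUS"]


print([e["target"] for e in trusted_edges(edges)])  # ['requests', 'session.py']
print([e["source"] for e in needs_review(edges)])   # ['paper.pdf']
```

The threshold of 0.7 is arbitrary; the point is that the label and the raw score together let you decide how much guessing you are willing to accept.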

Output Artifacts

Running graphify . produces:

graphify-out/
├── graph.html          Interactive graph — click nodes, search, filter by community
├── obsidian/           Open as Obsidian vault (opt-in via --obsidian)
├── wiki/               Wikipedia-style articles for agent navigation (--wiki)
├── GRAPH_REPORT.md     God nodes, surprising connections, suggested questions
├── graph.json          Persistent graph — query weeks later without re-reading
└── cache/              SHA256 cache — re-runs only process changed files
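
The cache/ entry in the listing above relies on content hashing to skip unchanged files. A minimal sketch of that idea follows, assuming a single JSON file that maps paths to digests; the changed_files helper and the cache layout are hypothetical, and graphify's real cache format may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    # SHA256 of the file's bytes; any edit to the content changes the digest.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths, cache_file: Path):
    """Return the paths whose digest differs from the cached one, then update the cache."""
    old = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    new = {str(p): file_digest(p) for p in paths}
    changed = [p for p in paths if old.get(str(p)) != new[str(p)]]
    cache_file.write_text(json.dumps(new))
    return changed


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "a.py"
    source.write_text("x = 1")
    cache = Path(tmp) / "cache.json"
    print([p.name for p in changed_files([source], cache)])  # ['a.py']
    print([p.name for p in changed_files([source], cache)])  # []
```

The first call sees no cached digest and reports the file as changed; the second call finds a matching digest and reports nothing.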

The GRAPH_REPORT.md is particularly useful — it contains:

  • God nodes: highest-degree concepts (what everything connects through)
  • Surprising connections: ranked by composite score, with plain-English explanations
  • Suggested questions: 4-5 questions the graph is uniquely positioned to answer
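
"Highest-degree" simply means the nodes that touch the most edges. The sketch below counts degrees over a made-up edge list; the function name and the sample concepts are illustrative and say nothing about how graphify ranks its own report.

```python
from collections import Counter


def highest_degree_nodes(edges, top_n=3):
    """Count how many edges touch each node and return the top_n nodes."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree.most_common(top_n)


# Made-up edges for illustration.
edges = [
    ("attention", "softmax"),
    ("attention", "transformer"),
    ("transformer", "optimizer"),
    ("attention", "optimizer"),
]

print(highest_degree_nodes(edges, top_n=1))  # [('attention', 3)]
```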

Token Reduction Benchmark

On a mixed corpus (Karpathy repos + 5 papers + 4 images, 52 files total): 71.5x fewer tokens per query vs reading the raw files. Token reduction scales with corpus size — at 6 files the graph value is structural clarity, not compression.

Installation and Usage

pip install graphifyy && graphify install

The PyPI package is temporarily named graphifyy while the graphify name is being reclaimed. The CLI and skill command are still graphify.

Basic usage:

/graphify .                        # full pipeline on current directory
/graphify ./raw                    # on a specific folder
/graphify ./raw --mode deep        # more aggressive INFERRED edge extraction
/graphify ./raw --update           # incremental - re-extract only changed files
/graphify add https://arxiv.org/abs/1706.03762   # fetch paper, update graph
/graphify query "what connects attention to the optimizer?"
/graphify path "DigestAuth" "Response"
/graphify explain "SwinTransformer"
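
A path query between two nodes boils down to a shortest-path search over the stored edges. The following breadth-first search is a sketch of that idea, not graphify's code; the bfs_path name and every node other than DigestAuth and Response are invented for the example.

```python
from collections import deque


def bfs_path(edges, start, goal):
    """Return the shortest undirected path from start to goal, or None."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Made-up edges; only DigestAuth and Response come from the command above.
edges = [
    ("DigestAuth", "AuthBase"),
    ("AuthBase", "Session"),
    ("Session", "Response"),
    ("DigestAuth", "Request"),
    ("Request", "Response"),
]

print(bfs_path(edges, "DigestAuth", "Response"))
# ['DigestAuth', 'Request', 'Response']
```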

Can You Integrate It Into Your Workflow?

Short answer: yes, and it's already designed for it.

For Claude Code / Codex / OpenCode / Cursor / Gemini CLI / Hermes

Graphify v5 explicitly lists Hermes as a supported platform in its pyproject.toml description. The graphify install command sets up the skill for your agent platform. The skill.md file is a 61K-character orchestrating document that guides the agent through the full pipeline step by step.

MCP Server Integration

With --mcp, graphify starts a stdio MCP server exposing tools: query_graph, get_node, get_neighbors, get_community, god_nodes, graph_stats, shortest_path. This means any MCP-compatible agent can query the graph live:

python3 -m graphify.serve graphify-out/graph.json

Git Hook for Auto-Rebuild

graphify hook install    # installs post-commit hook

After every git commit, the hook detects changed code files, re-runs AST extraction, and rebuilds the graph. No background process needed.

Watch Mode for Agentic Workflows

graphify . --watch       # auto-sync graph as files change

Code file saves trigger an instant rebuild (AST only, no LLM). Doc/image changes notify you to run --update for the LLM re-pass. Useful when multiple agents are writing code in parallel.
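
The split between "rebuild now" and "run --update later" can be pictured as a simple routing step on file extension. This sketch reuses the extension lists from the table earlier in the article; the plan_rebuild function and its return shape are hypothetical, and how graphify treats file types not mentioned for watch mode is not covered here.

```python
from pathlib import Path

# Extension sets copied from the input-type table in this article.
CODE_EXTS = {
    ".py", ".ts", ".js", ".go", ".rs", ".java", ".c", ".cpp", ".rb", ".cs",
    ".kt", ".scala", ".php", ".swift", ".lua", ".zig", ".ps1", ".ex", ".mm",
    ".jl", ".v",
}
DOC_AND_IMAGE_EXTS = {
    ".md", ".txt", ".rst", ".html", ".mdx", ".png", ".jpg", ".webp", ".gif",
}


def plan_rebuild(changed_paths):
    """Sort changed files into an AST rebuild, an LLM update, or neither."""
    plan = {"ast_rebuild": [], "needs_update": [], "other": []}
    for path in changed_paths:
        ext = Path(path).suffix.lower()
        if ext in CODE_EXTS:
            plan["ast_rebuild"].append(path)   # cheap: no LLM call needed
        elif ext in DOC_AND_IMAGE_EXTS:
            plan["needs_update"].append(path)  # requires the LLM re-pass
        else:
            plan["other"].append(path)         # not covered by this sketch
    return plan


plan = plan_rebuild(["src/app.py", "docs/README.md", "img/diagram.PNG", "data.csv"])
print(plan["ast_rebuild"])   # ['src/app.py']
print(plan["needs_update"])  # ['docs/README.md', 'img/diagram.PNG']
```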

Security Model

Graphify is a local development tool with no server component (unless you explicitly start the MCP server). All external input passes through security.py:

  • URLs are restricted to http/https, and private IPs and cloud metadata endpoints are blocked
  • Content size caps at 50 MB for downloads, 10 MB for text
  • Path traversal blocked — graph paths must resolve inside graphify-out/
  • XSS prevention — all node labels are HTML-escaped before visualization
  • Prompt injection protection — labels sanitized in MCP text output
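
Several of the checks listed above can be approximated with the standard library alone. The sketch below is an illustration of the general techniques, not the contents of graphify's security.py; the three function names are hypothetical, and a production URL check would also need to resolve hostnames before deciding.

```python
import html
import ipaddress
from pathlib import Path
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Allow only http/https and reject literal private or link-local IPs."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        address = ipaddress.ip_address(parts.hostname)
    except ValueError:
        # A hostname rather than a literal IP. This sketch stops here;
        # a real check would resolve it and test the resulting addresses.
        return True
    return not (address.is_private or address.is_loopback or address.is_link_local)


def resolves_inside(base: Path, candidate: str) -> bool:
    """True if candidate, joined to base, still resolves under base."""
    base = base.resolve()
    target = (base / candidate).resolve()
    return target == base or base in target.parents


def escape_label(label: str) -> str:
    # Neutralise HTML special characters before a label reaches the page.
    return html.escape(label)


print(is_safe_url("https://arxiv.org/abs/1706.03762"))             # True
print(is_safe_url("http://169.254.169.254/latest/meta-data/"))     # False
print(resolves_inside(Path("graphify-out"), "../secrets.txt"))     # False
print(escape_label("<script>alert(1)</script>"))
# &lt;script&gt;alert(1)&lt;/script&gt;
```

The 169.254.169.254 address is the link-local endpoint commonly used for cloud instance metadata, which is why it appears in the example.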

The project is MIT-licensed and actively maintained (214 commits, 70 tags, current version 0.5.5 on the v5 branch).

Practical Integration Paths

For developers already using AI coding assistants, here are concrete integration paths:

  1. As a skill — Install via graphify install, then type /graphify . in your agent to build a graph of your project. The agent understands the graph and can answer structural questions about your codebase.

  2. As an MCP server — Add the MCP server to your agent's config for live graph querying. Other agents can discover relationships without reading every file.

  3. As a library: import graphify gives you programmatic access to extract(), build_from_json(), cluster(), god_nodes(), etc. Build custom pipelines.

  4. As a CI/CD step — The git hook or watch mode keeps the graph current automatically. Point your documentation generator or onboarding tool at GRAPH_REPORT.md.

Limitations and Caveats

  • Requires Python 3.10–3.13 — tree-sitter bindings don't support 3.14 yet
  • Semantic extraction costs tokens — the AST pass is free, but doc/paper/image extraction uses LLM calls. Budget accordingly.
  • Large graphs (>5000 nodes) — HTML visualization switches to aggregated community view. Full node-level detail via Obsidian vault export.
  • Corpus size warning — Over 200 files or 2M words, graphify asks you to pick a subdirectory first.
  • The PyPI name is graphifyy (double-y) — temporary while the original name is reclaimed.

Conclusion

Graphify solves a real and growing problem: as our personal and professional knowledge bases grow to encompass code, papers, notes, screenshots, and documents, we need a way to navigate the connections between them — not just search within each file. The combination of deterministic AST extraction, honest confidence labeling, community detection, and persistent graph storage makes it a genuinely useful tool, not just a demo.

The 71x token reduction benchmark is compelling, but the real value is structural: you discover connections you didn't know existed. That's worth more than any compression ratio.

