Graphify: Turn Any Folder into a Queryable Knowledge Graph
If you've ever stared at a folder full of research papers, code files, screenshots, and notes and wished you could just ask it questions — graphify might be exactly what you need. With 37.6k GitHub stars and 4.2k forks, it's one of the fastest-growing developer tools of 2026.
What Graphify Does
Graphify turns any folder of files into a queryable knowledge graph. The concept comes from Andrej Karpathy's /raw folder workflow: drop anything into a directory — papers, tweets, screenshots, code, notes — and get a structured knowledge graph that reveals connections you didn't know were there.
The pipeline is clean and deterministic:
detect() → extract() → build_graph() → cluster() → analyze() → report() → export()
Each stage is a single function in its own module. They communicate through plain Python dicts and NetworkX graphs — no shared state, no side effects outside the output directory.
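To make the "no shared state" idea concrete, here is a minimal sketch of stages that hand a plain dict down a chain. The stage bodies and the dict keys (paths, files, nodes, edges) are hypothetical stand-ins, not graphify's actual code, and plain dicts take the place of NetworkX graphs so the example needs only the standard library.

```python
from functools import reduce
from typing import Callable

# Hypothetical stand-ins for pipeline stages: each takes a dict and returns a new dict.
Stage = Callable[[dict], dict]

def detect(state: dict) -> dict:
    # Keep only paths with an extension we know how to handle (assumed list).
    known = (".py", ".md")
    files = [p for p in state["paths"] if p.endswith(known)]
    return {**state, "files": files}

def extract(state: dict) -> dict:
    # One node per file; a real extractor would parse the file contents.
    nodes = [{"id": p, "kind": "file"} for p in state["files"]]
    return {**state, "nodes": nodes}

def build_graph(state: dict) -> dict:
    # Link consecutive files so the example has edges to inspect.
    ids = [n["id"] for n in state["nodes"]]
    edges = [{"source": a, "target": b, "label": "INFERRED"} for a, b in zip(ids, ids[1:])]
    return {**state, "edges": edges}

def run_pipeline(state: dict, stages: list[Stage]) -> dict:
    # Each stage sees only the dict returned by the previous one.
    return reduce(lambda acc, stage: stage(acc), stages, state)

result = run_pipeline(
    {"paths": ["a.py", "notes.md", "logo.bin"]},
    [detect, extract, build_graph],
)
print(len(result["nodes"]), len(result["edges"]))
```

Because every stage returns a fresh dict instead of mutating a global, any stage can be tested or swapped in isolation.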
Three Things Graphify Does That LLMs Alone Cannot
- Persistent graph: Relationships are stored in graph.json and survive across sessions. Ask questions weeks later without re-reading everything.
- Honest audit trail: Every edge is tagged EXTRACTED, INFERRED, or AMBIGUOUS. You always know what was found vs. guessed.
- Cross-document surprise: Community detection (Leiden algorithm) finds connections between concepts in different files that you'd never think to ask about directly.
Multimodal Input Support
Graphify is fully multimodal. It doesn't just handle code:
| Type | Extensions | Extraction Method |
|---|---|---|
| Code | .py .ts .js .go .rs .java .c .cpp .rb .cs .kt .scala .php .swift .lua .zig .ps1 .ex .mm .jl .v | AST via tree-sitter + call-graph pass |
| Docs | .md .txt .rst .html .mdx | Concepts + relationships via LLM |
| Papers | .pdf | Citation mining + concept extraction |
| Images | .png .jpg .webp .gif | LLM vision: screenshots, diagrams, any language |
| Video/Audio | .mp4 .mp3 .wav .mkv | Whisper transcription → treat as docs |
The tree-sitter support covers 23 languages out of the box. This is a significant advantage over tools that rely purely on LLM extraction: for code, AST extraction is deterministic, free (no token cost), and exact for relationships stated directly in the source, such as imports and direct calls.
The Extraction Pipeline in Detail
Structural Extraction (AST)
For code files, graphify uses tree-sitter to parse the AST and extract:
- Nodes: functions, classes, methods, modules
- Edges: imports_from, calls, implements, uses
This pass is deterministic and costs zero tokens. A second call-graph pass adds INFERRED edges for indirect call relationships.
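The idea of a deterministic, zero-token structural pass can be illustrated with Python's built-in ast module. This is a swapped-in technique: graphify uses tree-sitter, which is not used here, and the (source, relation, target) triple shape and the module name "example" are assumptions made for the sketch. Only the edge names imports_from and calls come from the list above.

```python
import ast

SOURCE = '''
import os
from pathlib import Path

def helper():
    return Path(os.getcwd())

def main():
    return helper()
'''

def extract_edges(source: str, module: str = "example") -> list[tuple[str, str, str]]:
    """Return (source_node, relation, target_node) triples found in the code."""
    tree = ast.parse(source)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                edges.append((module, "imports_from", alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            edges.append((module, "imports_from", node.module))
        elif isinstance(node, ast.FunctionDef):
            # Record direct calls to plain names inside this function body.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges.append((node.name, "calls", inner.func.id))
    return sorted(edges)

edges = extract_edges(SOURCE)
for edge in edges:
    print(edge)
```

Running the same source through this function always yields the same edges, which is the property that lets a structural pass skip the LLM entirely. Attribute calls such as os.getcwd() are deliberately ignored here; resolving them is the kind of work a second call-graph pass would take on.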
Semantic Extraction (LLM)
For docs, papers, and images, graphify dispatches parallel subagents to extract:
- Nodes: named concepts, entities, citations
- Edges: cites, conceptually_related_to, semantically_similar_to, rationale_for
- Hyperedges: groups of 3+ nodes participating in a shared concept
There's also an opt-in Kimi K2.6 backend (enabled via MOONSHOT_API_KEY) that is reported to extract 3-6x richer relations at roughly a third of the per-token cost of Claude.
Confidence Labels: The Honesty System
This is what sets graphify apart from naive RAG systems:
| Label | Meaning |
|---|---|
| EXTRACTED | Relationship is explicitly stated in the source (import statement, direct call, citation) |
| INFERRED | Reasonable deduction (shared data structure, implied dependency, co-occurrence) |
| AMBIGUOUS | Uncertain; flagged for human review |
Every edge also carries a confidence_score between 0.1 and 1.0. The report never hides uncertainty behind symbols — raw numbers are always shown.
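A short sketch of how the labels and scores could be used downstream. The record layout (source, target, label, confidence_score as dict keys) is an assumption for illustration; the actual graph.json schema is not described in this article and may differ.

```python
# Hypothetical edge records; the real graph.json schema may differ.
edges = [
    {"source": "auth.py", "target": "requests", "label": "EXTRACTED", "confidence_score": 1.0},
    {"source": "DigestAuth", "target": "Response", "label": "INFERRED", "confidence_score": 0.7},
    {"source": "attention", "target": "optimizer", "label": "AMBIGUOUS", "confidence_score": 0.3},
]

def trusted_edges(edges, labels=("EXTRACTED",), min_score=0.0):
    """Keep edges whose label is allowed and whose score meets the threshold."""
    return [
        e for e in edges
        if e["label"] in labels and e["confidence_score"] >= min_score
    ]

strict = trusted_edges(edges)
relaxed = trusted_edges(edges, labels=("EXTRACTED", "INFERRED"), min_score=0.5)
needs_review = [e for e in edges if e["label"] == "AMBIGUOUS"]
print(len(strict), len(relaxed), len(needs_review))
```

Keeping the label and the numeric score as separate fields lets a consumer choose its own trust policy: facts only, facts plus strong inferences, or a review queue of ambiguous edges.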
Output Artifacts
Running graphify . produces:
graphify-out/
├── graph.html Interactive graph — click nodes, search, filter by community
├── obsidian/ Open as Obsidian vault (opt-in via --obsidian)
├── wiki/ Wikipedia-style articles for agent navigation (--wiki)
├── GRAPH_REPORT.md God nodes, surprising connections, suggested questions
├── graph.json Persistent graph — query weeks later without re-reading
└── cache/ SHA256 cache — re-runs only process changed files
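The cache entry in the tree above implies content hashing: a file is reprocessed only when its SHA-256 digest changes. The following is a minimal sketch of that idea under stated assumptions. The manifest file name and its JSON layout are hypothetical and are not graphify's actual cache format.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(paths: list[Path], manifest_path: Path) -> list[Path]:
    """Return files whose digest differs from the stored manifest, then update it."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {str(p): file_digest(p) for p in paths}
    manifest_path.write_text(json.dumps(new, indent=2))
    return [p for p in paths if old.get(str(p)) != new[str(p)]]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    a, b = root / "a.py", root / "b.md"
    a.write_text("print('a')\n")
    b.write_text("# notes\n")
    manifest = root / "manifest.json"

    first = changed_files([a, b], manifest)    # nothing cached yet
    second = changed_files([a, b], manifest)   # no edits since last run
    b.write_text("# notes, revised\n")
    third = changed_files([a, b], manifest)    # only b.md changed

print(len(first), len(second), [p.name for p in third])
```

Hashing contents rather than comparing modification times means a file that is touched but not edited is not re-extracted, which matters when extraction costs LLM tokens.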
The GRAPH_REPORT.md is particularly useful — it contains:
- God nodes: highest-degree concepts (what everything connects through)
- Surprising connections: ranked by composite score, with plain-English explanations
- Suggested questions: 4-5 questions the graph is uniquely positioned to answer
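"Highest-degree" has a simple meaning that a few lines of code can show: count how many edges touch each node and rank the result. The edge list and the function name top_degree_nodes below are hypothetical; this is not how graphify's own god_nodes() is implemented, only a sketch of the underlying measure.

```python
from collections import Counter

# Hypothetical undirected edge list; node names are illustrative only.
edges = [
    ("Session", "Request"),
    ("Session", "Response"),
    ("Session", "DigestAuth"),
    ("DigestAuth", "Response"),
    ("Request", "Response"),
]

def top_degree_nodes(edges, k=2):
    """Rank nodes by how many edges touch them (degree), highest first."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Sort by degree descending, then by name so ties are stable.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:k]

top = top_degree_nodes(edges)
print(top)
```

In this toy graph Response and Session each touch three edges, so they are the nodes "everything connects through".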
Token Reduction Benchmark
On a mixed corpus (Karpathy repos + 5 papers + 4 images, 52 files total): 71.5x fewer tokens per query vs reading the raw files. Token reduction scales with corpus size — at 6 files the graph value is structural clarity, not compression.
Installation and Usage
pip install graphifyy && graphify install
The PyPI package is temporarily named graphifyy while the graphify name is being reclaimed. The CLI and skill command are still graphify.
Basic usage:
/graphify . # full pipeline on current directory
/graphify ./raw # on a specific folder
/graphify ./raw --mode deep # more aggressive INFERRED edge extraction
/graphify ./raw --update # incremental - re-extract only changed files
/graphify add https://arxiv.org/abs/1706.03762 # fetch paper, update graph
/graphify query "what connects attention to the optimizer?"
/graphify path "DigestAuth" "Response"
/graphify explain "SwinTransformer"
Can You Integrate It Into Your Workflow?
Short answer: yes, and it's already designed for it.
For Claude Code / Codex / OpenCode / Cursor / Gemini CLI / Hermes
Graphify v5 explicitly lists Hermes as a supported platform in its pyproject.toml description. The graphify install command sets up the skill for your agent platform. The skill.md file is a 61K-character orchestrating document that guides the agent through the full pipeline step by step.
MCP Server Integration
With --mcp, graphify starts a stdio MCP server exposing tools: query_graph, get_node, get_neighbors, get_community, god_nodes, graph_stats, shortest_path. This means any MCP-compatible agent can query the graph live:
python3 -m graphify.serve graphify-out/graph.json
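As a rough picture of what a tool like shortest_path computes, here is a breadth-first search over an adjacency list. This is a sketch, not the server's implementation: the adjacency data is invented for the example, and in practice it would be derived from graph.json.

```python
from collections import deque

# Hypothetical adjacency list; in practice this would be built from graph.json.
graph = {
    "DigestAuth": ["AuthBase", "Request"],
    "AuthBase": ["DigestAuth", "Session"],
    "Request": ["DigestAuth", "Session"],
    "Session": ["AuthBase", "Request", "Response"],
    "Response": ["Session"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor chain back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None

path = shortest_path(graph, "DigestAuth", "Response")
print(path)
```

An agent that gets back a path like this can explain how two symbols relate without opening any of the files in between.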
Git Hook for Auto-Rebuild
graphify hook install # installs post-commit hook
After every git commit, the hook detects changed code files, re-runs AST extraction, and rebuilds the graph. No background process needed.
Watch Mode for Agentic Workflows
graphify . --watch # auto-sync graph as files change
Code file saves trigger an instant rebuild (AST only, no LLM). Doc/image changes notify you to run --update for the LLM re-pass. Useful when multiple agents are writing code in parallel.
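The split between "rebuild now" and "notify the user" comes down to routing each changed path by its extension. A minimal sketch follows, assuming a small subset of the extensions from the table earlier in the article. The function name route_change and the return values are hypothetical, not part of graphify.

```python
from pathlib import PurePath

# Extension sets are a small, assumed subset of the table earlier in the article.
CODE_EXTS = {".py", ".ts", ".js", ".go", ".rs"}
SEMANTIC_EXTS = {".md", ".txt", ".pdf", ".png", ".jpg"}

def route_change(path: str) -> str:
    """Decide what a file change should trigger."""
    ext = PurePath(path).suffix.lower()
    if ext in CODE_EXTS:
        return "rebuild"   # AST-only rebuild, no LLM call
    if ext in SEMANTIC_EXTS:
        return "notify"    # ask the user to run --update
    return "ignore"

changes = ["src/app.py", "docs/design.md", "assets/diagram.PNG", "build/output.o"]
plan = {p: route_change(p) for p in changes}
print(plan)
```

Routing this way keeps the cheap path automatic and leaves the token-spending path under the user's control.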
Security Model
Graphify is a local development tool with no server component (unless you explicitly start the MCP server). All external input passes through security.py:
- URLs validated to http/https only, blocks private IPs and cloud metadata endpoints
- Content size caps at 50 MB for downloads, 10 MB for text
- Path traversal blocked: graph paths must resolve inside graphify-out/
- XSS prevention: all node labels are HTML-escaped before visualization
- Prompt injection protection: labels sanitized in MCP text output
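Three of the checks listed above can be sketched with the standard library alone. These functions are illustrative and are not the contents of security.py: the names is_safe_url, is_inside, and safe_label are hypothetical, and the URL check only inspects literal IP addresses, whereas a production check would also need to resolve hostnames.

```python
import html
import ipaddress
from pathlib import Path
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Allow only http/https; reject literal private, loopback, and link-local IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # A hostname, not an IP literal. A real check would also resolve DNS.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)

def is_inside(base: Path, candidate: str) -> bool:
    """True if candidate, resolved relative to base, stays under base."""
    base = base.resolve()
    target = (base / candidate).resolve()
    return target == base or base in target.parents

def safe_label(label: str) -> str:
    """Escape HTML metacharacters before rendering a node label."""
    return html.escape(label)

out_dir = Path("graphify-out")
print(is_safe_url("https://arxiv.org/abs/1706.03762"))
print(is_safe_url("http://169.254.169.254/latest/meta-data/"))
print(is_inside(out_dir, "../secrets.txt"))
print(safe_label("<script>alert(1)</script>"))
```

The 169.254.169.254 address is the link-local metadata endpoint used by several cloud providers, which is why link-local ranges are rejected alongside private ones.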
The project is MIT-licensed and actively maintained (214 commits, 70 tags, current version 0.5.5 on the v5 branch).
Practical Integration Paths
For developers already using AI coding assistants, here are concrete integration paths:
- As a skill: Install via graphify install, then type /graphify . in your agent to build a graph of your project. The agent understands the graph and can answer structural questions about your codebase.
- As an MCP server: Add the MCP server to your agent's config for live graph querying. Other agents can discover relationships without reading every file.
- As a library: import graphify gives you programmatic access to extract(), build_from_json(), cluster(), god_nodes(), etc. Build custom pipelines.
- As a CI/CD step: The git hook or watch mode keeps the graph current automatically. Point your documentation generator or onboarding tool at GRAPH_REPORT.md.
Limitations and Caveats
- Requires Python 3.10–3.13 — tree-sitter bindings don't support 3.14 yet
- Semantic extraction costs tokens — the AST pass is free, but doc/paper/image extraction uses LLM calls. Budget accordingly.
- Large graphs (>5000 nodes) — HTML visualization switches to aggregated community view. Full node-level detail via Obsidian vault export.
- Corpus size warning — Over 200 files or 2M words, graphify asks you to pick a subdirectory first.
- The PyPI name is graphifyy (double-y), temporary while the original name is reclaimed.
Conclusion
Graphify solves a real and growing problem: as our personal and professional knowledge bases grow to encompass code, papers, notes, screenshots, and documents, we need a way to navigate the connections between them — not just search within each file. The combination of deterministic AST extraction, honest confidence labeling, community detection, and persistent graph storage makes it a genuinely useful tool, not just a demo.
The 71.5x token reduction benchmark is compelling, but the real value is structural: you discover connections you didn't know existed. That's worth more than any compression ratio.