Galen Guan

Flipbook — When the Browser Becomes an AI-Generated Image

On April 22nd, 2026, Zain Shah posted a demo to X: "Imagine every pixel on your screen, streamed live directly from a model. No HTML, no layout engine, no code." The demo was Flipbook (flipbook.page), built with Eddie Jiao (ex-Humane, Slack) and Drew Carr (ex-Apple), backed by South Park Commons with compute sponsored by Modal.

The HN thread hit 439 points and 118 comments within hours. Reactions split into two camps: "this is the future of computing" and "this is glacially slow GPU-burning slop."

Both are partly right. I spent a few hours with Flipbook, read the bundle, and dug through the open-source clone eren23/openflipbook (⭐ 68 at time of writing) to understand what's actually happening. The gap between the tweet and the reality is revealing.

What Flipbook actually is

Flipbook is not "a browser that streams every pixel live from a model." That describes the video toggle, not the default experience.

Here's what actually happens: you type a query, Flipbook runs an agentic web search, feeds the results plus the model's world knowledge to an image generation model, and you get back a single JPEG. The entire "page" (text, illustrations, layout, annotations) is one <img> tag. The image is roughly 1376×768 pixels, and the text is baked into it: there are no DOM text overlays. If the model flubs a word, the word stays flubbed.

When you click on something in the image, a vision-language model resolves your click to a subject phrase ("the steam dome on this locomotive diagram"), that phrase becomes the query for the next page, and you get another JPEG. Repeat. This is the core loop.
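The core loop can be sketched in a few lines of Python. This is a minimal illustration, not Flipbook's code: the class and method names (Session, visit, click) are hypothetical, and the three model calls are injected as plain callables so the sketch runs without any API keys.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    query: str    # the phrase this page was generated from
    image: bytes  # the single JPEG that is the whole "page"


@dataclass
class Session:
    # All three callables are stand-ins for remote model calls (assumed shapes).
    search: Callable[[str], str]                     # agentic web search -> notes
    draw: Callable[[str, str], bytes]                # (query, notes) -> JPEG bytes
    resolve_click: Callable[[bytes, int, int], str]  # VLM: (image, x, y) -> subject
    history: List[Page] = field(default_factory=list)

    def visit(self, query: str) -> Page:
        """Search, generate one image, and record it as the current page."""
        notes = self.search(query)
        page = Page(query=query, image=self.draw(query, notes))
        self.history.append(page)
        return page

    def click(self, x: int, y: int) -> Page:
        """Resolve a click on the current page to a subject, then visit it."""
        current = self.history[-1]
        subject = self.resolve_click(current.image, x, y)
        return self.visit(subject)
```

With stub callables, session.visit("how do steam locomotives work") followed by session.click(640, 300) yields a second page whose query is whatever phrase the click resolver returned. Nothing else carries over between pages in this sketch, which is the point: the subject phrase is the only link.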

[Figure: Flipbook pipeline architecture]

The video stream is real but guarded: you turn it on per session, and the site warns that it is resource-intensive. Under the hood it opens a WebSocket to Modal's GPU infrastructure running a custom LTX-Video pipeline. The protocol is a custom binary format called LTXF: four ASCII bytes followed by a length-prefixed JSON header and fMP4 segments, which are pushed into a <video> tag via Media Source Extensions. The strategy is called anchor_loop: each page is an anchor frame, and the model generates a short clip that loops back to it for seamless transitions.
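To make the framing concrete, here is a rough Python sketch of how a message of that shape could be encoded and decoded. Several details are assumptions rather than facts from the bundle: that the four ASCII bytes spell "LTXF", that the length prefix is a 4-byte big-endian unsigned integer, and that everything after the header is fMP4 payload. The function names and the "seq" header field are hypothetical.

```python
import json
import struct
from typing import Any, Dict, Tuple

MAGIC = b"LTXF"  # assumption: the four ASCII bytes match the format's name


def encode_frame(header: Dict[str, Any], payload: bytes) -> bytes:
    """Build magic + 4-byte big-endian header length + JSON header + payload."""
    header_bytes = json.dumps(header).encode("utf-8")
    return MAGIC + struct.pack(">I", len(header_bytes)) + header_bytes + payload


def decode_frame(message: bytes) -> Tuple[Dict[str, Any], bytes]:
    """Split one message into its JSON header and the trailing fMP4 bytes."""
    if len(message) < 8 or message[:4] != MAGIC:
        raise ValueError("not an LTXF frame")
    (header_len,) = struct.unpack(">I", message[4:8])
    header_end = 8 + header_len
    if header_end > len(message):
        raise ValueError("truncated header")
    header = json.loads(message[8:header_end].decode("utf-8"))
    return header, message[header_end:]
```

In a browser client the equivalent decode would run on each binary WebSocket message, with the payload handed to a Media Source Extensions SourceBuffer via appendBuffer.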

The internal codename, visible in the production bundle, is "Sketchapedia."

Is it useful?

In my testing: yes, conditionally.

I searched for "how do transformers work" and got a well-structured infographic-style page with a transformer architecture diagram, labeled components, and arrows showing data flow. Clicking on the attention mechanism region generated a detailed breakdown of multi-head attention with example calculations. The information was accurate at a level comparable to a good blog post — not Ph.D. depth, but solid for a first-pass explanation.

An HN commenter with the handle giobox reported a more impressive result: they asked for a torque spec diagram of their car's suspension and got correct torque figures with the ability to click individual components for more detail. Another user, tristor, confirmed using it for rear subframe and bushing specs.

The flip side: several commenters hit quota errors ("Gemini generateContent request failed: 429") during the HN traffic spike. Others noted that going more than 3 levels deep produced increasingly hallucinated content. One user described the experience as "glacially slow" and compared it unfavorably to Microsoft Encarta CDs from the 1990s — a bit harsh, but not entirely unfair when you're staring at a blank screen for 15-19 seconds between "pages."

What the open-source clone reveals

It's rare to get this level of technical transparency so quickly, and eren23's openflipbook repo is unusually thorough. It ships with a detailed STORY.md that reverse-engineers the production Flipbook site through Playwright and bundle inspection, and an ARCHITECTURE.md that maps every file and data flow.

The clone replaces Flipbook's closed stack with a bring-your-own-keys setup: fal.ai for image generation (nano-banana / seedream), OpenRouter for LLMs (Qwen 2.5 VL for click resolution, Qwen 2.5 72B for page planning with web search), Cloudflare R2 for image storage, and MongoDB for session graphs. The license is MIT.

Key architectural differences from the original:

  • Two animation paths, not one. Default is a cheap 5-second MP4 clip from fal-ai/ltx-video (~$0.02 per clip). The streaming LTXF WebSocket protocol is implemented but optional — you deploy ltx_stream.py to your own Modal account and set NEXT_PUBLIC_LTX_WS_URL.
  • Visible status UX. Instead of staring at a blank image for 15 seconds during generation, the SSE stream emits status events ("planning," "drawing") surfaced as an overlay.
  • Shift-drag circle to select. You can draw a freehand stroke around a region instead of just clicking a point. The same VLM resolves the stroke to a subject.
  • Seed-image upload. Drag any image onto the canvas and it becomes the starting page.
  • Precomputed click targets. As soon as a page renders, the VLM precomputes the 3-4 most clickable regions so most taps skip the resolve round-trip.
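The precomputed-targets idea in the last bullet amounts to a local hit test with a remote fallback. The sketch below illustrates that pattern only. It assumes targets are axis-aligned boxes in image pixel coordinates, which may not match how openflipbook stores them, and the Target and resolve names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Target:
    subject: str  # phrase the VLM assigned to this region ahead of time
    x: int        # left edge, in image pixels
    y: int        # top edge, in image pixels
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def resolve(px: int, py: int, targets: Sequence[Target],
            fallback: Callable[[int, int], str]) -> str:
    """Return a precomputed subject if the click lands in one, else ask the VLM."""
    hits = [t for t in targets if t.contains(px, py)]
    if hits:
        # When regions overlap, prefer the smallest, i.e. the most specific one.
        return min(hits, key=lambda t: t.width * t.height).subject
    return fallback(px, py)  # slow path: a round-trip to the VLM
```

Clicks inside a known region return immediately. Only clicks outside every precomputed box pay for the resolve round-trip.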

The tech stack is Next.js 15 + FastAPI on Modal, with 65 Vitest unit tests, 59 pytest tests, and 3 Playwright E2E tests gated behind a PR label to avoid burning API credits.

Why this matters

Flipbook matters less as a product and more as a proof of concept. It demonstrates that a sufficiently good image model can, in many cases, substitute for a UI toolkit. The implications extend beyond novelty:

  1. For education. An HN commenter called it "the future of textbooks." Generating visual explainers on demand, with the ability to click into any sub-topic, maps naturally to exploratory learning. Think interactive Wikipedia rendered as a visual graph rather than hyperlinked text.

  2. For technical reference. The torque-spec-diagram use case isn't a gimmick. Technical documentation is still overwhelmingly text-based, and the effort required to create diagrams means they're rarely updated. A system that generates accurate visual schemas from textual knowledge could change how we think about documentation.

  3. For information discovery. Browsing text requires linear scanning. Browsing images allows peripheral vision and gestalt pattern recognition. Clicking on a visual element is more natural than typing a refined query when you don't quite know what you're looking for.

  4. As a design experiment. Flipbook is essentially saying: the web's layout model (HTML + CSS) is a constraint, not a feature. What would interfaces look like if pixel placement were free? We've been stuck in the "text and colored rectangles" paradigm since NCSA Mosaic (1993). Image-as-UI breaks that assumption.

The problems

The cost is the elephant in the room. Each page generation involves an agentic web search, an LLM planning pass, and an image model inference pass. With the video toggle enabled, you're burning H100-hours for real-time video generation. The site is free because Modal is sponsoring compute, and sponsored compute is not sustainable unit economics.

Speed is the second problem. 15-19 seconds between pages is too long for fluid exploration. Progressive rendering (showing a low-quality draft before the final image) helps, but the gap between idea and result is still an order of magnitude too slow for mainstream use.

Hallucination is the third. Image models render text imperfectly: one example from the bundle inspection showed "HANDLEEBRS" instead of "HANDLEBARS" and "speds speeds" instead of "speeds." More critically, going deep into a topic produces increasingly ungrounded content because each new page is conditioned on the previous generated page, not on a real source. Errors compound.
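A toy model shows why depth hurts. Suppose, purely for illustration, that each hop independently keeps a fixed fraction of the previous page's content grounded. The grounded share then decays geometrically with depth. The per-hop figure used below is a made-up placeholder, not a measurement of Flipbook, and the function name is hypothetical.

```python
def grounded_fraction(per_hop: float, depth: int) -> float:
    """Share of content still grounded after `depth` hops.

    Toy assumption: every hop independently preserves the same fraction
    `per_hop` of what was grounded on the previous page.
    """
    if not 0.0 <= per_hop <= 1.0:
        raise ValueError("per_hop must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_hop ** depth


if __name__ == "__main__":
    # 0.9 is an illustrative placeholder, not a measured accuracy.
    for depth in range(6):
        print(depth, round(grounded_fraction(0.9, depth), 3))
```

Under that placeholder of 0.9 per hop, three hops leave about 73% of the content grounded (0.9 × 0.9 × 0.9 = 0.729). That lines up qualitatively with the commenters' reports that pages drift after a few levels, though the real decay rate is unknown.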

The open-source angle matters more than you'd think

The original Flipbook is closed. The product strategy is straightforward: keep the stack proprietary, raise money, iterate. That's fine. But the open-source clone removes the only thing that made Flipbook unique: not the technology, but the product decision to keep it closed.

eren23's repo proves that the entire paradigm — image-as-UI, click-to-explore, VLM-resolved navigation, video streaming — is commodity infrastructure. The tooling (VLM APIs, image generation, SSE streaming, MSE fMP4 playback) is all available today. You can clone openflipbook, set a few API keys, and have the same experience running on your own infrastructure in under 5 minutes.

This is the real story. Flipbook made the case that image-as-UI is viable. openflipbook made the case that it's accessible.

Where this goes

Flipbook is explicitly an experiment. The FAQ says: "As image and video models become more accurate and performant, Flipbook pages could include more real data, be more interactive, and even take actions and store their own data."

The trajectory is clear: when image generation drops to near-zero cost and latency (and it will, eventually), the HTML-everywhere assumption breaks. Not because images are better than text for everything — they're not — but because they're better for some things, and those things are currently under-served by the web.

For now, Flipbook is a fascinating demo with too much latency and too high a cost to be practical. But if you want to understand where interfaces are heading, spend 20 minutes with it. Then clone openflipbook and spend another 20 minutes reading the architecture. Both are worth your time.
