Galen Guan

Running NVIDIA Nemotron Nano Omni 30B on a Single RTX 4090

NVIDIA released Nemotron 3 Nano Omni on April 28, 2026 — less than a week before this writing. The headline is big: it's the first open-weight 30B-class model that natively handles video, audio, image, and text in a single architecture. No stitching together separate ASR and vision pipelines. One model, four modalities.

But the real question is: can you actually run this thing on a consumer GPU? The official requirements list an H100 80GB for BF16, an L40S 48GB for FP8, and an RTX 5090 32GB for NVFP4. The RTX 4090 — still the most popular high-end consumer card at 24GB — doesn't make the cut.

I spent a morning finding out whether that's actually true.

The Architecture: What Makes It Different

Nemotron Nano Omni isn't just a VLM with audio bolted on. It's a three-component system built on NVIDIA's own hardware-aware research:

LLM Backbone: Nemotron 3 Nano 30B-A3B, a Mamba2-Transformer hybrid Mixture of Experts. Total parameters: 31 billion. Active per token: only ~3 billion. This is the same backbone powering NVIDIA's text-only Nemotron Nano family, and it's what makes the model surprisingly fast on consumer hardware despite its 30B label. Mamba2 is a state-space model rather than an attention variant: its layers run in time linear in sequence length with a fixed-size state, so the model can chew through long contexts without the quadratic cost of full attention, while the Transformer layers handle precision reasoning.

Vision Encoder: CRADIO v4-H — The "H" variant is NVIDIA's high-resolution vision encoder, handling both static images and video frames. It feeds visual tokens into the LLM, supporting up to 2 minutes of video at 1080p (1 fps sampling) or 720p (2 fps). A token pruning technique drops 50% of redundant visual tokens, halving prefill latency.
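To get a feel for what the sampling rates and the 50% pruning mean for prefill, here is a rough token-budget sketch. The tokens-per-frame value is a placeholder chosen for illustration; the real number depends on the encoder's patch size and input resolution, which are not given here, and it would differ between 1080p and 720p.

```python
def visual_tokens(duration_s: int, fps: int, tokens_per_frame: int,
                  keep_ratio: float = 0.5) -> tuple[int, int]:
    """Return (tokens before pruning, tokens after pruning) for one clip."""
    frames = duration_s * fps
    before = frames * tokens_per_frame
    after = int(before * keep_ratio)
    return before, after


TOKENS_PER_FRAME = 256  # hypothetical placeholder, not from the model card

# Two minutes of video at the two sampling rates mentioned above.
for label, fps in (("1080p @ 1 fps", 1), ("720p @ 2 fps", 2)):
    before, after = visual_tokens(120, fps, TOKENS_PER_FRAME)
    print(f"{label}: {before} -> {after} visual tokens")
```

Halving the token count halves prefill work only to a first approximation; the latency figure quoted above is not something this sketch derives.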

Speech Encoder: Parakeet tdt-0.6b — NVIDIA's in-house CTC/RNN-T encoder supporting up to one hour of audio. It produces word-level timestamps during transcription, a detail that matters for enterprise meeting intelligence workflows.

The output is text-only — no image or audio generation. But within that text output, the model supports reasoning chains (CoT), JSON structured output, tool calling, and ASR timestamps.

The Qwen3 Connection

Here's something the README states openly that many reviewers missed: Nemotron Nano Omni was "improved using Qwen3-VL-30B-A3B-Instruct" — along with Qwen3.5-122B, Qwen2.5-VL-72B, and gpt-oss-120b.

This isn't a from-scratch NVIDIA creation. It's a distillation + enhancement play. NVIDIA took Qwen3-VL's 30B-A3B architecture, replaced its ViT vision encoder with CRADIO v4-H, added the Parakeet audio pathway, swapped in Mamba2 hybrid layers, and retrained on NVIDIA's proprietary Nemotron datasets. The result is a model that inherits Qwen3-VL's strong visual reasoning while gaining native audio and video understanding the original never had.

Benchmarks: The Numbers

NVIDIA published results across 14 multimodal benchmarks. Here are the highlights in non-reasoning mode:

Benchmark          BF16   FP8    NVFP4
MathVista_MINI     71.9   71.1   71.3
OCRBenchV2 (EN)    65.8   65.6   65.8
Video MME          70.8   69.4   69.6
Daily Omni         74.5   74.1   74.2
CVBench2D          84.2   85.6   85.3
Mean (9 non-ASR)   65.8   65.4   65.4

The quantization story is impressive: FP8 and NVFP4 each lose about 0.4 points on the nine-benchmark mean (65.8 down to 65.4). Going from BF16 to NVFP4 shrinks the checkpoint from 61.5 GB to 20.9 GB with negligible accuracy degradation, if you have the hardware to run those formats.
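The two file sizes translate into average bits per parameter with one line of arithmetic. The sketch below assumes decimal gigabytes and spreads the whole checkpoint (encoders included) over the 31 billion total parameters quoted earlier, so treat the results as rough averages rather than exact format widths.

```python
def bits_per_param(size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter for a checkpoint of the given size."""
    # GB (10^9 bytes) and billions of parameters (10^9) cancel out.
    return size_gb * 8 / params_billion


TOTAL_PARAMS_B = 31.0  # total parameter count quoted in the article

bf16 = bits_per_param(61.5, TOTAL_PARAMS_B)
nvfp4 = bits_per_param(20.9, TOTAL_PARAMS_B)

print(f"Larger checkpoint : {bf16:.1f} bits/param")
print(f"Smaller checkpoint: {nvfp4:.1f} bits/param")
print(f"Size reduction    : {61.5 / 20.9:.2f}x")
```

The larger checkpoint lands close to 16 bits per parameter, as expected for BF16. The smaller one averages above 4 bits, which would be consistent with some tensors and scale factors being kept at higher precision, though I have not verified the layout.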

For ASR, TED-LIUM Long hits 3.11% WER and HF-ASR hits 5.95%. Both quantized variants are within 0.03 points of BF16.
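For readers who have not worked with the metric: word error rate is the word-level edit distance between the reference transcript and the model's transcript, divided by the reference length. The function below is a minimal sketch of that definition. It skips the text normalization (casing, punctuation, number formatting) that real ASR leaderboards apply before scoring, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.2%}")
```

A 3.11% WER therefore means roughly three word-level edits per hundred reference words.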

The Reality of Running It on an RTX 4090

My test setup: RTX 4090 24GB, Ollama 0.21.0, Ubuntu Linux.

Step 1: Find the Right Quantization

The official NVFP4 format requires Blackwell (RTX 5090) architecture, so GGUF is the only viable path for a 4090. I used Unsloth's IQ3_S quant from their extensive quantization matrix: 17.5 GB for the model plus 1.5 GB for the multimodal projector, totaling roughly 19 GB. On a 24 GB card that leaves about 5 GB on paper, of which I count on roughly 3.5 GB for KV cache once runtime buffers take their share. That is tight but workable at 8192 context.
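The budget is simple subtraction, sketched here so you can plug in a different quant or card. The 1.5 GB runtime reserve is a placeholder for CUDA context and compute buffers, not a measured value; it is one way to reconcile the 5 GB on-paper difference with the roughly 3.5 GB figure for KV cache.

```python
def kv_headroom_gb(card_gb: float, weights_gb: float, projector_gb: float,
                   reserved_gb: float = 0.0) -> float:
    """VRAM left for KV cache after weights, projector, and a runtime reserve."""
    return card_gb - weights_gb - projector_gb - reserved_gb


# Sizes from the article: 24 GB card, 17.5 GB IQ3_S weights, 1.5 GB projector.
on_paper = kv_headroom_gb(24.0, 17.5, 1.5)
# Hypothetical reserve for CUDA context and compute buffers.
with_reserve = kv_headroom_gb(24.0, 17.5, 1.5, reserved_gb=1.5)

print(f"On paper: {on_paper:.1f} GB, with a 1.5 GB reserve: {with_reserve:.1f} GB")
```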

Step 2: Import Into Ollama

The model isn't in Ollama's registry yet. The GGUF import path works smoothly:

ollama create nemotron-nano-omni-text -f Modelfile
# Model copies into Ollama's blob store, GGUF file can be deleted after
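The command above expects a Modelfile next to the GGUF. As a minimal sketch, the helper below generates one using the Modelfile FROM and PARAMETER directives. The GGUF filename is hypothetical (substitute whatever you downloaded), and pinning num_ctx to 8192 simply mirrors the context size discussed earlier rather than a required setting.

```python
def build_modelfile(gguf_path: str, num_ctx: int = 8192) -> str:
    """Return Modelfile text that imports a local GGUF with a fixed context size."""
    lines = [
        f"FROM {gguf_path}",             # path to the local GGUF weights
        f"PARAMETER num_ctx {num_ctx}",  # context window for the imported model
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical filename: replace with the GGUF file you actually downloaded.
    print(build_modelfile("./nemotron-nano-omni-IQ3_S.gguf"), end="")
```

Save the printed text as a file named Modelfile, then run the ollama create command shown above from the same directory.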

Load time: 9.27 seconds cold start. Subsequent runs are instant.

Step 3: Text Benchmark Results

I tested across several tasks. Here are the real numbers:

Load duration:         9.27s
Prompt evaluation:     673-1073 tokens/s (varies with prompt length)
Text generation:       196-201 tokens/s  ← consistently in this range
GPU utilization:       91%
VRAM consumption:      ~21.4 GB (model + KV cache)
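If you want to reproduce throughput numbers like these programmatically, Ollama's generate API returns timing fields in nanoseconds alongside token counts. The sketch below converts them to seconds and tokens per second. The sample dictionary is illustrative only, with values picked for easy arithmetic; it is not a captured response from my runs.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_timings(resp: dict) -> dict:
    """Convert timing fields of an Ollama generate response to seconds and tokens/s."""
    prompt_s = resp["prompt_eval_duration"] / NS_PER_S
    gen_s = resp["eval_duration"] / NS_PER_S
    return {
        "load_s": resp["load_duration"] / NS_PER_S,
        "prompt_tps": resp["prompt_eval_count"] / prompt_s,
        "gen_tps": resp["eval_count"] / gen_s,
    }


# Illustrative values only; not a real captured response.
sample = {
    "load_duration": 9_270_000_000,
    "prompt_eval_count": 100,
    "prompt_eval_duration": 125_000_000,
    "eval_count": 400,
    "eval_duration": 2_000_000_000,
}

print(summarize_timings(sample))
```

In practice the dictionary would be the JSON body of a non-streaming request to the local generate endpoint, assuming a default Ollama install.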

A 30B MoE generating at 200 tokens per second on a single consumer GPU is genuinely impressive. For comparison, running a dense 7B model like Qwen2.5-7B on the same hardware hits around 110-130 t/s. The MoE architecture's 3B active parameter count is doing the heavy lifting here.
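A back-of-the-envelope way to see why the active parameter count matters: token generation is largely memory-bandwidth bound, so a crude ceiling is bandwidth divided by the bytes of weights read per token. The sketch below makes several simplifying assumptions that are mine, not measurements: it uses the RTX 4090's spec-sheet bandwidth of about 1008 GB/s, applies the same average bits per weight to both models, and ignores KV cache reads, recurrent state, routing, and kernel overhead.

```python
def decode_ceiling_tps(active_params_b: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


RTX_4090_BANDWIDTH_GB_S = 1008.0   # spec-sheet figure, used as an assumption
BITS_PER_WEIGHT = 17.5 * 8 / 31.0  # average implied by a 17.5 GB file for 31B params

moe = decode_ceiling_tps(3.0, BITS_PER_WEIGHT, RTX_4090_BANDWIDTH_GB_S)
dense = decode_ceiling_tps(7.0, BITS_PER_WEIGHT, RTX_4090_BANDWIDTH_GB_S)

print(f"MoE, ~3B active: {moe:.0f} tokens/s ceiling")
print(f"Dense 7B       : {dense:.0f} tokens/s ceiling")
print(f"Ratio          : {moe / dense:.2f}x")
```

Both measured figures sit well below these ceilings, and the measured gap between the two models is smaller than the idealized ratio, which is what you would expect once the ignored costs are counted. The direction of the effect is the point, not the exact numbers.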

What It Handled Well

  • English technical reasoning: Strong. The model produced coherent Chain-of-Thought reasoning, complete with self-verification steps. When asked to explain quantum computing in three sentences, it drafted a mental checklist, verified each sentence against the constraint, then output clean results.

  • Chinese capability: Surprisingly good despite the model being "English only" per the README. When asked to explain attention mechanisms in Chinese and provide PyTorch code, it delivered fluent Chinese with properly formatted mathematical notation and working Python. The response quality rivaled dedicated Chinese models. The capability is probably inherited from Qwen3-VL's multilingual pretraining, and NVIDIA's English-focused post-training does not appear to have degraded it.

  • Code generation: Solid. Standard Python implementations with correct variable naming and docstrings. No hallucinations on standard algorithms.

The Multimodal Limitation

I attempted multimodal inference through Ollama's image input and through llama.cpp's server mode. Neither path works for this model in its current GGUF state on consumer hardware.

The reason is architectural: Nemotron Nano Omni's multimodal pipeline needs two separate encoders (CRADIO for vision, Parakeet for audio) working in concert with the LLM backbone. The GGUF format can store the weights for all three components, and the downloaded mmproj file contains the vision projector weights, but Ollama's current multimodal support is limited to the LLaVA-style single-projection architecture. The custom encoder pipeline used by Nemotron Nano Omni requires either vLLM 0.20.0 (which demands H100/L40S-class hardware) or a mature llama.cpp integration that's still being developed.

This is the honest reality of cutting-edge model deployment: text mode works beautifully on consumer hardware today. Multimodal requires patience — or a cloud GPU.

What This Means

Nemotron Nano Omni represents a genuine step forward in several ways:

Architecture innovation shipping as open-weight: The Mamba2-Transformer hybrid design, where state-space model layers and attention layers alternate, has been a research topic for two years. This release is strong evidence that it works at scale in an open-weight multimodal model. The 31B/3B MoE ratio means consumer GPUs get "big model" capability at "small model" compute cost.

The distillation strategy is worth studying: NVIDIA's approach of starting from Qwen3-VL, swapping proprietary encoders, and retraining on internal data is a template other companies will follow. It's faster than training from scratch and produces differentiated models with unique capabilities (native audio, in this case) that the base model never had.

Consumer hardware is closer than the spec sheet suggests: When the text backbone of a 30B multimodal model runs at 200 t/s on a 24GB card, the gap between "enterprise" and "local" AI is narrowing faster than the official requirements imply. What NVIDIA calls "H100 minimum" for full precision is accurate for the complete multimodal stack, but the text backbone alone is already useful at consumer-friendly quantizations.

Multimodal is still a moving target: The tooling ecosystem for multimodal GGUF models is fragmented. Ollama, llama.cpp, and LM Studio are all converging on better multimodal support, but models with custom encoder pipelines (like CRADIO + Parakeet) remain challenging. The gap between "download and run" for text models and "download and debug" for multimodal is still real.

Should You Try It?

If you have a 24GB GPU and want to experience a 30B MoE with strong reasoning and Chinese capability: yes. The IQ3_S quant runs beautifully at 200 t/s. Import into Ollama, test it, and you'll have a capable reasoning model that punches above its 3B-per-token weight class.

If you need the full video + audio + image multimodal stack today: be prepared to provision an H100 or L40S instance. Or wait for llama.cpp's Parakeet integration to land — the PRs are open and active.

The model itself is a fascinating proof point: NVIDIA shipping open-weight frontier research, built on community foundations (Qwen), enhanced with proprietary innovations (Mamba2, CRADIO, Parakeet). It's a model ecosystem where the lines between "open" and "proprietary" are blurring in productive ways.