Running Carnice-V2-27B Locally on RTX 4090 with Ollama — A Complete Guide
On April 25, 2026, developer kai-os released Carnice-V2-27B on Hugging Face — a fine-tuned variant of Qwen/Qwen3.6-27B optimized for Hermes-style agent traces. Within days, GGUF quantizations were made available by the community, making it possible to run this model on consumer-grade GPUs.
This guide covers everything you need to know about the model, how to choose the right quantization for your hardware, and the exact steps to get it running locally with Ollama.
1. What is Carnice-V2-27B?
Carnice-V2-27B is a supervised fine-tuned (SFT) model built on top of Qwen3.6-27B. Its key characteristics:
| Attribute | Detail |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Architecture | Qwen3.5 hybrid (attention + SSM layers) |
| GGUF type | qwen35 |
| Model size | 27 billion parameters |
| Pipeline | Image-text-to-text (vision encoder inherited from base Qwen3.6 but not validated after agent SFT; treat as experimental) |
| License | Apache-2.0 |
| Chat format | ChatML (`<\|im_start\|>` / `<\|im_end\|>` delimiters) |
| Primary use case | Agent traces, tool calling, structured reasoning |
| Release date | April 25, 2026 |
Why Carnice?
The model is specifically tuned for agentic workflows (tool calling, multi-step reasoning, and structured outputs) following the Hermes agent paradigm. On IFEval (instruction following), reported scores improve over the base Qwen3.6-27B: prompt-level strict accuracy rises from 85.0% to 90.0%, and instruction-level strict accuracy from 90.0% to 93.3%.
2. Hardware Requirements
The RTX 4090 has 24GB of VRAM, which is sufficient to run Carnice-V2-27B with appropriate quantization. Here's the quantization ladder:
| Quantization | File Size | Fits RTX 4090 24GB? | Quality |
|---|---|---|---|
| bf16 | 51 GB | ❌ No | Reference (full quality) |
| Q8_0 | 27 GB | ❌ No (but possible with CPU offload) | Near-lossless |
| Q5_K_M | ~18 GB | ✅ Yes (best quality) | Excellent |
| Q4_K_M | ~16 GB | ✅ Yes (best balance) | Very good |
| Q2_K | ~10 GB | ✅ Yes | Acceptable |
| IQ2_M | ~9.4 GB | ✅ Yes | Low but usable |
Note: Q5_K_M (~18 GB) gives the best quality that still fits, but leaves only ~6 GB of headroom.
Our pick for 24 GB cards with long-context needs is Q4_K_M (~16 GB): it leaves ~8 GB for KV cache and context, enabling longer conversations without swapping.
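The headroom figures are plain subtraction, and it can help to see them side by side. The following is a minimal sketch, assuming the approximate file sizes from the table and treating the full 24 GB as available (in practice the desktop and CUDA context take a share). `headroom_gb` is a hypothetical helper name.

```shell
# Rough VRAM headroom per quantization: card VRAM minus GGUF file size.
# Sizes (GB) are the approximate figures from the quantization table.
VRAM_GB=24

headroom_gb() {
  # $1 = GGUF file size in GB; prints VRAM_GB - size with one decimal.
  awk -v vram="$VRAM_GB" -v size="$1" 'BEGIN { printf "%.1f\n", vram - size }'
}

for entry in Q5_K_M:18 Q4_K_M:16 Q2_K:10 IQ2_M:9.4; do
  quant="${entry%%:*}"   # text before the colon
  size="${entry##*:}"    # text after the colon
  echo "$quant: $(headroom_gb "$size") GB left for KV cache and buffers"
done
```

Whatever is left after the weights has to cover the KV cache, compute buffers, and anything else using the GPU, so treat these numbers as upper bounds.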
Note: The GGUF file uses the `qwen35` architecture with hybrid attention/SSM layers. This requires a recent version of llama.cpp (build b8919 or later). Ollama 0.21.0 supports it without issues.
3. Installation Steps
Step 1: Search Ollama Library
The model is not directly available in the Ollama registry. Searching ollama.com turns up gurubot/Carnice-27b-GGUF and anton96vice/carnice entries, but neither can be pulled directly. The CLI in v0.21.0 has no `ollama search` command, and `ollama pull carnice-v2-27b` returns "file does not exist."
Step 2: Download GGUF from Hugging Face
Choose your quantization and download from kai-os/Carnice-V2-27b-GGUF:
```shell
cd ~/workspace
wget https://huggingface.co/kai-os/Carnice-V2-27b-GGUF/resolve/main/carnice-v2-27b-Q4_K_M.gguf
```
The Q4_K_M file is approximately 16 GB. At ~10 MB/s, expect about 27 minutes of download time.
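To estimate the wait for a different quantization or link speed, the arithmetic is size divided by rate. This is a small sketch using decimal units (1 GB = 1000 MB); `download_minutes` is a hypothetical helper, and real transfers vary with server load.

```shell
# Estimated download time in whole minutes.
# $1 = file size in GB, $2 = sustained rate in MB/s (1 GB = 1000 MB).
download_minutes() {
  awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", gb * 1000 / rate / 60 }'
}

download_minutes 16 10   # Q4_K_M at ~10 MB/s -> 27
download_minutes 16 50   # same file on a faster link -> 5
```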
Step 3: Create a Modelfile
The model uses ChatML format with a leading `<think>` tag for chain-of-thought reasoning. Save the following as `Modelfile-carnice` in the same directory as the GGUF file (this is the filename Step 4 expects):

```
FROM ./carnice-v2-27b-Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
<think>
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 8192
```
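To see what the model actually receives, it helps to expand the template by hand. The sketch below mirrors the TEMPLATE logic in plain shell; `render_chatml` is a hypothetical helper written for illustration, not part of Ollama, and it assumes the template behaves as a straightforward text substitution.

```shell
# Print the prompt text the TEMPLATE would produce.
# $1 = system message (may be empty), $2 = user prompt.
render_chatml() {
  # The system turn is emitted only when a system message is given,
  # matching the {{ if .System }} ... {{ end }} guard.
  if [ -n "$1" ]; then
    printf '<|im_start|>system\n%s<|im_end|>\n' "$1"
  fi
  # The user turn, then an open assistant turn pre-filled with <think>.
  printf '<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n<think>\n' "$2"
}

render_chatml "You are a helpful agent." "List the files in /tmp."
```

Because the prompt ends with an open `<think>` line, generation starts inside the reasoning block, and the two `stop` parameters end it at the next turn delimiter.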
Step 4: Import into Ollama
```shell
ollama create carnice-v2-27b -f Modelfile-carnice
```
This operation registers the model with Ollama: it parses the GGUF metadata and places the weights in Ollama's model store (copied or symlinked, depending on the Ollama version). It completes in roughly 30-60 seconds.
Step 5: Verify
```shell
ollama list | grep carnice
# carnice-v2-27b:latest b6cd2ae19e4a 16 GB ...

# Quick test
ollama run carnice-v2-27b --nowordwrap "Hello, what can you do?"
```
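When the check runs inside a setup script, matching on the NAME column is stricter than a loose `grep`. This is a minimal sketch, assuming `ollama list` prints one header row followed by rows whose first column is `name:tag`, as in the sample line above; `has_model` is a hypothetical helper name.

```shell
# Read `ollama list` output on stdin; succeed if a model named $1
# (with any tag) appears in the first column.
has_model() {
  awk -v name="$1" '
    NR > 1 && index($1, name ":") == 1 { found = 1 }  # skip header row
    END { exit !found }
  '
}

# Usage (requires a working Ollama install):
#   ollama list | has_model carnice-v2-27b && echo "model registered"
```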
4. First Run Observations
The model exhibits Qwen3.6's characteristic `<think>` reasoning block before its final response. On first run:
- Token generation speed: ~30-40 tokens/sec on RTX 4090 (Q4_K_M)
- Context window: 8192 tokens comfortably (configurable up to 32K+)
- VRAM usage: ~14-16 GB during inference
- Format: The model outputs structured reasoning inside `<think>...</think>`, followed by the final answer
Sample behavior:
```
<think>
[Model reasons step-by-step within this block]
</think>
Final answer to the user.
```
This dual-output format is particularly useful for debugging agent behavior — you can see why the model chose a particular action before seeing the action itself.
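For logging or downstream tooling, the two parts can be separated with a couple of `awk` filters. This is a sketch that assumes the `<think>` and `</think>` tags sit on their own lines, as in the sample; `reasoning_only` and `answer_only` are hypothetical helper names. Since the template pre-fills the opening tag, the raw output may start without it, which both filters tolerate.

```shell
# Print only the reasoning: lines before the first </think>,
# dropping a leading <think> line if one is present.
reasoning_only() {
  awk '/<\/think>/ { exit } !/<think>/ { print }'
}

# Print only the final answer: everything after the first </think> line.
answer_only() {
  awk 'seen { print } /<\/think>/ { seen = 1 }'
}

sample='<think>
The user greeted me; reply briefly.
</think>
Hello! I can call tools and reason step by step.'

printf '%s\n' "$sample" | answer_only
```

Piping the same text through `reasoning_only` yields just the middle line, which is handy when auditing why an agent picked a given tool.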
5. Performance Tuning Tips
- Context length: GGUF metadata reports up to 262K tokens from the base Qwen3.6 config, but realistic usable context on 24GB VRAM is ~32K with Q4_K_M. Native training context is ~32K.
- GPU layers: Leave the GPU layer count (`num_gpu` in an Ollama Modelfile, `--n-gpu-layers` in llama.cpp) at its default (all layers on GPU) on RTX 4090.
- Batch size: The default works fine for interactive use. For batch inference, increase the batch size.
- Flash attention: If your llama.cpp/Ollama build supports it, enable flash attention for memory-efficient long-context inference.
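The gap between the advertised and the usable context comes down to KV cache size, which grows linearly with context length. The sketch below shows the standard estimate; only the attention layers are counted, on the assumption that the SSM layers keep a fixed-size state. The architecture numbers in the example calls are placeholders chosen for round results, not Carnice's real configuration. Read the real values from the GGUF metadata before relying on the output.

```shell
# Approximate KV cache size in GiB:
#   2 (K and V) * attention_layers * kv_heads * head_dim * bytes_per_value * tokens
kv_cache_gib() {
  # $1 = attention layers, $2 = KV heads, $3 = head dim,
  # $4 = bytes per cached value, $5 = context length in tokens
  awk -v l="$1" -v h="$2" -v d="$3" -v b="$4" -v t="$5" \
    'BEGIN { printf "%.2f\n", 2 * l * h * d * b * t / (1024 ^ 3) }'
}

# PLACEHOLDER architecture values (not the model's real ones):
# 16 attention layers, 4 KV heads, head dim 256, fp16 cache (2 bytes).
kv_cache_gib 16 4 256 2 8192    # -> 0.50
kv_cache_gib 16 4 256 2 32768   # -> 2.00
```

Comparing the result against the headroom left by the chosen quantization gives a quick sanity check before raising `num_ctx`.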
6. Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| `file does not exist` on pull | Not in Ollama registry | Use manual GGUF import (Steps 2-4) |
| `unknown architecture` | Outdated llama.cpp | Update Ollama to latest version |
| `CUDA out of memory` | Quant too large | Try Q4_K_M or Q2_K instead of Q5_K_M/Q8_0 |
| Model outputs gibberish | Wrong template | Ensure ChatML format with `<\|im_start\|>` and `<\|im_end\|>` |
| No `<think>` block | Template missing `<think>` | Add `<think>` after `<\|im_start\|>assistant` |
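The two template-related rows can be checked mechanically before re-importing. This is a minimal sketch that only looks for the literal tokens in the file and does not validate template syntax; `check_modelfile` is a hypothetical helper name.

```shell
# Report which ChatML pieces are missing from a Modelfile.
# $1 = path to the Modelfile. Prints one line per missing token and
# returns non-zero if anything is missing.
check_modelfile() {
  local missing=0
  for token in '<|im_start|>' '<|im_end|>' '<think>'; do
    # -F matches the token literally; -q suppresses output.
    if ! grep -qF -- "$token" "$1"; then
      echo "missing: $token"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_modelfile Modelfile-carnice && echo "template looks complete"` prints the confirmation only when all three tokens are present.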
7. Alternative Runtimes
If Ollama doesn't work for your setup, these alternatives are also compatible:
- LM Studio: Already installed on this system (`~/.lmstudio/bin`). Typically ships with the latest llama.cpp, offering the best chance of compatibility with new GGUF architectures.
- llama.cpp directly: Build from source for bleeding-edge architecture support.
- koboldcpp: Single-file executable, good for Windows users.
- vLLM: Production serving with PagedAttention for high throughput.
Summary
Carnice-V2-27B is a powerful agent-tuned model that runs well on consumer hardware with appropriate quantization. On an RTX 4090 with 24 GB VRAM, the Q4_K_M quantization offers the best balance of quality and resource usage. While the model isn't available directly through Ollama's registry, the manual GGUF import process is straightforward and takes just a few minutes.
The growing ecosystem of Hermes-style agent models — trained on structured agent trajectories — represents an important trend in open-source AI. Models like Carnice-V2-27B demonstrate that fine-tuning on high-quality agent interaction data can significantly improve instruction-following and tool-use capabilities beyond what the base model provides.