Galen Guan

Running Carnice-V2-27B Locally on RTX 4090 with Ollama — A Complete Guide

On April 25, 2026, developer kai-os released Carnice-V2-27B on Hugging Face, a fine-tuned variant of Qwen/Qwen3.6-27B optimized for Hermes-style agent traces. Within days, the community published GGUF quantizations, making it possible to run the model on consumer-grade GPUs.

This guide covers everything you need to know about the model, how to choose the right quantization for your hardware, and the exact steps to get it running locally with Ollama.


1. What is Carnice-V2-27B?

Carnice-V2-27B is a supervised fine-tuned (SFT) model built on top of Qwen3.6-27B. Its key characteristics:

| Attribute | Detail |
| --- | --- |
| Base model | Qwen/Qwen3.6-27B |
| Architecture | Qwen3.5 hybrid (attention + SSM layers) |
| GGUF type | qwen35 |
| Model size | 27 billion parameters |
| Pipeline | Image-text-to-text (multimodal). The vision encoder is inherited from base Qwen3.6 but has not been validated after the agent SFT; treat it as experimental. |
| License | Apache-2.0 |
| Chat format | ChatML (same turn markers as the Modelfile in Step 3) |
| Primary use case | Agent traces, tool calling, structured reasoning |
| Release date | April 25, 2026 |

Why Carnice?

The model is tuned specifically for agentic workflows: tool calling, multi-step reasoning, and structured outputs following the Hermes agent paradigm. On IFEval (instruction following) it improves on the base Qwen3.6-27B, with the prompt-level strict score rising from 85.0% to 90.0% (+5.0 points) and the instruction-level strict score from 90.0% to 93.3% (+3.3 points).


2. Hardware Requirements

The RTX 4090 has 24GB of VRAM, which is sufficient to run Carnice-V2-27B with appropriate quantization. Here's the quantization ladder:

| Quantization | File size | Fits RTX 4090 (24 GB)? | Quality |
| --- | --- | --- | --- |
| bf16 | 51 GB | ❌ No | Reference (full quality) |
| Q8_0 | 27 GB | ❌ No (but possible with CPU offload) | Near-lossless |
| Q5_K_M | ~18 GB | ✅ Yes (best quality) | Excellent |
| Q4_K_M | ~16 GB | ✅ Yes (best balance) | Very good |
| Q2_K | ~10 GB | ✅ Yes | Acceptable |
| IQ2_M | ~9.4 GB | ✅ Yes | Low but usable |

Our pick: Q4_K_M (~16 GB). Q5_K_M offers the best quality that still fits, but Q4_K_M is the better choice for 24 GB cards with long-context needs: it leaves about 8 GB of VRAM for the KV cache and context, enabling longer conversations without swapping.
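
The headroom figures can be reproduced with a few lines of arithmetic. The sketch below is only an approximation: it treats the GGUF file size from the table as the VRAM used by the weights, which ignores runtime overhead, and the function names are made up for this illustration.

```python
# File sizes in GB, copied from the quantization table ("~" values are approximate).
QUANT_SIZES_GB = {
    "bf16": 51.0,
    "Q8_0": 27.0,
    "Q5_K_M": 18.0,
    "Q4_K_M": 16.0,
    "Q2_K": 10.0,
    "IQ2_M": 9.4,
}


def vram_headroom_gb(quant: str, vram_gb: float = 24.0) -> float:
    """VRAM left after loading the weights; a negative value means they do not fit."""
    return vram_gb - QUANT_SIZES_GB[quant]


def quants_that_fit(vram_gb: float = 24.0, min_headroom_gb: float = 0.0) -> list:
    """Quantizations whose weights fit with at least min_headroom_gb to spare."""
    return [
        quant
        for quant in QUANT_SIZES_GB
        if vram_headroom_gb(quant, vram_gb) >= min_headroom_gb
    ]


for name in QUANT_SIZES_GB:
    print(f"{name:8s} headroom on 24 GB: {vram_headroom_gb(name):+.1f} GB")
```

Raising min_headroom_gb is a simple way to express "I want room for a long context": with a floor of 8 GB, Q5_K_M drops out and Q4_K_M becomes the largest remaining option.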

Note: The GGUF file uses the qwen35 architecture with hybrid attention/SSM layers. This requires a recent version of llama.cpp (build b8919 or later). Ollama 0.21.0 supports it without issues.
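
If you script your setup, a version check can catch an outdated install before you download 16 GB. This is a minimal sketch: it assumes the output of ollama --version contains a plain X.Y.Z number, and it treats 0.21.0 as a floor only because that is the version this guide was tested with. The helper names are hypothetical.

```python
import re
import subprocess

MIN_OLLAMA = (0, 21, 0)  # version this guide reports as working (assumed floor)


def parse_version(text: str) -> tuple:
    """Extract the first X.Y.Z version number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_supported(version_output: str, minimum: tuple = MIN_OLLAMA) -> bool:
    """Compare numerically, so that 0.9.x correctly sorts below 0.21.0."""
    return parse_version(version_output) >= minimum


def installed_ollama_version_output() -> str:
    """Run `ollama --version` and return its raw output (requires Ollama on PATH)."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Usage would look like is_supported(installed_ollama_version_output()). Comparing integer tuples rather than strings matters here, because a string comparison would rank "0.9.6" above "0.21.0".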


3. Installation Steps

Step 1: Search Ollama Library

The model is not directly available in the Ollama registry. Searching ollama.com turns up the gurubot/Carnice-27b-GGUF and anton96vice/carnice entries, but neither could be pulled directly. Ollama v0.21.0 has no ollama search subcommand, and ollama pull carnice-v2-27b fails with "file does not exist."

Step 2: Download GGUF from Hugging Face

Choose your quantization and download from kai-os/Carnice-V2-27b-GGUF:

cd ~/workspace
wget https://huggingface.co/kai-os/Carnice-V2-27b-GGUF/resolve/main/carnice-v2-27b-Q4_K_M.gguf

The Q4_K_M file is approximately 16 GB. At ~10 MB/s, expect about 27 minutes of download time (16,000 MB / 10 MB/s = 1,600 seconds).
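
Two quick sanity checks are worth scripting around a download this large: how long it should take on your connection, and whether the file you ended up with is really a GGUF file rather than an HTML error page. The sketch below assumes the common little-endian GGUF layout (4 magic bytes "GGUF" followed by a 32-bit version number); both function names are invented for this example.

```python
import struct


def estimate_download_minutes(size_gb: float, speed_mb_s: float) -> float:
    """Estimated download time in minutes, using 1 GB = 1000 MB."""
    return size_gb * 1000 / speed_mb_s / 60


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..8 hold the format version as a little-endian unsigned 32-bit integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling read_gguf_version("carnice-v2-27b-Q4_K_M.gguf") right after wget finishes is a cheap way to detect a truncated or redirected download before handing the file to Ollama.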

Step 3: Create a Modelfile

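The model uses the ChatML format, with a leading <think> tag that opens the chain-of-thought reasoning block. Save the following as Modelfile-carnice (the name used in Step 4) in the same directory as the GGUF file: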

FROM ./carnice-v2-27b-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
<think>
"""

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 8192
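
To see what this template actually sends to the model, it helps to render it by hand. The Python sketch below mirrors the TEMPLATE above for a single user turn; it is an illustration of the string layout only, not how Ollama implements templating, and render_prompt is a hypothetical helper.

```python
def render_prompt(prompt: str, system: str = "") -> str:
    """Mimic the Modelfile TEMPLATE: optional system turn, user turn, open assistant turn."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # The assistant turn is left open and pre-filled with <think>, so the model
    # begins generating inside the reasoning block.
    parts.append("<|im_start|>assistant\n<think>\n")
    return "".join(parts)


print(render_prompt("Hello, what can you do?", system="You are a helpful agent."))
```

Because <think> is part of the prompt rather than the completion, the generated text may begin inside the reasoning block without repeating the opening tag.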

Step 4: Import into Ollama

ollama create carnice-v2-27b -f Modelfile-carnice

This imports the GGUF file into Ollama's storage (as a copy or a symlink, depending on the Ollama version), parses the metadata, and registers the model. It completes in roughly 30-60 seconds.

Step 5: Verify

ollama list | grep carnice
# carnice-v2-27b:latest      b6cd2ae19e4a    16 GB     ...

# Quick test
ollama run carnice-v2-27b --nowordwrap "Hello, what can you do?"
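
For agent work you will usually call the model from code rather than the CLI. The sketch below builds a request for Ollama's local REST endpoint (/api/generate on port 11434, the default) using only the standard library. The helper names are illustrative, and generate() needs a running Ollama server, so it is defined here but not called.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(model: str, prompt: str, num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream of chunks
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> dict:
    """Send the request and return the decoded JSON reply (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read().decode("utf-8"))


# Example, with the server running:
#   reply = generate("carnice-v2-27b", "Hello, what can you do?")
#   print(reply["response"])
```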

4. First Run Observations

The model exhibits Qwen3.6's characteristic <think> reasoning block before its final response. On first run:

  • Token generation speed: ~30-40 tokens/sec on RTX 4090 (Q4_K_M)
  • Context window: 8192 tokens comfortably (configurable up to 32K+)
  • VRAM usage: ~14-16 GB during inference
  • Format: The model outputs structured reasoning inside <think>...</think>, followed by the final answer
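
To measure generation speed on your own card, one option is to compute it from the timing fields in a non-streaming /api/generate reply: eval_count (generated tokens) and eval_duration (nanoseconds). A minimal sketch, with a made-up reply dictionary standing in for a real response:

```python
def tokens_per_second(reply: dict) -> float:
    """Generation speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    seconds = reply["eval_duration"] / 1e9
    return reply["eval_count"] / seconds


# Illustrative numbers only, not a measurement: 350 tokens generated in 10 seconds.
sample_reply = {"eval_count": 350, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample_reply):.1f} tokens/sec")
```

On the command line, ollama run with the --verbose flag prints comparable timing statistics after each response.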

Sample behavior:

<think>
[Model reasons step-by-step within this block]
</think>
Final answer to the user.

This dual-output format is particularly useful for debugging agent behavior — you can see why the model chose a particular action before seeing the action itself.
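
When logging agent runs, it is convenient to store the reasoning and the final answer separately. The sketch below shows one way to split them. It treats the opening <think> tag as optional, on the assumption that the template may already have emitted it as part of the prompt; split_reasoning is a hypothetical helper name.

```python
import re

# Optional opening tag, lazily captured reasoning, mandatory closing tag.
THINK_RE = re.compile(r"^\s*(?:<think>)?(.*?)</think>\s*", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Split model output into (reasoning, answer).

    If no closing </think> tag is found, the whole text is treated as the answer.
    """
    match = THINK_RE.match(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>\nCheck the tool list first.\n</think>\nI can call tools.")
print("reasoning:", reasoning)
print("answer:", answer)
```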


5. Performance Tuning Tips

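  • Context length: GGUF metadata reports up to 262K tokens, inherited from the base Qwen3.6 config, but the realistic usable context on 24 GB of VRAM is ~32K with Q4_K_M. Native training context is ~32K.
  • GPU layers: Leave the GPU layer count (num_gpu in Ollama, --n-gpu-layers in llama.cpp) at its default on the RTX 4090 so that all layers stay on the GPU.
  • Batch size: The default works fine for interactive use. For offline batch inference, a larger batch size (num_batch in Ollama) can raise prompt-processing throughput at the cost of some extra VRAM.
  • Flash attention: If your llama.cpp/Ollama build supports it, enable flash attention for memory-efficient long-context inference. In Ollama this is controlled by setting OLLAMA_FLASH_ATTENTION=1 for the server.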
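
The reason context length is limited by VRAM is that the KV cache grows linearly with the number of tokens. The sketch below applies the standard formula for attention layers (keys plus values, per layer, per KV head, per token). The architecture numbers in it are placeholders chosen to make the arithmetic easy to follow; they are not Carnice-V2-27B's real values, which you would need to read from the GGUF metadata. SSM layers are left out on the assumption that their state does not grow with context.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: keys and values for every attention layer and every token."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# PLACEHOLDER architecture values for illustration only.
# They are NOT taken from Carnice-V2-27B or Qwen3.6.
EXAMPLE_ARCH = {"n_attn_layers": 16, "n_kv_heads": 4, "head_dim": 128}


def kv_cache_gib(context_len: int, **arch) -> float:
    """Same calculation, expressed in GiB (2**30 bytes) with fp16 cache entries."""
    return kv_cache_bytes(context_len=context_len, **arch) / 2**30


for ctx in (8192, 32768, 262144):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx, **EXAMPLE_ARCH):.2f} GiB")
```

Whatever the real layer counts are, the shape of the result is the same: going from a 32K to a 262K context multiplies the cache by eight, which is why the advertised maximum is not a practical target on a 24 GB card.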

6. Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| "file does not exist" on pull | Not in Ollama registry | Use manual GGUF import (Steps 2-4) |
| unknown architecture | Outdated llama.cpp | Update Ollama to the latest version |
| CUDA out of memory | Quant too large | Try Q4_K_M or Q2_K instead of Q5_K_M/Q8_0 |
| Model outputs gibberish | Wrong template | Ensure the ChatML template from Step 3 is in use |
| No <think> block | Template missing <think> | Add <think> after the opening of the assistant turn, as in Step 3 |
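
If you wrap the setup in a script, the first three rows of the table can be turned into a small lookup that prints a hint when a command fails. This is only a sketch: the substrings are taken from the symptoms above, and the assumption that real error messages contain them verbatim may not hold for every Ollama version.

```python
from typing import Optional

# (substring to look for, suggested fix), following the troubleshooting table.
FIXES = [
    ("file does not exist", "Not in the Ollama registry: import the GGUF manually (Steps 2-4)."),
    ("unknown architecture", "Outdated llama.cpp: update Ollama to the latest version."),
    ("out of memory", "Quant too large: try Q4_K_M or Q2_K instead of Q5_K_M/Q8_0."),
]


def suggest_fix(error_text: str) -> Optional[str]:
    """Return the first matching suggestion, or None if the error is not recognized."""
    lowered = error_text.lower()
    for needle, fix in FIXES:
        if needle in lowered:
            return fix
    return None


print(suggest_fix("Error: CUDA out of memory"))
```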

7. Alternative Runtimes

If Ollama doesn't work for your setup, these alternatives are also compatible:

  1. LM Studio: Installed on the test system used for this guide (~/.lmstudio/bin). It typically ships with a recent llama.cpp, offering the best chance of compatibility with new GGUF architectures.
  2. llama.cpp directly: Build from source for bleeding-edge architecture support.
  3. koboldcpp: Single-file executable, good for Windows users.
  4. vLLM: Production serving with PagedAttention for high throughput.
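
Several of these runtimes can expose an OpenAI-compatible /v1/chat/completions endpoint, so client code can stay the same when you switch. The sketch below builds such a request with the standard library. The ports are the usual local defaults as far as I know and should be treated as assumptions; the model name you pass depends on how each runtime registers the file.

```python
import json
import urllib.request

# Assumed default local base URLs; check each runtime's own server settings.
BASE_URLS = {
    "lmstudio": "http://localhost:1234/v1",
    "llama.cpp": "http://localhost:8080/v1",
    "vllm": "http://localhost:8000/v1",
}


def build_chat_request(runtime: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a chat-completions request for the chosen runtime."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        BASE_URLS[runtime] + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example, with a local server running:
#   request = build_chat_request("llama.cpp", "carnice-v2-27b", "Hello")
#   with urllib.request.urlopen(request) as response:
#       print(json.loads(response.read())["choices"][0]["message"]["content"])
```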

Summary

Carnice-V2-27B is a powerful agent-tuned model that runs well on consumer hardware with appropriate quantization. On an RTX 4090 with 24 GB VRAM, the Q4_K_M quantization offers the best balance of quality and resource usage. While the model isn't available directly through Ollama's registry, the manual GGUF import process is straightforward and takes just a few minutes once the file has been downloaded.

The growing ecosystem of Hermes-style agent models — trained on structured agent trajectories — represents an important trend in open-source AI. Models like Carnice-V2-27B demonstrate that fine-tuning on high-quality agent interaction data can significantly improve instruction-following and tool-use capabilities beyond what the base model provides.


Links: