
DeepSeek V4 GGUF Status: What Runs Locally and What Doesn't

Shivam Malani

The release of DeepSeek-V4-Pro and DeepSeek-V4-Flash on April 24, 2026 reset expectations for local large-model deployment. Anyone hoping to grab a small GGUF and run it on a single consumer GPU faces a hard constraint baked into the weights themselves: both models ship in mixed FP4 + FP8 precision, where MoE expert parameters are already at 4-bit. That changes what GGUF quantization can realistically do.

Quick answer: A community GGUF conversion of DeepSeek-V4-Flash exists, registered at roughly 158B parameters and weighing around 170 GB on disk. No official GGUF has been published. Because the source weights are already FP4+FP8, sub-Q4 quantization is unlikely to produce usable output, and the full-fat 1.6T-parameter V4-Pro remains out of reach for typical local rigs.


The two models and why size estimates mislead

DeepSeek-V4 is a two-model family. V4-Pro carries 1.6 trillion total parameters with 49B active per token. V4-Flash carries 284B total with 13B active. Both support a 1M-token context window through a hybrid attention design that combines Compressed Sparse Attention and Heavily Compressed Attention.

The wrinkle that matters for GGUF: parameter count does not map cleanly to file size here. MoE expert weights are stored in FP4, while most other parameters use FP8. V4-Flash, despite its 284B parameter label, occupies around 170 GB on disk rather than the ~284 GB a uniformly FP8 model would require. V4-Pro lands near 860–900 GB rather than the naive 1.6 TB estimate.
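The arithmetic is easy to sanity-check. The sketch below reproduces the reported footprints from the parameter counts; the expert-parameter fractions (~80% for Flash, ~90% for Pro) are not published figures, just assumptions backed out of the disk sizes above.

```python
# Back-of-envelope disk size for mixed FP4/FP8 weights.
# The expert fractions below are NOT published figures; they are assumptions
# inferred from the reported on-disk sizes.

def mixed_precision_gb(total_params_b: float, expert_frac: float) -> float:
    """Disk size in GB: experts at 4 bits (0.5 B/param), everything else at 8 bits."""
    bytes_per_param = expert_frac * 0.5 + (1 - expert_frac) * 1.0
    return total_params_b * bytes_per_param  # 1B params at 1 B/param = 1 GB

flash_gb = mixed_precision_gb(284, expert_frac=0.80)   # assumed ~80% of params in experts
pro_gb   = mixed_precision_gb(1600, expert_frac=0.90)  # assumed ~90% for V4-Pro

print(f"V4-Flash ~ {flash_gb:.0f} GB")  # ~170 GB, matching the reported footprint
print(f"V4-Pro   ~ {pro_gb:.0f} GB")    # ~880 GB, inside the reported 860-900 GB range
```

The same formula explains why the FP8-only base checkpoints land near their raw parameter counts: with `expert_frac = 0`, one byte per parameter is the whole story.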

| Model | Total params | Active params | Approx. disk size | Native precision |
|---|---|---|---|---|
| V4-Flash-Base | 284B | 13B | ~284 GB | FP8 mixed |
| V4-Flash (instruct) | 284B | 13B | ~170 GB | FP4 + FP8 mixed |
| V4-Pro-Base | 1.6T | 49B | ~1.6 TB | FP8 mixed |
| V4-Pro (instruct) | 1.6T | 49B | ~860–900 GB | FP4 + FP8 mixed |

What GGUF availability looks like right now

Shortly after the Hugging Face drop, a community-uploaded conversion of V4-Flash appeared under the repository tecaprovn/deepseek-v4-flash-gguf, registered as a 158B-parameter GGUF using the deepseek2 architecture. It exposes the standard quantization tiers (3-bit through 16-bit) for indexing purposes, but practical usability at sub-Q4 levels is unverified.

No official GGUF release from DeepSeek-AI exists. Unsloth has published a copy of the V4-Flash weights at unsloth/DeepSeek-V4-Flash, which is the typical staging ground before their dynamic GGUF builds appear, but that is the safetensors mirror rather than a converted GGUF. Quantizations of V4-Pro have not surfaced in usable form.


Why aggressive quantization is risky on V4

The instruct variants' experts are already trained with quantization-aware training (QAT) at FP4. That means the quantization step is not the usual "take a BF16 model and shrink it"; it is "take a model whose experts already live at 4 bits and try to shrink them further." Going below Q4 in GGUF terms compresses weights that have minimal headroom left, which typically produces broken or incoherent output.

Practical implications:

  • Q4 GGUF of V4-Flash is roughly the same size as the source weights, so disk savings are minimal.
  • Q3 or Q2 GGUF conversions of MXFP4-style experts tend to degrade severely. Expect hallucinations, repetition loops, or loss of refusal behavior long before the extra compression buys you anything.
  • Unlike DeepSeek-V3 era releases, the dramatic 1-bit and 2-bit dynamic quants that worked on FP16/BF16 source weights are not directly translatable here.

Hardware requirements for local inference

V4-Flash is the only realistic local target for most builders, and even that is demanding. The model needs to fit weights plus KV cache plus activation memory.

| Setup | Memory available | Feasibility for V4-Flash |
|---|---|---|
| Single RTX 4090 (24 GB) | 24 GB VRAM | Not viable at any quant |
| Dual RTX 6000 Pro Blackwell | ~192 GB VRAM | Viable at native FP4+FP8 |
| Mac Studio M3 Ultra (256 GB UMA) | 256 GB unified | Viable, with usable context |
| 128 GB DDR5 + 2× RTX 3090 | ~176 GB combined | Tight; needs IQ3/IQ4 GGUF and small context |
| Strix Halo dual setup (~256 GB UMA) | 256 GB unified | Viable |

V4-Pro is effectively out of scope for single-node consumer deployment. Even at ~860 GB, you need multi-node infrastructure or a serious workstation with 1 TB+ of fast memory.
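As a rough feasibility check, the sketch below adds a quantized KV cache and some activation headroom on top of the weights. The layer and head counts are placeholders, not published V4-Flash dimensions, and the hybrid attention design should store less cache than this plain grouped-query formula predicts, so treat the result as a conservative upper bound.

```python
# Rough memory budget: weights + KV cache + headroom for activations.
# The layer/head numbers are PLACEHOLDERS, not published V4-Flash dimensions,
# and V4's compressed-attention layers should store less cache than this
# standard GQA formula predicts.

def kv_cache_gb(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                bytes_per_elem: float = 0.5) -> float:  # 0.5 B/elem ~ 4-bit KV cache
    # Factor of 2 covers the K and the V tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

def fits(weights_gb: float, ctx: int, mem_gb: float,
         activation_headroom_gb: float = 8) -> tuple[bool, float]:
    need = (weights_gb
            + kv_cache_gb(ctx, n_layers=60, n_kv_heads=8, head_dim=128)  # placeholders
            + activation_headroom_gb)
    return need <= mem_gb, need

ok, need = fits(weights_gb=170, ctx=32_768, mem_gb=192)
print(ok, round(need, 1))  # a 170 GB Q4 model at 32K context squeezes into ~192 GB
```

Even under these generous assumptions the KV cache at 32K is small next to the weights, which is why the 170 GB weight footprint, not context, is the binding constraint for V4-Flash.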


Running V4-Flash GGUF with llama.cpp

Once a GGUF file is local, the runtime path is the standard llama.cpp server flow. The deepseek2 architecture is recognized by recent builds, and the hybrid attention layers compile through the same kernels used for V3.x.

Step 1: Download the GGUF shards into a single directory. V4-Flash at Q4 spans multiple files; keep them together so llama.cpp can stitch them automatically when you point at the first shard.
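If you script the download, huggingface_hub's snapshot_download can pull just one tier's shards into a single directory. The repo id is the community upload named earlier; the shard-name glob is an assumption about how the uploader named the files, so check the repo's file list first.

```python
# Pull only one quant tier's shards from the community repo. The repo id is
# the one named above; the "*Q4_K_M*" filename pattern is an ASSUMPTION about
# how the uploader named the shards -- check the repo's file list first.

def tier_patterns(tier: str = "Q4_K_M") -> list[str]:
    # One glob matches every shard of that tier (e.g. ...-00001-of-NN.gguf).
    return [f"*{tier}*"]

def download_flash_gguf(tier: str = "Q4_K_M",
                        dest: str = "models/deepseek-v4-flash") -> str:
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    # allow_patterns restricts the download to the chosen tier instead of
    # fetching every quantization in the repo.
    return snapshot_download(
        repo_id="tecaprovn/deepseek-v4-flash-gguf",
        local_dir=dest,
        allow_patterns=tier_patterns(tier),
    )
```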

Step 2: Launch llama-server with the model path, a conservative context size, and KV cache quantization to reduce memory pressure. A starting command:

```sh
./llama-server \
  --model /path/to/DeepSeekV4Flash-Q4_K_M-00001-of-NN.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --cache-type-k q4_0 \
  --cache-type-v f16 \
  --threads 16 \
  --host 127.0.0.1 \
  --port 8080
```

Step 3: Verify the model loads by hitting the /v1/models endpoint and sending a short completion. If output is incoherent or the model emits only special tokens, your prompt formatting is wrong (see the next section).


The chat template gotcha

DeepSeek-V4 ships without a Jinja chat template. That breaks the default behavior of most front-ends, which assume a template is embedded in the tokenizer config. Instead, the model repository includes Python encoding scripts (encoding_dsv4.py) that handle message-to-string conversion and parse reasoning content from completions.

If you feed V4-Flash a generic ChatML or Llama-style prompt, expect quality to collapse. The model is also strict about role ordering: system, user, and assistant turns must follow the exact sequence the encoding script produces. Skipping a turn or merging roles drops effective reasoning sharply.

For llama.cpp users, this means you need to either:

  • Pre-format prompts with the official Python encoder and pass raw strings to the completions endpoint, or
  • Wait for llama.cpp's --chat-template handling to ship a built-in deepseek-v4 preset.
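Until a preset lands, it is worth at least guarding against ordering mistakes. The check below is not the official encoder (the real message-to-string conversion lives in encoding_dsv4.py, whose function names are not reproduced here); it is just a sketch of the role-sequence rule described above: an optional leading system turn, then strictly alternating user/assistant turns.

```python
# NOT the official encoder from encoding_dsv4.py -- just a guard for the
# role-ordering rule the model is strict about: one optional system turn,
# then strictly alternating user/assistant turns starting with user.

def check_role_order(messages: list[dict]) -> bool:
    roles = [m["role"] for m in messages]
    if roles and roles[0] == "system":
        roles = roles[1:]                 # a single leading system turn is fine
    if not roles or roles[0] != "user":
        return False                      # conversation must open with a user turn
    expected = ["user", "assistant"]
    return all(r == expected[i % 2] for i, r in enumerate(roles))

assert check_role_order([{"role": "system", "content": "You are helpful."},
                         {"role": "user", "content": "hi"}])
assert not check_role_order([{"role": "user", "content": "a"},
                             {"role": "user", "content": "b"}])  # merged roles
```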

Reasoning modes and context budget

V4-Flash supports three reasoning effort levels: Non-think, Think High, and Think Max. Think Max is the configuration that pushes the model closest to V4-Pro performance, and it requires a context window of at least 384K tokens to avoid truncation of internal reasoning traces.

For local deployment on constrained hardware, Think Max is rarely practical. A 384K context with V4-Flash blows past the memory budget of most setups even with quantized KV cache. Non-think and Think High at 32K–128K context are the realistic working zones for GGUF inference on workstations.

Recommended sampling parameters from the model card: temperature = 1.0, top_p = 1.0. Lower temperatures degrade reasoning quality on this architecture more than on V3.
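In request terms, that might look like the sketch below. The temperature and top_p values are the model-card defaults quoted above; the context floors and the reasoning_mode field name are assumptions, since how the effort level is actually selected depends on your front-end.

```python
# Defaults from the model card (temperature = 1.0, top_p = 1.0) plus the
# context budgets discussed above. The "reasoning_mode" field name and the
# exact floors are ASSUMPTIONS -- mode selection depends on your front-end.

CTX_FLOOR = {"non-think": 32_768, "think-high": 32_768, "think-max": 384_000}

def sampling_params(mode: str, ctx: int) -> dict:
    if ctx < CTX_FLOOR[mode]:
        raise ValueError(f"{mode} needs at least {CTX_FLOOR[mode]} context tokens")
    return {"temperature": 1.0, "top_p": 1.0, "reasoning_mode": mode}
```

The point of the guard is the Think Max floor: at 384K tokens it fails on any workstation-sized context, which is the article's argument for sticking to Non-think or Think High locally.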


What to expect from quality at quantized sizes

Because V4-Flash experts are already FP4-trained, a Q4_K_M GGUF retains most of the source quality. Drops below that are where things get fragile:

| Quant tier | Approx. size (V4-Flash) | Expected quality |
|---|---|---|
| Q8_0 | ~300 GB | Wasteful; experts are upcast from FP4 |
| Q5_K_M | ~210 GB | Marginal gain over Q4, larger footprint |
| Q4_K_M | ~170 GB | Baseline; matches native precision closely |
| IQ4_XS | ~150 GB | Usable with minor degradation |
| Q3_K_XL | ~130 GB | Noticeable quality loss; coding/reasoning suffer |
| Q2_K and below | <110 GB | Not recommended; output reliability collapses |

This is a sharp departure from V3.2-era guidance, where Q2_K_XL dynamic quants from Unsloth produced workable output. The QAT FP4 experts in V4 leave no room for that kind of compression.


API and hosted alternatives

For users who cannot run V4-Flash locally, the model is accessible through the official DeepSeek API at api.deepseek.com using the OpenAI-compatible format with deepseek-v4-flash or deepseek-v4-pro as the model identifier. The chat interface at chat.deepseek.com exposes V4-Flash as Instant Mode and V4-Pro as Expert Mode.
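A minimal client, assuming the OpenAI-compatible request shape and the model ids above (the API key is read from a DEEPSEEK_API_KEY environment variable here by convention):

```python
# Minimal client for the official OpenAI-compatible endpoint. The base URL and
# model ids are the ones given above; DEEPSEEK_API_KEY is your own key.
import json
import os
from urllib import request

def chat_payload(user_msg: str, model: str = "deepseek-v4-flash") -> dict:
    assert model in ("deepseek-v4-flash", "deepseek-v4-pro")
    return {"model": model, "messages": [{"role": "user", "content": user_msg}]}

def deepseek_chat(user_msg: str, model: str = "deepseek-v4-flash") -> str:
    req = request.Request(
        "https://api.deepseek.com/v1/chat/completions",
        data=json.dumps(chat_payload(user_msg, model)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```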

vLLM has shipped support for the hybrid attention architecture, making it the recommended path for organizations with multi-GPU servers that want to self-host without going through GGUF conversion. The full safetensors weights deploy cleanly through vLLM's tensor parallelism with --tensor-parallel-size.


How to verify your GGUF is working correctly

A correctly loaded V4-Flash GGUF should pass three quick checks. First, a non-think coding prompt like "write a Python function that reverses a string in place" should produce clean code with no trailing garbage or repeated tokens. Second, a Think High math prompt should emit a <think> block followed by a </think> closer and a final answer. Third, a long-context retrieval test at 32K+ should return content from the prompt's start, not just the tail.

If any of these fail, the most common cause is prompt formatting rather than the GGUF itself. Re-run the checks with the official encoding script formatting your prompts before assuming the quantization is broken.

The path forward for V4 GGUF on local hardware is narrow but real. V4-Flash at Q4_K_M is the sweet spot, V4-Pro remains a server-class model, and aggressive sub-Q4 quantization is a dead end given the FP4-trained experts. Treat the 170 GB Flash footprint as a floor rather than a target you can shrink past.