The release of DeepSeek-V4-Pro and DeepSeek-V4-Flash on April 24, 2026, reset expectations for local large-model deployment. Anyone hoping to grab a small GGUF and run it on a single consumer GPU faces a hard constraint baked into the weights themselves: both models ship in mixed FP4 + FP8 precision, with MoE expert parameters already stored at 4 bits. That changes what GGUF quantization can realistically do.
Quick answer: A community GGUF conversion of DeepSeek-V4-Flash exists at roughly 158B-parameter footprint and around 170 GB on disk. No official GGUF is published. Because the source weights are already FP4+FP8, sub-Q4 quantization is unlikely to produce usable output, and full-fat V4-Pro at 1.6T parameters remains out of reach for typical local rigs.
The two models and why size estimates mislead
DeepSeek-V4 is a two-model family. V4-Pro carries 1.6 trillion total parameters with 49B active per token. V4-Flash carries 284B total with 13B active. Both support a 1M-token context window through a hybrid attention design that combines Compressed Sparse Attention and Heavily Compressed Attention.
The wrinkle that matters for GGUF: parameter count does not map cleanly to file size here. MoE expert weights are stored in FP4, while most other parameters use FP8. V4-Flash, despite its 284B parameter label, occupies around 170 GB on disk rather than the ~284 GB a uniformly FP8 model would require. V4-Pro lands near 860–900 GB rather than the naive 1.6 TB estimate.
| Model | Total params | Active params | Approx. disk size | Native precision |
|---|---|---|---|---|
| V4-Flash-Base | 284B | 13B | ~284 GB | FP8 mixed |
| V4-Flash (instruct) | 284B | 13B | ~170 GB | FP4 + FP8 mixed |
| V4-Pro-Base | 1.6T | 49B | ~1.6 TB | FP8 mixed |
| V4-Pro (instruct) | 1.6T | 49B | ~860–900 GB | FP4 + FP8 mixed |
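A back-of-the-envelope check makes those figures less mysterious: disk size is roughly parameters × bits per parameter ÷ 8, summed over the FP4 and FP8 portions. The expert-versus-dense split used below is an illustrative assumption, not a published breakdown.

```python
# Rough on-disk size estimate for a mixed-precision checkpoint.
# The expert vs. non-expert split is an illustrative assumption,
# not a number published by DeepSeek.

def approx_size_gb(total_params_b: float, expert_fraction: float,
                   expert_bits: float, other_bits: float) -> float:
    """Estimate on-disk size in GB (billions of bytes) for mixed precision."""
    expert_params = total_params_b * expert_fraction        # billions of params
    other_params = total_params_b * (1 - expert_fraction)   # billions of params
    return expert_params * expert_bits / 8 + other_params * other_bits / 8

# Assumed split: ~80% of V4-Flash's weights in FP4 experts, the rest in FP8.
print(round(approx_size_gb(284, 0.80, expert_bits=4, other_bits=8)))   # ~170 GB
# A ~90% FP4 expert share would reproduce V4-Pro's reported footprint.
print(round(approx_size_gb(1600, 0.90, expert_bits=4, other_bits=8)))  # ~880 GB
```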
What GGUF availability looks like right now
Shortly after the Hugging Face drop, a community-uploaded conversion of V4-Flash appeared under the repository tecaprovn/deepseek-v4-flash-gguf, registered as a 158B-parameter GGUF using the deepseek2 architecture. It exposes the standard quantization tiers (3-bit through 16-bit) for indexing purposes, but practical usability at sub-Q4 levels is unverified.
No official GGUF release from DeepSeek-AI exists. Unsloth has published a copy of the V4-Flash weights at unsloth/DeepSeek-V4-Flash, which is the typical staging ground before their dynamic GGUF builds appear, but that is the safetensors mirror rather than a converted GGUF. Quantizations of V4-Pro have not surfaced in usable form.
Why aggressive quantization is risky on V4
The instruct variants are already QAT-trained at FP4 for experts. That means the quantization step is not the usual "take a BF16 model and shrink it"; it is "take a model whose experts already live at 4 bits and try to shrink them further." Going below Q4 in GGUF terms compresses weights that have minimal headroom left, which typically produces broken or incoherent output.
Practical implications:
- Q4 GGUF of V4-Flash is roughly the same size as the source weights, so disk savings are minimal.
- Q3 or Q2 GGUF conversions of MXFP4-style experts tend to degrade severely. Expect hallucinations, repetition, or loss of refusal behavior before you reach a usable speedup.
- Unlike DeepSeek-V3 era releases, the dramatic 1-bit and 2-bit dynamic quants that worked on FP16/BF16 source weights are not directly translatable here.
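To make the headroom argument concrete, here is a toy numerical sketch using plain symmetric uniform quantization rather than GGUF's actual K-quant or IQ schemes: quantization error climbs steeply with every bit removed, and a model whose experts were trained at 4 bits already sits near the bottom of that curve.

```python
# Toy illustration of why sub-4-bit leaves so little headroom. This uses plain
# symmetric uniform quantization, not GGUF's actual K-quant / IQ schemes.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 1_000_000).astype(np.float32)  # stand-in weight tensor

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    """Snap values to a symmetric uniform grid with 2**(bits-1) - 1 positive levels."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

for bits in (8, 4, 3, 2):
    err = np.mean((w - fake_quant(w, bits)) ** 2)
    print(f"{bits}-bit mean squared error: {err:.5f}")

# Error grows by roughly 4x for each bit removed. Experts trained at 4 bits are
# already at the edge of that curve; pushing the same tensors to 3 or 2 bits
# strips out most of the signal that remains.
```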
Hardware requirements for local inference
V4-Flash is the only realistic local target for most builders, and even that is demanding. The model needs to fit weights plus KV cache plus activation memory.
| Setup | Memory available | Feasibility for V4-Flash |
|---|---|---|
| Single RTX 4090 (24 GB) | 24 GB VRAM | Not viable at any quant |
| Dual RTX 6000 Pro Blackwell | ~192 GB VRAM | Viable at native FP4+FP8 |
| Mac Studio M3 Ultra (256 GB UMA) | 256 GB unified | Viable, with usable context |
| 128 GB DDR5 + 2× RTX 3090 | ~176 GB combined | Tight; needs IQ3/IQ4 GGUF and small context |
| Strix Halo dual setup (~256 GB UMA) | 256 GB unified | Viable |
V4-Pro is effectively out of scope for single-node consumer deployment. Even at ~860 GB, you need multi-node infrastructure or a serious workstation with 1 TB+ of fast memory.
Running V4-Flash GGUF with llama.cpp
Once a GGUF file is local, the runtime path is the standard llama.cpp server flow. The deepseek2 architecture is recognized by recent builds, and the hybrid attention layers compile through the same kernels used for V3.x.
Step 1: Download the GGUF shards into a single directory. V4-Flash at Q4 spans multiple files; keep them together so llama.cpp can stitch them automatically when you point at the first shard.
Step 2: Launch llama-server with the model path, a conservative context size, and KV cache quantization to reduce memory pressure. A starting command:
```
./llama-server \
  --model /path/to/DeepSeekV4Flash-Q4_K_M-00001-of-NN.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --cache-type-k q4_0 \
  --cache-type-v f16 \
  --threads 16 \
  --host 127.0.0.1 \
  --port 8080
```

Step 3: Verify the model loads by hitting the /v1/models endpoint and sending a short completion. If output is incoherent or the model emits only special tokens, your prompt formatting is wrong (see the next section).
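A minimal smoke test for that server, assuming the host and port from the command above and llama.cpp's standard /v1/models and /completion endpoints:

```python
# Quick smoke test for the llama-server instance started above.
import requests

BASE = "http://127.0.0.1:8080"

# 1. The server should list exactly one loaded model.
models = requests.get(f"{BASE}/v1/models", timeout=30).json()
print([m["id"] for m in models["data"]])

# 2. A short raw completion. This deliberately bypasses any chat template;
#    V4 prompt formatting is handled separately (see the next section).
resp = requests.post(
    f"{BASE}/completion",
    json={"prompt": "The capital of France is", "n_predict": 16},
    timeout=120,
)
print(resp.json()["content"])
```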
The chat template gotcha
DeepSeek-V4 ships without a Jinja chat template. That breaks the default behavior of most front-ends, which assume a template is embedded in the tokenizer config. Instead, the model repository includes Python encoding scripts (encoding_dsv4.py) that handle message-to-string conversion and parse reasoning content from completions.
If you feed V4-Flash a generic ChatML or Llama-style prompt, expect quality to collapse. The model is also strict about role ordering: system, user, and assistant turns must follow the exact sequence the encoding script produces. Skipping a turn or merging roles drops effective reasoning sharply.
For llama.cpp users, this means you need to either:
- Pre-format prompts with the official Python encoder and pass raw strings to the completions endpoint (a sketch of this approach follows this list), or
- Wait for llama.cpp's --chat-template handling to ship a built-in deepseek-v4 preset.
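A sketch of the first option, assuming a hypothetical encode_messages entry point in encoding_dsv4.py; the actual function name and signature may differ, so check the script in the model repository:

```python
# Sketch: build the prompt string with DeepSeek's own encoder, then send it as
# a raw completion. `encode_messages` is a guessed name; consult encoding_dsv4.py
# in the model repo for the real entry point and signature.
import requests
from encoding_dsv4 import encode_messages  # hypothetical import

messages = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Write a Python function that reverses a string in place."},
]

# Let the official script produce the exact string the model was trained on,
# including strict role ordering and any reasoning-mode markers.
prompt = encode_messages(messages)  # hypothetical call

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": prompt,
        "n_predict": 512,
        # Sampling values recommended by the model card.
        "temperature": 1.0,
        "top_p": 1.0,
    },
    timeout=300,
)
print(resp.json()["content"])
```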
Reasoning modes and context budget
V4-Flash supports three reasoning effort levels: Non-think, Think High, and Think Max. Think Max is the configuration that pushes the model closest to V4-Pro performance, and it requires a context window of at least 384K tokens to avoid truncation of internal reasoning traces.
For local deployment on constrained hardware, Think Max is rarely practical. A 384K context with V4-Flash blows past the memory budget of most setups even with quantized KV cache. Non-think and Think High at 32K–128K context are the realistic working zones for GGUF inference on workstations.
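A generic KV-cache estimate shows why. The layer count and per-layer KV width below are placeholders rather than V4-Flash's real dimensions, and the compressed hybrid attention should land well under this naive figure, but the order of magnitude is the point.

```python
# Generic KV-cache budget estimate. The layer/width numbers are placeholders,
# NOT DeepSeek-V4-Flash's real dimensions; compressed hybrid attention should
# shrink the real figure substantially.

def kv_cache_gb(ctx_tokens: int, n_layers: int, kv_dim: int, bytes_per_elem: float) -> float:
    """Bytes for keys plus values across all layers, expressed in GB."""
    return 2 * n_layers * kv_dim * bytes_per_elem * ctx_tokens / 1e9

# Placeholder shape: 60 layers, 4096-wide KV per layer, FP16 cache.
for ctx in (32_768, 131_072, 393_216):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx, 60, 4096, 2):.0f} GB (fp16 cache)")

# Even a 4x reduction from cache quantization leaves the 384K row near 100 GB
# under these assumptions, on top of ~170 GB of weights.
```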
Recommended sampling parameters from the model card: temperature = 1.0, top_p = 1.0. Lower temperatures degrade reasoning quality on this architecture more than on V3.
What to expect from quality at quantized sizes
Because V4-Flash experts are already FP4-trained, a Q4_K_M GGUF retains most of the source quality. Drops below that are where things get fragile:
| Quant tier | Approx. size (V4-Flash) | Expected quality |
|---|---|---|
| Q8_0 | ~300 GB | Wasteful; experts are upcast from FP4 |
| Q5_K_M | ~210 GB | Marginal gain over Q4, larger footprint |
| Q4_K_M | ~170 GB | Baseline; matches native precision closely |
| IQ4_XS | ~150 GB | Usable with minor degradation |
| Q3_K_XL | ~130 GB | Noticeable quality loss; coding/reasoning suffer |
| Q2_K and below | <110 GB | Not recommended; output reliability collapses |
This is a sharp departure from V3.2-era guidance, where Q2_K_XL dynamic quants from Unsloth produced workable output. The QAT FP4 experts in V4 leave no room for that kind of compression.
API and hosted alternatives
For users who cannot run V4-Flash locally, the model is accessible through the official DeepSeek API at api.deepseek.com using the OpenAI-compatible format with deepseek-v4-flash or deepseek-v4-pro as the model identifier. The chat interface at chat.deepseek.com exposes V4-Flash as Instant Mode and V4-Pro as Expert Mode.
vLLM has shipped support for the hybrid attention architecture, making it the recommended path for organizations with multi-GPU servers that want to self-host without going through GGUF conversion. The full safetensors weights deploy cleanly through vLLM's tensor parallelism with --tensor-parallel-size.
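A minimal vLLM sketch, assuming an eight-GPU node and a placeholder Hugging Face repository id (substitute the actual official upload or the unsloth mirror mentioned above):

```python
# Minimal vLLM sketch for multi-GPU self-hosting of the safetensors weights.
# The model id is an assumption; substitute the real Hugging Face repo path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # assumed repo name
    tensor_parallel_size=8,                 # match your GPU count
    max_model_len=131072,                   # trim to fit the KV-cache budget
)

params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=512)
outputs = llm.generate(
    ["Write a Python function that reverses a string in place."], params
)
print(outputs[0].outputs[0].text)
```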
How to verify your GGUF is working correctly
A correctly loaded V4-Flash GGUF should pass three quick checks. First, a non-think coding prompt like "write a Python function that reverses a string in place" should produce clean code with no trailing garbage or repeated tokens. Second, a Think High math prompt should emit a <think> reasoning block, a closing </think>, and a final answer. Third, a long-context retrieval test at 32K+ should return content from the prompt's start, not just the tail.
If any of these fail, the most common cause is prompt formatting rather than the GGUF itself. Reload with the official encoding script in front of your inference endpoint before assuming the quantization is broken.
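Those three checks are easy to script. The sketch below assumes you already have a complete(prompt) helper that wraps your prompt formatting and inference endpoint, for instance the requests-based calls from the earlier examples.

```python
# Sketch of the three sanity checks as code. `complete` is whatever function you
# already use to send a fully formatted prompt to the model and get text back.
import re
from typing import Callable

def run_checks(complete: Callable[[str], str]) -> None:
    # 1. Non-think coding prompt: output should not degenerate into repetition.
    code = complete("Write a Python function that reverses a string in place.")
    words = code.split()
    assert len(set(words)) > len(words) // 4, "suspiciously repetitive output"

    # 2. Think High math prompt: the reasoning trace should be properly delimited.
    math = complete("What is 17 * 23? Think step by step.")
    assert "<think>" in math and "</think>" in math, "missing reasoning delimiters"

    # 3. Long-context retrieval: a fact planted at the start must survive ~32K
    #    tokens of filler (assumes the server context window is large enough).
    needle = "The vault code is 48215."
    haystack = needle + " " + ("Filler sentence about nothing in particular. " * 4000)
    recall = complete(haystack + "\nWhat is the vault code?")
    assert re.search(r"48215", recall), "failed to retrieve content from prompt start"

    print("All three checks passed.")
```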
The path forward for V4 GGUF on local hardware is narrow but real. V4-Flash at Q4_K_M is the sweet spot, V4-Pro remains a server-class model, and aggressive sub-Q4 quantization is a dead end given the FP4-trained experts. Treat the 170 GB Flash footprint as a floor rather than a target you can shrink past.