
Kimi K2.6: What Moonshot AI's new open model actually does

Shivam Malani

Kimi K2.6 is Moonshot AI's latest open-source model, a Mixture-of-Experts system with 1 trillion total parameters and 32 billion active per token. It ships with open weights on Hugging Face under a Modified MIT license, native INT4 quantization, and a 256K context window, and it's aimed squarely at long-horizon coding, agentic workflows, and coding-driven design.

Quick answer: K2.6 is a 1T-parameter MoE model (32B active) with native INT4 weights, a 256K context window, and multimodal input. It runs on vLLM, SGLang, and KTransformers, and is accessible through Moonshot's API plus the Kimi Code CLI.

What Kimi K2.6 is

K2.6 is the successor to K2.5 and shares the same underlying architecture, which means existing K2.5 deployments can swap in the new weights without reconfiguring their inference stack. Moonshot describes it as a native multimodal agentic model with a focus on four practical capabilities: long-horizon coding across languages like Rust, Go, and Python; coding-driven design that turns prompts and images into working interfaces; an elevated agent swarm that can coordinate up to 300 sub-agents over 4,000 steps; and proactive orchestration for persistent background agents.

The model is available on Hugging Face with full weights, and Moonshot also runs a hosted API at platform.moonshot.ai that's compatible with both OpenAI and Anthropic client SDKs.


Architecture and specs

Spec                          Value
Architecture                  Mixture-of-Experts (MoE)
Total parameters              1T
Active parameters per token   32B
Layers (incl. dense)          61
Attention heads               64
Experts                       384 (8 selected + 1 shared per token)
Attention mechanism           MLA (Multi-head Latent Attention)
Activation                    SwiGLU
Vocabulary                    160K
Context length                256K tokens
Vision encoder                MoonViT (400M params)

The sparse expert routing is the key efficiency lever. Only 32 billion of the 1 trillion parameters fire for any given token, which keeps per-token compute cost closer to a mid-size dense model while giving the system a much larger knowledge base to draw from.
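The routing arithmetic can be sketched in a few lines. This is an illustrative top-k router, not Moonshot's implementation; the scores are random stand-ins for the learned router's logits:

```python
# Illustrative top-k MoE routing: a router scores every expert per token,
# and only the 8 highest-scoring experts plus 1 always-on shared expert
# contribute to that token's forward pass.
import random

TOTAL_EXPERTS = 384
TOP_K = 8

def route(router_scores):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    selected = ranked[:TOP_K]
    shared = [0]  # one shared expert that fires for every token
    return selected, shared

random.seed(0)
scores = [random.random() for _ in range(TOTAL_EXPERTS)]
experts, shared = route(scores)
active_fraction = (len(experts) + len(shared)) / TOTAL_EXPERTS
print(len(experts), round(active_fraction, 4))  # prints: 8 0.0234
```

Only about 2.3% of the expert pool fires per token, which is the mechanism behind the 32B-active-of-1T figure.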


Benchmark performance

Moonshot reports K2.6 with thinking mode enabled and compares it to GPT-5.4 at xhigh reasoning, Claude Opus 4.6 at max effort, and Gemini 3.1 Pro at high thinking. The headline numbers place it competitively at the frontier on agentic and coding tasks, while trailing slightly on some pure-reasoning benchmarks.

Benchmark                   K2.6    GPT-5.4   Opus 4.6   Gemini 3.1 Pro
HLE-Full (w/ tools)         54.0    52.1      53.0       51.4
BrowseComp                  83.2    82.7      83.7       85.9
BrowseComp (Agent Swarm)    86.3    —         —          —
DeepSearchQA (accuracy)     83.0    63.7      80.6       60.2
SWE-Bench Verified          80.2    80.8      80.6       —
SWE-Bench Pro               58.6    57.7      53.4       54.2
Terminal-Bench 2.0          66.7    65.4      65.4       68.5
LiveCodeBench v6            89.6    88.8      91.7       —
AIME 2026                   96.4    99.2      96.7       98.3
GPQA-Diamond                90.5    92.8      91.3       94.3
MMMU-Pro                    79.4    81.2      73.9       83.0

(— indicates a score not reported in Moonshot's comparison.)

The agent swarm configuration on BrowseComp, where K2.6 jumps to 86.3, is a specific capability Moonshot highlights. The model can fan out to hundreds of sub-agents to parallelize information gathering, which is difficult to replicate with closed models that restrict parallel tool use.
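The fan-out/gather shape of a swarm can be sketched with plain asyncio. This is a hypothetical structure — Moonshot has not published the swarm internals — and sub_agent stands in for a browse or search tool call:

```python
# Hypothetical sketch of the fan-out/gather pattern behind an agent swarm:
# a coordinator dispatches many sub-agents concurrently and merges results.
import asyncio

async def sub_agent(task_id):
    await asyncio.sleep(0)  # stand-in for a real browse/search tool call
    return f"result-{task_id}"

async def swarm(n_agents):
    """Run n_agents sub-agents concurrently and collect their results."""
    return await asyncio.gather(*(sub_agent(i) for i in range(n_agents)))

results = asyncio.run(swarm(300))  # 300 matches K2.6's reported ceiling
print(len(results))  # prints: 300
```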


Native INT4 quantization

One of the more interesting technical choices is Quantization-Aware Training for the INT4 variant. Rather than compressing weights after training (post-training quantization), K2.6's INT4 model is trained with the quantization constraints in the loop. The practical effect is roughly 2x faster inference compared to FP16, about 50% less GPU memory, and benchmark scores that stay within 1–2% of the full-precision baseline.

The INT4 weights are around 594GB on Hugging Face, versus roughly 2TB for FP16. That changes the hardware math significantly.

Precision     Model size   Min GPU memory   Typical config
FP16 / BF16   ~2TB         ~640GB+ VRAM     8× H100 80GB
FP8           ~1TB         ~320GB+ VRAM     8× A100 80GB
INT4 (QAT)    ~594GB       ~320GB+ VRAM     4× H100 80GB
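Two quick back-of-envelope checks on those numbers: raw weight size at different bit widths, plus a minimal "fake quantization" round-trip of the kind QAT keeps in the training loop. This is heavily simplified — production QAT uses per-group scales and straight-through gradient estimators, neither of which appears here:

```python
# Raw weight storage for a 1T-parameter model at various bit widths.
def weight_size_tb(params, bits):
    """Bytes of weight storage, expressed in TB (1e12 bytes)."""
    return params * bits / 8 / 1e12

print(weight_size_tb(1e12, 16))  # prints: 2.0  (FP16, matches the ~2TB figure)
print(weight_size_tb(1e12, 4))   # prints: 0.5  (INT4; on-disk size is larger
                                 #  once scales and metadata are included)

def fake_quant_int4(w, scale):
    """Quantize one weight to a signed 4-bit grid, then dequantize it back."""
    q = max(-8, min(7, round(w / scale)))  # int4 integer range is [-8, 7]
    return q * scale

# During QAT the forward pass sees this snapped value, so the network learns
# weights that survive the rounding error.
print(round(fake_quant_int4(0.1234, 0.02), 4))  # prints: 0.12
```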

Self-hosting options

Three inference engines officially support K2.6: vLLM, SGLang, and KTransformers. All three require transformers>=4.57.1,<5.0.0 and expose an OpenAI-compatible chat completions endpoint.

vLLM is the most general-purpose choice, with PagedAttention and continuous batching for high-throughput serving. A typical INT4 launch looks like this:


python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.6-INT4 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000

SGLang is built for structured generation, constrained decoding, and multi-turn workloads. Its RadixAttention caches KV state across conversation turns, which tends to help agentic loops where the same system prompt and tool definitions repeat.

KTransformers is Moonshot's first-party engine, tuned specifically for K2's MoE routing pattern and MLA attention. It also supports CPU offloading of inactive experts, which can lower the total GPU VRAM requirement for teams that don't have a full 4× or 8× H100 node available.


Thinking vs Instant mode

K2.6 exposes two generation modes. Thinking mode produces a visible reasoning trace before the final answer and is tuned for complex reasoning, multi-step coding, and agentic tasks. Instant mode skips the reasoning trace for faster, lower-overhead responses on straightforward queries.

Parameter       Thinking         Instant
Temperature     1.0              0.6
top_p           0.95             0.95
thinking flag   True (default)   False

On vLLM or SGLang, you switch to Instant mode by passing chat_template_kwargs: {"thinking": False} in the request body. On Moonshot's official API, the equivalent is thinking: {"type": "disabled"}.
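A minimal sketch of the request-body difference between the two modes when self-hosting. The field names come from the table above; build_request is a hypothetical helper, and the model name mirrors the vLLM launch example:

```python
# Build a chat-completions request body for a self-hosted K2.6 server,
# toggling thinking mode via chat_template_kwargs (vLLM/SGLang convention;
# Moonshot's hosted API uses thinking: {"type": "disabled"} instead).
def build_request(prompt, thinking=True):
    return {
        "model": "moonshotai/Kimi-K2.6-INT4",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0 if thinking else 0.6,  # per-mode defaults
        "top_p": 0.95,
        "chat_template_kwargs": {"thinking": thinking},
    }

print(build_request("hi", thinking=False)["temperature"])  # prints: 0.6
```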


How it's accessed today

At launch, K2.6 is labeled as a code preview in Moonshot's developer console and is primarily reached through the Kimi Code CLI. The standard Kimi web chat at kimi.com still routes the general agent to K2.5, which has caused some confusion for users who expect to pick K2.6 from a model dropdown. Inside the Kimi Code console, opting into the beta program exposes the flagship as k2.6-code-preview.

There's also a quirk around authentication: the K2.6 preview has been available to OAuth users of Kimi Code, while API-key auth paths have sometimes been limited to K2.5. This behavior may change as the preview graduates, but it's worth testing both auth flows if K2.6 doesn't appear where expected.

For users who want a hosted agent setup without running their own CLI, Moonshot's Kimi Claw feature provides a one-click deployment that wires the K2.6 coding plan into a cloud-hosted OpenClaw environment, including messaging-app connectors. K2.6's subscription plans are priced significantly lower than equivalent per-token API usage on Claude or GPT-class models, which is the main draw for developers running high-volume coding agents.


Cost tradeoffs for self-hosting

The break-even point between Moonshot's API and self-hosted infrastructure depends almost entirely on monthly token volume. Self-hosting on a 4× H100 INT4 node runs roughly $8,000–$12,000 per month in cloud GPU costs, versus API pricing that scales linearly with usage.

Monthly volume   API cost (est.)    4× H100 INT4
10M tokens       ~$15–$30           ~$8,000–$12,000
500M tokens      ~$750–$1,500       ~$8,000–$12,000
5B tokens        ~$7,500–$15,000    ~$8,000–$12,000
20B+ tokens      ~$30,000–$60,000   ~$8,000–$12,000

Below roughly 5 billion tokens per month, the API is cheaper. Above that, self-hosting on INT4 can save 60–80% while also giving teams data sovereignty, custom batching, and no rate limits.
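The break-even claim follows from simple arithmetic. The ~$2 per million tokens here is the midpoint implied by the table's $15–$30 per 10M row, and the ~$10k/month node cost is the midpoint of the quoted range — both are assumptions:

```python
# Break-even between linearly scaling API pricing and a fixed-cost GPU node.
def api_cost_usd(tokens, price_per_m_tokens=2.0):
    """API bill that scales linearly with monthly token volume."""
    return tokens / 1e6 * price_per_m_tokens

def breakeven_tokens(node_cost_usd=10_000, price_per_m_tokens=2.0):
    """Monthly token volume where the API bill matches the fixed node cost."""
    return node_cost_usd / price_per_m_tokens * 1e6

print(f"{breakeven_tokens():,.0f} tokens/month")  # prints: 5,000,000,000 tokens/month
```

At the assumed midpoints the crossover lands at ~5B tokens/month, consistent with the table.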


Where K2.6 fits

K2.6 is best understood as an open-weights alternative to Claude Opus and GPT-5-class models for coding and agent workloads, with two specific advantages: the weights are freely redistributable under a Modified MIT license, and the model plugs into third-party agent frameworks like OpenClaw and Hermes that closed APIs have been restricting. The tradeoffs are a smaller context window than Claude's 1M-token ceiling, no polished desktop app at the level of Claude Code, and a coding speed that trails Opus 4.7 in side-by-side tests.

For teams building agent swarms, running high-volume coding pipelines, or needing on-prem deployment, the combination of native INT4, MoE efficiency, and open weights makes K2.6 one of the more practical frontier-class models to actually deploy right now.