Pairing an NVIDIA DGX Spark with an Apple Mac Studio (M3 Ultra) lets a single large language model run faster than either machine can manage alone. The trick is splitting the request: the Spark handles the compute-heavy prefill, the Mac Studio handles the memory-bandwidth-heavy decode, and the KV cache streams between them while the work is still in flight. EXO Labs measured a 2.8x end-to-end speedup over the Mac Studio alone on Llama-3.1 8B with an 8,192-token prompt, using EXO 1.0 to orchestrate the cluster.
Quick answer: Run prefill on the DGX Spark (≈100 TFLOPs FP16, 128 GB at 273 GB/s), stream the KV cache layer-by-layer over the network, and run decode on the Mac Studio M3 Ultra (≈26 TFLOPs FP16, 512 GB at 819 GB/s). EXO 1.0 detects both devices, profiles them, and assigns phases automatically.
Why one box is the wrong unit
Two numbers shape the user-visible experience of a local LLM. Time-to-first-token (TTFT) is set by the prefill phase, where the model ingests the prompt and writes the KV cache. Tokens-per-second (TPS) is set by the decode phase, where the model generates one token at a time while reading that cache.
Prefill is compute-bound. With Flash Attention, data movement scales linearly with prompt length while compute scales quadratically, which pushes arithmetic intensity high enough that the GPU's matrix throughput dominates. Decode is memory-bound. Each new token requires matrix-vector multiplies against the full set of weights plus attention reads over the entire KV cache, so memory bandwidth, not FLOPs, sets the ceiling.
The DGX Spark and the Mac Studio are mirror images of each other on those two axes. The Spark has roughly 4x the FP16 compute of the M3 Ultra. The M3 Ultra has roughly 3x the memory bandwidth of the Spark. Running the whole request on either machine wastes one of those advantages.
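A back-of-the-envelope roofline sketch makes the asymmetry concrete. It treats prefill as compute-bound at roughly 2 FLOPs per parameter per prompt token and decode as bound by streaming the FP16 weights once per generated token; the device numbers are the published specs, everything else is this estimate's assumption.

```python
# Roofline estimate (not EXO's scheduler): prefill modeled as compute-bound
# (~2 FLOPs per parameter per prompt token), decode as memory-bound (the full
# FP16 weights stream once per generated token).
PARAMS = 8e9           # Llama-3.1 8B
BYTES_PER_PARAM = 2    # FP16
PROMPT, GEN = 8192, 32

devices = {
    #              FP16 FLOP/s   mem bytes/s
    "DGX Spark":  (100e12,       273e9),
    "M3 Ultra":   (26e12,        819e9),
}

for name, (flops, bw) in devices.items():
    prefill = 2 * PARAMS * PROMPT / flops            # compute-bound phase
    decode = GEN * PARAMS * BYTES_PER_PARAM / bw     # bandwidth-bound phase
    print(f"{name}: prefill ~{prefill:.2f}s, decode ~{decode:.2f}s")

# Prints ~1.31s prefill for the Spark and ~0.63s decode for the M3 Ultra --
# the same shape as the measured results below once real-world overheads land.
```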
How disaggregated prefill and decode works
The naive split is to run prefill on the high-compute device, wait for it to finish, ship the entire KV cache across the network, then start decode on the high-bandwidth device. That works, but the transfer sits on the critical path and eats the speedup.
EXO's pipeline streams the KV cache layer by layer. As soon as layer 1's KV vectors are computed on the Spark, they begin transferring to the Mac Studio while layer 2's prefill starts on the Spark. Communication for each layer overlaps with computation of later layers, so the network cost hides behind compute that was going to happen anyway. Decode begins on the Mac Studio the moment the final layer arrives.
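In code, the overlap pattern looks something like the sketch below. It is illustrative rather than EXO's implementation; `compute_kv` and `send_kv` are hypothetical stand-ins for the prefill kernel and the network transfer.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative layer-by-layer streaming (not EXO's actual code). Sends run on
# a background thread while the next layer's prefill proceeds on the GPU.
def streamed_prefill(layers, compute_kv, send_kv):
    pending = []
    with ThreadPoolExecutor(max_workers=1) as sender:  # single worker keeps transfers in layer order
        for layer in layers:
            kv = compute_kv(layer)                             # prefill this layer on the Spark
            pending.append(sender.submit(send_kv, layer, kv))  # overlap transfer with next layer
        for f in pending:   # decode on the Mac can start once the
            f.result()      # final layer's KV has landed
```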
Communication is fully hidden when per-layer compute time exceeds per-layer transfer time. With a 10 GbE link between the two boxes and the Spark at 100 TFLOPs FP16, the compute-to-bandwidth ratio is roughly 10,000 FLOPs of compute per bit of link bandwidth. The required prompt length depends on the attention architecture, captured by a constant K that is larger for grouped-query attention models; the sketch after the table below makes the arithmetic concrete.
| Model | Attention type | K | Min context to fully hide transfer (8-bit KV) |
|---|---|---|---|
| Llama-2 7B | MHA | 2 | ~40k tokens |
| Llama-3 8B | GQA | 8 | ~10k tokens |
| Llama-3 70B / Qwen-2.5 72B | GQA | 16 | ~5k tokens |
Below those thresholds the technique still works, but some of the KV transfer becomes visible in TTFT rather than fully overlapped.
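One reading consistent with those thresholds is K = 2 × n_heads / n_kv_heads, with the minimum context at kv_bits × (compute/bandwidth) / K. That definition is an inference from the table, not something EXO publishes here, but it reproduces all three rows:

```python
def min_context_to_hide_transfer(flops, link_bits_per_s, kv_bits, n_heads, n_kv_heads):
    """Smallest prompt length at which per-layer attention compute time covers
    per-layer KV transfer time. K = 2 * n_heads / n_kv_heads is a reading
    inferred from the table above -- treat it as an assumption."""
    k = 2 * n_heads / n_kv_heads
    return kv_bits * (flops / link_bits_per_s) / k

# 100 TFLOPs of Spark prefill over a 10 Gb/s link, 8-bit KV cache:
print(min_context_to_hide_transfer(100e12, 10e9, 8, 32, 32))  # Llama-2 7B (MHA) -> 40,000
print(min_context_to_hide_transfer(100e12, 10e9, 8, 32, 8))   # Llama-3 8B       -> 10,000
print(min_context_to_hide_transfer(100e12, 10e9, 8, 64, 8))   # Llama-3 70B      -> 5,000
```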
Measured results on Llama-3.1 8B
EXO Labs ran Llama-3.1 8B in FP16 with an 8,192-token prompt generating 32 tokens, comparing each box alone against the disaggregated cluster.
| Configuration | Prefill | Generation | Total | Speedup vs Mac-only |
|---|---|---|---|---|
| DGX Spark only | 1.47s | 2.87s | 4.34s | 1.5x |
| M3 Ultra Mac Studio only | 5.57s | 0.85s | 6.42s | 1.0x (baseline) |
| DGX Spark + M3 Ultra | 1.47s | 0.85s | 2.32s | 2.8x |
The combined run keeps the Spark's prefill (3.8x faster than the Mac alone) and the Mac's decode (3.4x faster than the Spark alone). End-to-end latency drops from 6.42s on the Mac Studio to 2.32s across the pair.
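The composition is simple arithmetic: keep each machine's best phase and the totals follow, assuming the KV transfer stays hidden.

```python
# Sanity-check the table: the cluster keeps each machine's best phase, and the
# speedup falls out of the totals (Mac Studio alone as the baseline; assumes
# the KV transfer stays fully overlapped, matching the measured totals).
spark = {"prefill": 1.47, "decode": 2.87}
mac   = {"prefill": 5.57, "decode": 0.85}

combined = min(spark["prefill"], mac["prefill"]) + min(spark["decode"], mac["decode"])
baseline = mac["prefill"] + mac["decode"]
print(f"combined: {combined:.2f}s, speedup: {baseline / combined:.1f}x")  # 2.32s, 2.8x
```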
Note: The headline "4x" figure circulated in coverage refers to the prefill-phase speedup over the Mac Studio in specific configurations. The 2.8x number is the wall-clock end-to-end gain on the 8B/8k workload above.
What EXO 1.0 handles for you
You do not write a schedule by hand. When EXO starts, it discovers every device on the local mesh network, then profiles each one for FP16 throughput, memory bandwidth, memory capacity, and link characteristics. Given a model and that topology, it decides which node prefills, which node decodes, whether to pipeline across layers, and when to stream KV.
The same orchestration adapts if a link slows down or a node's free memory shifts. The cluster's effective ceiling becomes the best hardware for each phase rather than the compromise of any single box.
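Conceptually, the placement decision reduces to something like the sketch below. This is a hypothetical rendering of the policy, not EXO's API; the `Device` fields mirror what the profiler measures.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    fp16_flops: float      # measured FP16 throughput, FLOP/s
    mem_bandwidth: float   # measured memory bandwidth, bytes/s
    free_memory: float     # available memory, bytes

def assign_phases(devices, model_bytes):
    """Hypothetical phase placement: prefill goes to the highest-compute node,
    decode to the highest-bandwidth node, among nodes the model fits on."""
    fits = [d for d in devices if d.free_memory >= model_bytes]
    prefill = max(fits, key=lambda d: d.fp16_flops)
    decode = max(fits, key=lambda d: d.mem_bandwidth)
    return prefill, decode

spark = Device("DGX Spark", 100e12, 273e9, 128e9)
mac = Device("M3 Ultra", 26e12, 819e9, 512e9)
p, d = assign_phases([spark, mac], model_bytes=16e9)  # Llama-3.1 8B in FP16
print(p.name, "->", d.name)  # DGX Spark -> M3 Ultra
```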
When this pairing pays off
- Long prompts, short answers. RAG, document Q&A, and code review hit prefill hard, which is exactly where the Spark's compute earns its keep.
- Mid-size models that fit on one node. 8B to 70B-class models in 4-bit or 8-bit quantization fit comfortably in either machine's memory, which keeps the architecture simple.
- GQA models at 5k+ context. Modern Llama-3, Qwen, and Mistral variants cross the overlap threshold quickly, so the network cost disappears into the pipeline.
The pairing is less interesting for short prompts under a few hundred tokens, where prefill is already trivial and the Mac Studio alone is fine, and for models so large they do not fit on the decode device, where tensor-parallel sharding across multiple Macs (or multiple Sparks) is the better fit.
Hardware and link requirements
| Component | Spec used in the EXO benchmark |
|---|---|
| Prefill node | NVIDIA DGX Spark, 128 GB unified memory, 273 GB/s, ~100 TFLOPs FP16 |
| Decode node | Mac Studio M3 Ultra, 512 GB unified memory, 819 GB/s, ~26 TFLOPs FP16 |
| Interconnect | 10 GbE between the two devices |
| Orchestrator | EXO 1.0 with automatic device discovery and phase placement |
| Test workload | Llama-3.1 8B FP16, 8,192-token prompt, 32-token generation |
A faster link (for example, Thunderbolt-based IP networking or 25/40 GbE) shifts the overlap thresholds down, so even shorter prompts can hide the KV transfer. A slower link pushes them up, and at very low bandwidth the transfer becomes visible in TTFT.
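Because the threshold scales inversely with link bandwidth, the effect is easy to quantify under the same K = 8 reading used earlier (the 40 Gb/s Thunderbolt-class figure is a nominal assumption):

```python
# Thresholds scale as 1/bandwidth: quadrupling the link quarters the minimum
# prompt length. Uses the relation inferred from the earlier table
# (8-bit KV, K = 8 for Llama-3 8B).
KV_BITS, K, SPARK_FLOPS = 8, 8, 100e12
for gbps in (10, 25, 40):
    s_min = KV_BITS * (SPARK_FLOPS / (gbps * 1e9)) / K
    print(f"{gbps} Gb/s link: ~{s_min:,.0f}-token threshold for Llama-3 8B")
# 10 -> 10,000    25 -> 4,000    40 -> 2,500
```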
How to verify the cluster is actually splitting work
Two signals confirm the pipeline is doing what you expect. First, the prefill duration on the cluster should match the Spark-only prefill within noise; if it matches the Mac-only prefill instead, prefill placement is wrong and the orchestrator is not routing it to the Spark. Second, the decode tokens-per-second should match the Mac-only decode rate; if it matches the Spark's slower decode, the KV cache is not landing on the Mac before generation starts.
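A small harness for those two checks might look like the following; the timing dictionaries come from your own single-box and cluster runs, and nothing here is an EXO command.

```python
# Hypothetical verification harness -- the timing values stand in for your own
# measurements of single-box and cluster runs; none of this is an EXO API.
TOLERANCE = 0.15  # 15% noise budget

def close(a, b, tol=TOLERANCE):
    return abs(a - b) / b <= tol

def verify_split(cluster, spark_only, mac_only):
    # Signal 1: cluster prefill should track the Spark, not the Mac.
    assert close(cluster["prefill_s"], spark_only["prefill_s"]), \
        "prefill is not landing on the Spark"
    # Signal 2: cluster decode rate should track the Mac, not the Spark.
    assert close(cluster["decode_tps"], mac_only["decode_tps"]), \
        "KV cache is not reaching the Mac before decode starts"

verify_split(
    cluster={"prefill_s": 1.47, "decode_tps": 37.6},
    spark_only={"prefill_s": 1.47, "decode_tps": 11.1},  # 32 tokens / 2.87s
    mac_only={"prefill_s": 5.57, "decode_tps": 37.6},    # 32 tokens / 0.85s
)
```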
EXO's logs surface per-device phase assignment and per-layer transfer timing, which is the fastest way to confirm both conditions are met before running longer benchmarks.
The broader takeaway is architectural rather than brand-specific. Prefill and decode have different bottlenecks, so heterogeneous clusters that match each phase to the right hardware can outrun any single box at the same price point. The Spark-plus-Mac-Studio pairing is the cleanest current example, but the same disaggregation pattern applies to any mix of high-compute and high-bandwidth devices on a fast enough link.