DGX Station profile: the local AI box everyone wants benchmarked

Connor Turland·July 02, 2026·

NVIDIA says DGX Station can support models up to 1 trillion parameters.

The machine has 748GB of coherent memory, but only 252GB of it is HBM. A 1T-class model can fit only under specific assumptions about the model, quant, cache, and runtime. Reading the 748GB number as 748GB of GPU-speed memory would be a very expensive mistake.

We went looking for numbers that clarify the difference.

We have been trying to understand the actual machine underneath the pitch. We have talked to Cornell researchers who had temporary access. We talked to NVIDIA. We inquired about buying one and priced it at about $100,000 per machine. We searched the NVIDIA forums, Reddit, X, LocalMaxxing, model cards, and conference-floor posts. We asked people at the AI Engineer local table for GLM-5.2 numbers because public numbers were thin.

This profile comes from that reporting. We have useful numbers, but no full benchmark suite yet.

The local AI interest itself is part of the story. At AI Engineer, the Local AI room was full of people asking the same memory-residency and throughput questions we were asking.

A full Local AI room at AI Engineer World's Fair

Photo credit: Ahmad Osman. Original X post: x.com/TheAhmadOsman/status/2072789682254180776.

Does DGX Station let you buy a lot of local model capacity without giving up the memory behavior that makes GPU inference fast?

The machine

NVIDIA's DGX Station page says the machine is powered by GB300 Grace Blackwell Ultra, has 748GB of coherent memory, and supports models up to 1 trillion parameters.

It does not give you 748GB of HBM.

The split that matters is:

Memory tier	Capacity	Bandwidth	Why it matters
HBM3e	252GB	7.1TB/s	fast GPU memory tier
LPDDR5X	496GB	396GB/s	larger CPU-side memory tier
Total coherent memory	748GB	mixed	addressable pool, not all GPU-speed memory
NVLink-C2C	n/a	900GB/s listed	CPU-GPU link, with real workload caveats

If your model and KV cache stay inside HBM, the machine is easy to understand. If the active workload crosses into LPDDR5X, the question becomes empirical: how much does prefill, decode, long context, and concurrency change?

Stas Bekman measured NVLink-C2C on DGX Station and reported that it was not behaving like 900GB/s full duplex end-to-end in his bidirectional test. Marketing bandwidth numbers do not automatically describe workload behavior.

Actual runs matter more than spec sheets.

The $100k question

The DGX Station price we got was about $100,000 per machine.

At that price, compare it with other ways to answer the same three questions: what can we run, how does it feel, and how much does it cost? The GLM-5.2-specific version of that comparison is in the GLM 5.2 local hardware requirements post.

For DGX Station, the competing buckets are:

Alternative	What it buys
multi-GPU RTX PRO 6000 rigs	more conventional VRAM capacity, more assembly and operations work
cloud inference	no hardware ownership, but recurring per-token bills
Mac Studio-style unified memory	lots of addressable memory, much less GPU-style memory bandwidth
DGX Spark clusters	cheaper nodes, but different memory and interconnect tradeoffs

The buying question is broader than "can I fit a big model?" A lot of machines can fit a heavily quantized big model somewhere in memory.

The buying question is: can we serve useful local frontier-ish workloads with enough speed, context, and concurrency to justify a six-figure workstation?

Why people are skeptical

The objection is specific.

This r/LocalLLaMA thread about 4x-8x RTX PRO 6000 systems, especially the comment thread starting here, gets at the objection. Someone asks why not buy DGX Station instead of 4-8 RTX PRO 6000s. The reply is basically: DGX Station does not actually have 748GB of VRAM. It has a 252GB HBM tier plus a larger LPDDR5X tier.

The NVIDIA Developer Forums have the same question in more formal language: can a 1T-class model really be served well on this memory layout, and has anyone seen real tokens/sec benchmarks for the GB300 DGX Station when context or weights move past HBM3e?

The useful standard is what happens to prefill, decode, context, and concurrency when the workload crosses the fast memory tier.

Who has touched one

DGX Station access is still scarce enough that the list of people with credible hands-on signal matters.

From our reporting, the names and groups we kept seeing were: Cornell's group, temporarily; Stas Bekman and Jeff Rasley around Snowflake work; Alex Cheema at Exo with Ahmad Osman and the local AI table around AI Engineer; Andrej Karpathy; and Matthew Berman.

Those names produced different kinds of signal. The public evidence is still coming from a small circle.

These are the public photos we can point at: lab machines, deliveries, show-floor hardware, Exo's local AI setup, and Snowflake's interior shot.

Jensen Huang and Andrej Karpathy with the first DGX Station GB300, described by NVIDIA as a Dell Pro Max with GB300 — **The man, the legend, Andrej.** NVIDIA AI Developer described Karpathy's unit as the first DGX Station GB300, a Dell Pro Max with GB300, personally signed by Jensen Huang. Photo credit: NVIDIA AI Developer. Original X post.

Nader Khalil's photo of the first DGX Station online event, with the GB300 workstation open in the room — **First DGX Station online.** Nader Khalil posted this around the first DGX Station coming online. Photo credit: Nader Khalil. Original X post.

Matthew Berman receiving a pre-production Dell Pro Max with GB300 that he described as a 100 pound system with 750GB plus unified memory — **Matthew Berman delivery.** NVIDIA hand-delivered a pre-production Dell Pro Max with GB300; Berman described it as a 100 lb system with 750GB+ unified memory. Video thumbnail credit: Matthew Berman. Original X post.

Interior of Snowflake's delivered Dell Pro Max with GB300, posted by Jeff Rasley in a DGX Station thread with Stas Bekman and NVIDIA — **Snowflake's Dell Pro Max with GB300.** Jeff Rasley's photo shows the interior of the machine Snowflake had delivered. Photo credit: Jeff Rasley. Original X post.

Alex Cheema with local AI hardware from NVIDIA, in a post listing 16 DGX Spark, 3 RTX Spark, 1 DGX Station, ConnectX-7 cables, and switches — **Exo local AI hardware.** Alex Cheema's post lists 16 DGX Spark, 3 RTX Spark, 1 DGX Station, ConnectX-7 cables, and two high-speed switches. Photo credit: Alex Cheema. Original X post.

DGX Station hardware at the AI Engineer local AI table, where the GLM-5.2 REAP booth number surfaced — **AI Engineer local AI table.** This is the table where the GLM-5.2 REAP DGX Station number surfaced. Photo credit: Chris Alexiuk. Original X post.

The most useful academic writeup we found came from Kilian Weinberger's group at Cornell, which had early remote access to a DGX Station in February. Three researchers used it on three different workloads: RL fine-tuning, diffusion language-model retrieval, and synthetic data generation.

The article is useful because it reports actual workloads. The RL work used the coherent memory to move from constrained Qwen3-4B runs toward cleaner 4B-7B experiments. The diffusion-LM retrieval work used the local GB300 box to iterate without cluster queues, doubling training batch size and cutting epoch time by about 20% versus their previous shared-cluster setup. The synthetic-data work ran Qwen3-30B-A3B-Instruct in BF16 with vLLM and reported 5.7x the throughput of a single A100 and 2.6x the throughput of a 4xA100 setup.

Stas Bekman and Snowflake published another useful example: post-training Qwen3-32B at 136K sequence length on one DGX Station. That run used ArcticTraining, TiledMLP, Liger-Kernel, BF16 optimizer offload in DeepSpeed, and CPU memory to make a long-sequence SFT recipe fit on a single B300. The writeup also notes the setup work that mattered: stable memory behavior, avoiding fragmentation, and switching the machine into the right coherent-memory mode so CPU memory was available for optimizer-state offload instead of fighting the GPU.

Cornell and Snowflake showed the machine helping real training and research workloads. Serving a 325GB or 500GB-class local frontier artifact on a machine with 252GB of HBM is a different question. They did not settle the local frontier-inference question. We talked to them about the big questions we had, and none of them knew the answer.

So... can it run big models?

The buyer question:

NVIDIA says DGX Station supports models up to 1 trillion parameters. We believe the capacity claim, with all the caveats above. A buyer should not read that as "load the full FP8 weights of any 700B-1T frontier open model and expect it to behave like it all lives in HBM."

The best public GLM-5.2 clue we have is the AI Engineer booth run. We had been asking for any real GLM-5.2 number on DGX Station from people standing next to the machine.

On X, we asked the AI Engineer local table whether anyone could post metrics for the GLM-5.2 4-bit quant running on DGX Station. A couple hours later, 0xSero posted: GLM-5.2-REAP on DGX Station, drumroll...

60 tokens/second

Rick Blalock posted a short X video of that booth demo. You can get a visual sense of the generation speed.

Video credit: Rick Blalock (@rblalock). Original X post: x.com/rblalock/status/2072786938147586503.

For cloud context, OpenRouter's Claude Opus 4.8 page shows 64 tok/s as the best throughput across providers, and about 55 tok/s for Anthropic in its one-week provider breakdown.

Alec Fong later added that their first stab at running GLM with NVFP4 was about 25 tok/s, before Alex Cheema and 0xSero helped push the run to 60 tok/s. The evidence trail points toward Luke Alonso's GLM-5.2-NVFP4: Ahmad had posted that Luke uploaded a 467GB NVFP4 artifact, then replied to Sentdex that he was working on getting it running. Alec's post does not name the exact artifact, so we are treating that as strong context, not a confirmed model ID.

The model selector appears to show GLM-5.2 REAP 504B, which points to the 0xSero GLM-5.2 REAP 504B GGUF family rather than the full GLM-5.2 FP8/BF16 model. REAP is expert pruning for MoE models, and 0xSero's own Terminal-Bench 2.1 number is lower than the full-model number we have seen.

The model that ran locally was shaped to run locally.

Here are the big-model numbers we have been able to get:

Model / workload	Number	What we know
Kimi 2.5, 1.1T	40-50 tok/s total output across all users	NVIDIA rep number; about 595GB model weights; we still need benchmark conditions
Nemotron Ultra, 550B	about 35 tok/s at concurrency 1; scales to 4-5 concurrent users	NVIDIA rep number; useful because it includes a concurrency claim
GLM-5.2-REAP 504B	about 60 tok/s	public 0xSero number from AI Engineer; Alec Fong says an earlier GLM NVFP4 attempt was about 25 tok/s; still missing exact quant, prefill, context, memory residency, and concurrency

These numbers deserve credit because they were hard to come by. They also have different measurement conditions, so the table should not be read as a leaderboard.

For the GLM-5.2 model-sizing side, see the GLM 5.2 local hardware requirements. This profile is about DGX Station as a machine.

Configurations, costs, who ships, and when

NVIDIA lists DGX Station systems from ASUS, Dell, Exxact, Gigabyte, HP, MSI, and Supermicro on the official DGX Station page, and its marketplace links out to buying options for GB300 systems.

The lineup is already messy in the way workstation launches are messy: some pages are configure-and-quote, some are sales conversations, some are regional distributors, and some public photos are special hand-delivered early systems rather than normal orders.

Reddit image comparing DGX Station GB300 OEM systems side by side at roughly actual size

A LocalLLaMA user put the GB300 OEM systems side by side. It is a community visual, not an official sizing chart. Source: r/LocalLLaMA.

Here is what we could pin down as of July 2, 2026:

Vendor / path	Public source	Price we saw	Lead time / status
NVIDIA official partner list	DGX Station page and NVIDIA Marketplace	n/a	Partners listed: ASUS, Dell, Exxact, Gigabyte, HP, MSI, and Supermicro
Supermicro Super AI Station GB300	Supermicro, plus pi3g	$99,500 excluding VAT in pi3g table	4-6 weeks / August in pi3g table; we separately heard 4-6 weeks from the Supermicro side
Exxact TensorEX / Valence GB300 workstation	Exxact, plus pi3g	Exxact starts at $96,671.30; pi3g table shows $101,000	Exxact has the clearest public configure path we found; pi3g lists 6-8 weeks / end of August
MSI XpertStation WS300 GB300	pi3g	$107,000 excluding VAT	10-12 weeks / mid-October
Gigabyte W775-V10-L01	pi3g	$116,500 excluding VAT	8-10 weeks / end of August or beginning of September
ASUS ExpertCenter Pro ET900N G3	pi3g	TBD	4-8 weeks / September
Dell Pro Max with GB300	Dell, plus pi3g	TBD	Public Dell page was sales/presale flow; pi3g marked unclear, probably not in July
HP ZGX Fury AI Station	pi3g	TBD	TBD

Privately, we heard from an NVIDIA rep that the first non-special-delivery orders were starting to move this week, with Supermicro expected first. We separately heard 4-6 weeks lead time from the Supermicro side. The public early-delivery photos, though, have mostly been Dell Pro Max with GB300 units.

What would make it real

For DGX Station, a useful benchmark needs to disclose:

Measurement	Required detail
model	exact repo, commit, quant, shard set
runtime	llama.cpp, vLLM, SGLang, TensorRT-LLM, or vendor stack
memory residency	HBM vs LPDDR5X/system memory, if observable
context	prompt tokens, generated tokens, KV cache settings
prefill	tokens/sec for prompt processing
decode	output tokens/sec
concurrency	1 user, 2 users, 4 users, 5 users, 8+ users
quality	benchmark score, loop rate, tool-call reliability, coding task pass rate
thermals/power	whether speed changes over a longer session

At Cyrus, we are going to keep collecting these numbers and publish the ones we can stand behind.

Right now, DGX Station looks like the most serious local frontier box you can plausibly buy as a single machine. It also still has the exact question local AI people care about: what happens when the workload needs more than 252GB of HBM?

That is the benchmark we want. That is the benchmark we're going to get you.

Glossary

Term	Meaning
HBM	High Bandwidth Memory. On DGX Station, this is the 252GB fast GPU memory tier.
HBM3e	The generation of HBM used in DGX Station. NVIDIA lists it at 7.1TB/s.
LPDDR5X	Lower-power system memory. On DGX Station, this is the 496GB CPU-side memory tier.
VRAM	GPU memory. In this debate, people often use it to mean memory with GPU-class bandwidth, not just addressable capacity.
Unified memory	Memory shared by CPU and GPU. It can make larger local runs possible, but it is not automatically equivalent to HBM or VRAM.
System RAM	CPU-side memory outside the fast GPU memory tier. It can add capacity, but speed depends on what the runtime has to fetch from it.
GGUF	A model file format commonly used by llama.cpp for local inference.
REAP	Router-weighted Expert Activation Pruning. It scores MoE experts by saliency, roughly gate weight times expert-output norm, over a calibration set. 0xSero's GLM-5.2 504B report says it keeps 168 of 256 routed experts per layer.
BF16	16-bit brain floating point weights. Larger than 4-bit or 3-bit quantized weights.
Quant	A quantized model artifact, usually lower-bit than BF16. Q4_K_XL, Q3_K_XL, and Q2_K_XL are examples from the GGUF card.
NVFP4	NVIDIA 4-bit floating point format used in some GLM-5.2 local reports.
AWQ	Activation-aware weight quantization, a quantization method used in some optimized serving recipes.
MTP	Multi-token prediction. It can improve output speed when the model and runtime support it well.
KV cache	Runtime memory used to store attention keys and values as context grows. It is separate from model weights.
Prefill	Prompt-processing speed before the model starts generating output tokens.
Decode	Output-token generation speed after prefill.
tok/s	Tokens per second. In this post, it usually means output tokens per second unless stated otherwise.
Concurrency	How many users or requests the machine serves at the same time. A one-user tok/s number does not settle this.

Team of engineers preparing a rocket for launch - Ship your next feature with Cyrus

Trusted by product teams at Retool, Gamma, TinyFish, and more

Break free from the terminal

As your Claude Code powered Linear agent, Cyrus is capable of accomplishing whatever large or small issues you throw at it. Get PMs, designers and the CEO shipping product.

Start shipping 20x faster