DGX Station profile: the local AI box everyone wants benchmarked

NVIDIA says DGX Station can support models up to 1 trillion parameters.
The machine has 748GB of coherent memory, but only 252GB of it is HBM. A 1T-class model can fit only under specific assumptions about the model, quant, cache, and runtime. Reading the 748GB number as 748GB of GPU-speed memory would be a very expensive mistake.
We went looking for numbers that clarify the difference.
We have been trying to understand the actual machine underneath the pitch. We have talked to Cornell researchers who had temporary access. We talked to NVIDIA. We inquired about buying one and priced it at about $100,000 per machine. We searched the NVIDIA forums, Reddit, X, LocalMaxxing, model cards, and conference-floor posts. We asked people at the AI Engineer local table for GLM-5.2 numbers because public numbers were thin.
This profile comes from that reporting. We have useful numbers, but no full benchmark suite yet.
The local AI interest itself is part of the story. At AI Engineer, the Local AI room was full of people asking the same memory-residency and throughput questions we were asking.
Photo credit: Ahmad Osman. Original X post: x.com/TheAhmadOsman/status/2072789682254180776.
Does DGX Station let you buy a lot of local model capacity without giving up the memory behavior that makes GPU inference fast?
The machine
NVIDIA's DGX Station page says the machine is powered by GB300 Grace Blackwell Ultra, has 748GB of coherent memory, and supports models up to 1 trillion parameters.
It does not give you 748GB of HBM.
The split that matters is:
| Memory tier | Capacity | Bandwidth | Why it matters |
|---|---|---|---|
| HBM3e | 252GB | 7.1TB/s | fast GPU memory tier |
| LPDDR5X | 496GB | 396GB/s | larger CPU-side memory tier |
| Total coherent memory | 748GB | mixed | addressable pool, not all GPU-speed memory |
| NVLink-C2C | n/a | 900GB/s listed | CPU-GPU link, with real workload caveats |
If your model and KV cache stay inside HBM, the machine is easy to understand. If the active workload crosses into LPDDR5X, the question becomes empirical: how much does prefill, decode, long context, and concurrency change?
Stas Bekman measured NVLink-C2C on DGX Station and reported that it was not behaving like 900GB/s full duplex end-to-end in his bidirectional test. Marketing bandwidth numbers do not automatically describe workload behavior.
Actual runs matter more than spec sheets.
The $100k question
The DGX Station price we got was about $100,000 per machine.
At that price, compare it with other ways to answer the same three questions: what can we run, how does it feel, and how much does it cost? The GLM-5.2-specific version of that comparison is in the GLM 5.2 local hardware requirements post.
For DGX Station, the competing buckets are:
| Alternative | What it buys |
|---|---|
| multi-GPU RTX PRO 6000 rigs | more conventional VRAM capacity, more assembly and operations work |
| cloud inference | no hardware ownership, but recurring per-token bills |
| Mac Studio-style unified memory | lots of addressable memory, much less GPU-style memory bandwidth |
| DGX Spark clusters | cheaper nodes, but different memory and interconnect tradeoffs |
The buying question is broader than "can I fit a big model?" A lot of machines can fit a heavily quantized big model somewhere in memory.
The buying question is: can we serve useful local frontier-ish workloads with enough speed, context, and concurrency to justify a six-figure workstation?
Why people are skeptical
The objection is specific.
This r/LocalLLaMA thread about 4x-8x RTX PRO 6000 systems, especially the comment thread starting here, gets at the objection. Someone asks why not buy DGX Station instead of 4-8 RTX PRO 6000s. The reply is basically: DGX Station does not actually have 748GB of VRAM. It has a 252GB HBM tier plus a larger LPDDR5X tier.
The NVIDIA Developer Forums have the same question in more formal language: can a 1T-class model really be served well on this memory layout, and has anyone seen real tokens/sec benchmarks for the GB300 DGX Station when context or weights move past HBM3e?
The useful standard is what happens to prefill, decode, context, and concurrency when the workload crosses the fast memory tier.
Who has touched one
DGX Station access is still scarce enough that the list of people with credible hands-on signal matters.
From our reporting, the names and groups we kept seeing were: Cornell's group, temporarily; Stas Bekman and Jeff Rasley around Snowflake work; Alex Cheema at Exo with Ahmad Osman and the local AI table around AI Engineer; Andrej Karpathy; and Matthew Berman.
Those names produced different kinds of signal. The public evidence is still coming from a small circle.
These are the public photos we can point at: lab machines, deliveries, show-floor hardware, Exo's local AI setup, and Snowflake's interior shot.
The most useful academic writeup we found came from Kilian Weinberger's group at Cornell, which had early remote access to a DGX Station in February. Three researchers used it on three different workloads: RL fine-tuning, diffusion language-model retrieval, and synthetic data generation.
The article is useful because it reports actual workloads. The RL work used the coherent memory to move from constrained Qwen3-4B runs toward cleaner 4B-7B experiments. The diffusion-LM retrieval work used the local GB300 box to iterate without cluster queues, doubling training batch size and cutting epoch time by about 20% versus their previous shared-cluster setup. The synthetic-data work ran Qwen3-30B-A3B-Instruct in BF16 with vLLM and reported 5.7x the throughput of a single A100 and 2.6x the throughput of a 4xA100 setup.
Stas Bekman and Snowflake published another useful example: post-training Qwen3-32B at 136K sequence length on one DGX Station. That run used ArcticTraining, TiledMLP, Liger-Kernel, BF16 optimizer offload in DeepSpeed, and CPU memory to make a long-sequence SFT recipe fit on a single B300. The writeup also notes the setup work that mattered: stable memory behavior, avoiding fragmentation, and switching the machine into the right coherent-memory mode so CPU memory was available for optimizer-state offload instead of fighting the GPU.
Cornell and Snowflake showed the machine helping real training and research workloads. Serving a 325GB or 500GB-class local frontier artifact on a machine with 252GB of HBM is a different question. They did not settle the local frontier-inference question. We talked to them about the big questions we had, and none of them knew the answer.
So... can it run big models?
The buyer question:
NVIDIA says DGX Station supports models up to 1 trillion parameters. We believe the capacity claim, with all the caveats above. A buyer should not read that as "load the full FP8 weights of any 700B-1T frontier open model and expect it to behave like it all lives in HBM."
The best public GLM-5.2 clue we have is the AI Engineer booth run. We had been asking for any real GLM-5.2 number on DGX Station from people standing next to the machine.
On X, we asked the AI Engineer local table whether anyone could post metrics for the GLM-5.2 4-bit quant running on DGX Station. A couple hours later, 0xSero posted: GLM-5.2-REAP on DGX Station, drumroll...
Rick Blalock posted a short X video of that booth demo. You can get a visual sense of the generation speed.
Video credit: Rick Blalock (@rblalock). Original X post: x.com/rblalock/status/2072786938147586503.
For cloud context, OpenRouter's Claude Opus 4.8 page shows 64 tok/s as the best throughput across providers, and about 55 tok/s for Anthropic in its one-week provider breakdown.
Alec Fong later added that their first stab at running GLM with NVFP4 was about 25 tok/s, before Alex Cheema and 0xSero helped push the run to 60 tok/s. The evidence trail points toward Luke Alonso's GLM-5.2-NVFP4: Ahmad had posted that Luke uploaded a 467GB NVFP4 artifact, then replied to Sentdex that he was working on getting it running. Alec's post does not name the exact artifact, so we are treating that as strong context, not a confirmed model ID.
The model selector appears to show GLM-5.2 REAP 504B, which points to the 0xSero GLM-5.2 REAP 504B GGUF family rather than the full GLM-5.2 FP8/BF16 model. REAP is expert pruning for MoE models, and 0xSero's own Terminal-Bench 2.1 number is lower than the full-model number we have seen.
The model that ran locally was shaped to run locally.
Here are the big-model numbers we have been able to get:
| Model / workload | Number | What we know |
|---|---|---|
| Kimi 2.5, 1.1T | 40-50 tok/s total output across all users | NVIDIA rep number; about 595GB model weights; we still need benchmark conditions |
| Nemotron Ultra, 550B | about 35 tok/s at concurrency 1; scales to 4-5 concurrent users | NVIDIA rep number; useful because it includes a concurrency claim |
| GLM-5.2-REAP 504B | about 60 tok/s | public 0xSero number from AI Engineer; Alec Fong says an earlier GLM NVFP4 attempt was about 25 tok/s; still missing exact quant, prefill, context, memory residency, and concurrency |
These numbers deserve credit because they were hard to come by. They also have different measurement conditions, so the table should not be read as a leaderboard.
For the GLM-5.2 model-sizing side, see the GLM 5.2 local hardware requirements. This profile is about DGX Station as a machine.
Configurations, costs, who ships, and when
NVIDIA lists DGX Station systems from ASUS, Dell, Exxact, Gigabyte, HP, MSI, and Supermicro on the official DGX Station page, and its marketplace links out to buying options for GB300 systems.
The lineup is already messy in the way workstation launches are messy: some pages are configure-and-quote, some are sales conversations, some are regional distributors, and some public photos are special hand-delivered early systems rather than normal orders.
A LocalLLaMA user put the GB300 OEM systems side by side. It is a community visual, not an official sizing chart. Source: r/LocalLLaMA.
Here is what we could pin down as of July 2, 2026:
| Vendor / path | Public source | Price we saw | Lead time / status |
|---|---|---|---|
| NVIDIA official partner list | DGX Station page and NVIDIA Marketplace | n/a | Partners listed: ASUS, Dell, Exxact, Gigabyte, HP, MSI, and Supermicro |
| Supermicro Super AI Station GB300 | Supermicro, plus pi3g | $99,500 excluding VAT in pi3g table | 4-6 weeks / August in pi3g table; we separately heard 4-6 weeks from the Supermicro side |
| Exxact TensorEX / Valence GB300 workstation | Exxact, plus pi3g | Exxact starts at $96,671.30; pi3g table shows $101,000 | Exxact has the clearest public configure path we found; pi3g lists 6-8 weeks / end of August |
| MSI XpertStation WS300 GB300 | pi3g | $107,000 excluding VAT | 10-12 weeks / mid-October |
| Gigabyte W775-V10-L01 | pi3g | $116,500 excluding VAT | 8-10 weeks / end of August or beginning of September |
| ASUS ExpertCenter Pro ET900N G3 | pi3g | TBD | 4-8 weeks / September |
| Dell Pro Max with GB300 | Dell, plus pi3g | TBD | Public Dell page was sales/presale flow; pi3g marked unclear, probably not in July |
| HP ZGX Fury AI Station | pi3g | TBD | TBD |
Privately, we heard from an NVIDIA rep that the first non-special-delivery orders were starting to move this week, with Supermicro expected first. We separately heard 4-6 weeks lead time from the Supermicro side. The public early-delivery photos, though, have mostly been Dell Pro Max with GB300 units.
What would make it real
For DGX Station, a useful benchmark needs to disclose:
| Measurement | Required detail |
|---|---|
| model | exact repo, commit, quant, shard set |
| runtime | llama.cpp, vLLM, SGLang, TensorRT-LLM, or vendor stack |
| memory residency | HBM vs LPDDR5X/system memory, if observable |
| context | prompt tokens, generated tokens, KV cache settings |
| prefill | tokens/sec for prompt processing |
| decode | output tokens/sec |
| concurrency | 1 user, 2 users, 4 users, 5 users, 8+ users |
| quality | benchmark score, loop rate, tool-call reliability, coding task pass rate |
| thermals/power | whether speed changes over a longer session |
At Cyrus, we are going to keep collecting these numbers and publish the ones we can stand behind.
Right now, DGX Station looks like the most serious local frontier box you can plausibly buy as a single machine. It also still has the exact question local AI people care about: what happens when the workload needs more than 252GB of HBM?
That is the benchmark we want. That is the benchmark we're going to get you.
Glossary
| Term | Meaning |
|---|---|
| HBM | High Bandwidth Memory. On DGX Station, this is the 252GB fast GPU memory tier. |
| HBM3e | The generation of HBM used in DGX Station. NVIDIA lists it at 7.1TB/s. |
| LPDDR5X | Lower-power system memory. On DGX Station, this is the 496GB CPU-side memory tier. |
| VRAM | GPU memory. In this debate, people often use it to mean memory with GPU-class bandwidth, not just addressable capacity. |
| Unified memory | Memory shared by CPU and GPU. It can make larger local runs possible, but it is not automatically equivalent to HBM or VRAM. |
| System RAM | CPU-side memory outside the fast GPU memory tier. It can add capacity, but speed depends on what the runtime has to fetch from it. |
| GGUF | A model file format commonly used by llama.cpp for local inference. |
| REAP | Router-weighted Expert Activation Pruning. It scores MoE experts by saliency, roughly gate weight times expert-output norm, over a calibration set. 0xSero's GLM-5.2 504B report says it keeps 168 of 256 routed experts per layer. |
| BF16 | 16-bit brain floating point weights. Larger than 4-bit or 3-bit quantized weights. |
| Quant | A quantized model artifact, usually lower-bit than BF16. Q4_K_XL, Q3_K_XL, and Q2_K_XL are examples from the GGUF card. |
| NVFP4 | NVIDIA 4-bit floating point format used in some GLM-5.2 local reports. |
| AWQ | Activation-aware weight quantization, a quantization method used in some optimized serving recipes. |
| MTP | Multi-token prediction. It can improve output speed when the model and runtime support it well. |
| KV cache | Runtime memory used to store attention keys and values as context grows. It is separate from model weights. |
| Prefill | Prompt-processing speed before the model starts generating output tokens. |
| Decode | Output-token generation speed after prefill. |
| tok/s | Tokens per second. In this post, it usually means output tokens per second unless stated otherwise. |
| Concurrency | How many users or requests the machine serves at the same time. A one-user tok/s number does not settle this. |

Break free from the terminal
As your Claude Code powered Linear agent, Cyrus is capable of accomplishing whatever large or small issues you throw at it. Get PMs, designers and the CEO shipping product.