GLM 5.2 local hardware requirements: reported paths by quant

GLM 5.2 local hardware requirements: reported paths by quant
Connor Turland·July 02, 2026·

Most "GLM-5.2 local" discussion mixes together different things.

We should separate them before talking about requirements.

  1. the full zai-org/GLM-5.2 model
  2. low-bit GGUF builds of the full model
  3. pruned derivatives like 0xSero/GLM-5.2-504B
  4. GGUF builds of that pruned derivative, like 0xSero/GLM-5.2-REAP-504B-GGUF

Those are not the same hardware problem.

At Cyrus, we care about this because coding agents are not short chat demos. They keep repository context around, call tools, recover from mistakes, and spend a lot of time in long prompts. For that workload, "the model loaded" is not the same as "the model is usable."

Reported hardware paths

If someone says "GLM-5.2 runs locally," the next question is: which artifact, which quant, and where are the weights actually sitting?

These are the GLM-5.2-family reports we have been able to tie to specific hardware. They are not all the same benchmark.

Hardware Model Quantization Technical identifier Size / memory clue Reported feel Source
DGX Station booth demo running GLM-5.2 REAP DGX Station GLM-5.2-REAP 504B unknown GLM-5.2-REAP-504B-GGUF; exact quant not confirmed if Q4_K_XL, card lists about 325-331GB; if smaller, conclusion changes about 60 tok/s; no prefill, context, or concurrency yet 0xSero on X
Four DGX Spark units arranged together 4x DGX Spark / GB10 full GLM-5.2 4-bit UD-IQ4_XS about 365GB across nodes 6.28 tok/s decode at C=1 NVIDIA forum recipe
4x GB10 full GLM-5.2 with routed-expert prune 4-bit AWQ INT4 + pruning + MTP custom optimized recipe about 20-22 tok/s decode NVIDIA forum recipe
4x DGX Spark full GLM-5.2, not REAP 4-bit NVFP4 hybrid patched vLLM, tight memory assumptions 14.5-15.2 tok/s decode, 450-512 tok/s prefill NVIDIA forum recipe
one or two M3 Ultra 512GB systems full GLM-5.2 4-bit NVFP4 512GB unified memory system(s) 18.8 tok/s on one, 23.4 tok/s on two; basic decode test Ivan Fioravanti on X
2x RTX PRO 6000 Blackwell plus 1TB DDR5 full GLM-5.2 4-bit UD-Q4_K_XL 1TB DDR5 in the system; not all weights live in VRAM 13-15 tok/s decode, 64 tok/s prefill Samuel Cardillo on X
4x RTX 3090 plus 192GB DDR5 full GLM-5.2 2-bit UD-IQ2_M 223GB on disk; 96GB VRAM plus host RAM about 7.3 tok/s decode, about 135 tok/s prefill Reddit report
2x RTX 5090 plus 512GB DDR5 ECC full GLM-5.2 5-bit UD-Q5_K_S 492GB weights; high-RAM workstation about 12 tok/s Reddit report
Dell PowerEdge R740, dual Xeon, 768GB RAM full GLM-5.2 2-bit UD-Q2-K_XL CPU-only, 768GB RAM 4-5.5 tok/s basic chat, about 3 tok/s in opencode Reddit report

The pattern is clear. Low-bit artifacts can run on varied hardware. Higher-quality artifacts require either a lot of GPU memory, a lot of unified memory, or a willingness to put host RAM in the hot path.

Cost read

The practical questions are: what can we run, how does it feel, and what does it cost to get there?

This is not a shopping guide. Prices move, used hardware is messy, and several of these runs depend on custom recipes. But the cost shape matters because "GLM-5.2 locally" can mean anything from a slow CPU experiment to a six-figure DGX Station.

PathRough hardware cost signalWhat can runHow it feels from reports
CPU-only server with 768GB RAMcheap if already owned; not a great reason to buy a server from scratchlow-bit full GLM-5.2possible, but 3-5.5 tok/s is a patience test
4x used RTX 3090 plus host RAMusually the lowest-cost GPU path in these reports2-bit full GLM-5.2 through offloadabout 7.3 tok/s decode; interesting, not luxurious
2x RTX 5090 plus 512GB RAMhigh-end consumer workstation pricing, very market-dependent5-bit full GLM-5.2 with host RAM in the storyabout 12 tok/s; capacity is doing a lot of work
2x RTX PRO 6000 Blackwell plus 1TB RAMNVIDIA marketplace listed the RTX PRO 6000 Blackwell Workstation Edition at $13,250 per GPU when we checked4-bit full GLM-5.213-15 tok/s decode, 64 tok/s prefill
4x DGX Spark / GB10NVIDIA marketplace listed DGX Spark at $4,699 when we checked, so four nodes are about $18.8k before anything elsefull GLM-5.2 4-bit-class paths, depending recipe6.28 tok/s on one reproducible llama.cpp path; higher with more custom recipes
M3 Ultra 512GB systemsa reported run, but not a clean current-buying recipefull GLM-5.2 NVFP418.8 tok/s on one, 23.4 on two in a basic decode test
DGX Stationabout $100k from our inquiry; some public partner listings have been in the same neighborhoodGLM-5.2-REAP 504B shown at AI Engineer; larger local frontier workloads are the questionabout 60 tok/s for the reported GLM-5.2-REAP run, but missing exact quant, context, prefill, and concurrency

That table is why the DGX Station question is not just "is it expensive?" It is whether the machine can turn a much larger coherent-memory box into a better local-agent experience than the cheaper paths above. We split the DGX Station memory profile into its own post because that question depends on the 252GB HBM tier, not just the 748GB headline.

Artifact size and capacity matrix

This is the closest thing to a GLM-5.2 requirements table, but it still needs caveats.

Artifact classSize / memory clueReported hardware that ran itCapacity read
Full GLM-5.2 1-bit classabout 202-223GB in reports2x DGX Spark, 2x M5 Max256GB+ memory can be enough, but quality is low-bit
Full GLM-5.2 2-bit classabout 223-245GB in reports4x3090 plus RAM, 5090+3090 plus 256GB RAM, CPU-only 768GB RAMthis is where local runs become common, but not necessarily pleasant
Full GLM-5.2 4-bit / NVFP4 / AWQ classroughly 365-467GB depending artifact and compression4x DGX Spark, 2x M3 Ultra, 2x RTX PRO 6000 plus 1TB RAMthis is the serious local hardware tier
Full GLM-5.2 5-bit classabout 492-570GB2x RTX 5090 plus 512GB RAM for one reportcapacity is possible; speed depends on how much leaves fast memory
Full GLM-5.2 8-bit classabout 810GBno clean local serving report in this passoutside most single-workstation setups
REAP 504B Q4_K_XLabout 325-331GBDGX Station report may be this class, but exact quant is not confirmedbigger than 252GB of DGX Station HBM before KV cache/runtime overhead
REAP 504B Q3_K_XLabout 259GBno clean measured source in this passbarely above 252GB before overhead, so raw size math is not enough
REAP 504B Q2_K_XLabout 111GBLocalMaxxing tracks REAP GGUF and reports 7.9 tok/s top speed, but we still need the exact run tablemuch easier to fit, much less useful as evidence for 4-bit-quality behavior

The important part is not that the table has big numbers. The important part is that "GLM-5.2 local requirements" changes depending on which row you mean.

We also have two adoption and benchmark context points that are worth keeping separate from hardware requirements: 0xSero said the REAP work reached 15,000 downloads in 10 days and that Zai highlighted it at AI Engineer, and the REAP 504B model card reports 70.5% on Terminal-Bench 2.1 full-89. Those are not throughput numbers.

Sentdex has a practical segment starting around 23:16, and we are treating it as one case study rather than the anchor for the article. The parts worth pulling forward are narrow: he walks through the 8-bit memory math around 26:23, says Q4_K_XL is the tier he would want if it fits around 27:24, and says the Q3_K_XL tier he tried did not feel like the same class of model around 27:57.

What GLM 5.2 requirements actually mean

There are two separate requirements.

First: enough memory to load the weights.

Second: enough fast memory and runtime headroom to serve the workload.

The second requirement is the one that gets lost. A 245GB 2-bit full-model GGUF and a 325GB Q4_K_XL REAP GGUF are both "local GLM-5.2" to a search engine. They are not the same thing to a machine.

For coding agents, we will measure at least:

MetricWhy we care
model artifactfull 753B, REAP 504B, NVFP4, GGUF, 1-bit, 2-bit, 4-bit
memory placementHBM/VRAM vs LPDDR/system RAM
prefill tokens/secrepo-sized prompts spend time here
decode tokens/secvisible output speed
context length4K, 64K, 256K, and 1M are different tests
concurrencyone user and five users can expose different bottlenecks
KV cache policylong sessions depend on this
qualitybenchmark score, loop rate, tool reliability, coding pass rate

We will bring more of these numbers as we collect them. The hard part is not running one prompt. The hard part is getting comparable runs with the exact model artifact, runtime, quant, context, cache settings, and hardware disclosed.

How to run GLM 5.2 locally

If you want the full model name, start with a full-model GGUF from the Unsloth path and choose the quant tier that fits your memory budget. The public memory table says 223GB for 1-bit, 245GB for 2-bit, 372-475GB for 4-bit, and 570GB for 5-bit.

If you want the 0xSero REAP 504B GGUF path, the Hugging Face model card gives two llama.cpp shapes.

If you install through the llama.app installer shown on the model card, the command uses the unified llama launcher:

llama serve -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL

or:

llama cli -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL

If you install llama.cpp from a prebuilt binary, source build, or Homebrew, the executable names are usually explicit:

llama-server -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL

or:

llama-cli -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL

We verified locally that llama-cli and llama-server both accept -hf as the Hugging Face repo argument. The command does not make Q4_K_XL fit. It only names the artifact. The listed size is about 325GB before KV cache and runtime overhead.

What we would not claim yet

We would not claim that REAP 504B equals full GLM-5.2. The 70.5% Terminal-Bench 2.1 full-89 number is lower than the full-model numbers we have seen, and the harnesses are not identical.

We would not claim that a 2-bit run tells you what a 4-bit run will feel like.

We would not claim that total memory is enough. That is why the DGX Station memory question matters: DGX Station has 748GB coherent memory, but only 252GB is HBM3e.

Glossary

Term Meaning
GGUF A model file format commonly used by llama.cpp for local inference.
REAP Router-weighted Expert Activation Pruning. It scores MoE experts by saliency, roughly gate weight times expert-output norm, over a calibration set. 0xSero's GLM-5.2 504B report says it keeps 168 of 256 routed experts per layer.
Quant A quantized model artifact. Lower-bit quants use less memory, but quality and runtime behavior can change.
BPW Bits per weight. A rough way to describe how compressed the model weights are.
BF16 16-bit brain floating point weights. Larger than 4-bit, 3-bit, or 2-bit quantized artifacts.
NVFP4 NVIDIA 4-bit floating point format used in some GLM-5.2 local reports.
AWQ Activation-aware weight quantization, a quantization approach used in some optimized serving recipes.
MTP Multi-token prediction. A speculative decoding technique that can improve output speed when it works well.
MLA Multi-head latent attention. It changes memory and cache behavior compared with more conventional attention layouts.
DSA DeepSeek sparse attention, the attention architecture referenced in several GLM-5.2 local serving recipes.
KV cache Runtime memory used to store attention keys and values as context grows. It is separate from model weights.
Prefill Prompt-processing speed before the model starts generating output tokens.
Decode Output-token generation speed after prefill.
tok/s Tokens per second. Check whether a source means decode speed, prefill speed, or aggregate output across users.
TTFT Time to first token. Long prompts can make this matter as much as decode speed.
VRAM GPU memory. For large local models, capacity and bandwidth both matter.
HBM High Bandwidth Memory. This is the fast memory tier people care about on DGX Station and datacenter GPUs.
HBM3e A generation of High Bandwidth Memory. DGX Station's fast memory tier is HBM3e.
LPDDR Lower-power system memory. It can add capacity, but it is not the same thing as HBM or GPU VRAM.
Unified memory Memory shared by CPU and GPU, as on Apple Silicon systems and some NVIDIA systems. It is not automatically equivalent to high-bandwidth VRAM.
Host RAM System memory outside the GPU. Offloading model weights here can make a run possible, but often changes speed.
Concurrency How many requests or users are served at once. A single-user tok/s number does not predict multi-user behavior.
Team of engineers preparing a rocket for launch - Ship your next feature with Cyrus
Trusted by product teams at Retool, Gamma, TinyFish, and more

Break free from the terminal

As your Claude Code powered Linear agent, Cyrus is capable of accomplishing whatever large or small issues you throw at it. Get PMs, designers and the CEO shipping product.