GLM 5.2 local hardware requirements: reported paths by quant

Connor Turland·July 02, 2026·

Most "GLM-5.2 local" discussion mixes together different things.

We should separate them before talking about requirements.

the full zai-org/GLM-5.2 model
low-bit GGUF builds of the full model
pruned derivatives like 0xSero/GLM-5.2-504B
GGUF builds of that pruned derivative, like 0xSero/GLM-5.2-REAP-504B-GGUF

Those are not the same hardware problem.

At Cyrus, we care about this because coding agents are not short chat demos. They keep repository context around, call tools, recover from mistakes, and spend a lot of time in long prompts. For that workload, "the model loaded" is not the same as "the model is usable."

Reported hardware paths

If someone says "GLM-5.2 runs locally," the next question is: which artifact, which quant, and where are the weights actually sitting?

These are the GLM-5.2-family reports we have been able to tie to specific hardware. They are not all the same benchmark.

Hardware	Model	Quantization	Technical identifier	Size / memory clue	Reported feel	Source
DGX Station	GLM-5.2-REAP 504B	unknown	`GLM-5.2-REAP-504B-GGUF`; exact quant not confirmed	if `Q4_K_XL`, card lists about 325-331GB; if smaller, conclusion changes	about 60 tok/s; no prefill, context, or concurrency yet	0xSero on X
4x DGX Spark / GB10	full GLM-5.2	4-bit	`UD-IQ4_XS`	about 365GB across nodes	6.28 tok/s decode at C=1	NVIDIA forum recipe
4x GB10	full GLM-5.2 with routed-expert prune	4-bit	AWQ INT4 + pruning + MTP	custom optimized recipe	about 20-22 tok/s decode	NVIDIA forum recipe
4x DGX Spark	full GLM-5.2, not REAP	4-bit	NVFP4 hybrid	patched vLLM, tight memory assumptions	14.5-15.2 tok/s decode, 450-512 tok/s prefill	NVIDIA forum recipe
one or two M3 Ultra 512GB systems	full GLM-5.2	4-bit	NVFP4	512GB unified memory system(s)	18.8 tok/s on one, 23.4 tok/s on two; basic decode test	Ivan Fioravanti on X
2x RTX PRO 6000 Blackwell plus 1TB DDR5	full GLM-5.2	4-bit	`UD-Q4_K_XL`	1TB DDR5 in the system; not all weights live in VRAM	13-15 tok/s decode, 64 tok/s prefill	Samuel Cardillo on X
4x RTX 3090 plus 192GB DDR5	full GLM-5.2	2-bit	`UD-IQ2_M`	223GB on disk; 96GB VRAM plus host RAM	about 7.3 tok/s decode, about 135 tok/s prefill	Reddit report
2x RTX 5090 plus 512GB DDR5 ECC	full GLM-5.2	5-bit	`UD-Q5_K_S`	492GB weights; high-RAM workstation	about 12 tok/s	Reddit report
Dell PowerEdge R740, dual Xeon, 768GB RAM	full GLM-5.2	2-bit	`UD-Q2-K_XL`	CPU-only, 768GB RAM	4-5.5 tok/s basic chat, about 3 tok/s in opencode	Reddit report

The pattern is clear. Low-bit artifacts can run on varied hardware. Higher-quality artifacts require either a lot of GPU memory, a lot of unified memory, or a willingness to put host RAM in the hot path.

Cost read

The practical questions are: what can we run, how does it feel, and what does it cost to get there?

This is not a shopping guide. Prices move, used hardware is messy, and several of these runs depend on custom recipes. But the cost shape matters because "GLM-5.2 locally" can mean anything from a slow CPU experiment to a six-figure DGX Station.

Path	Rough hardware cost signal	What can run	How it feels from reports
CPU-only server with 768GB RAM	cheap if already owned; not a great reason to buy a server from scratch	low-bit full GLM-5.2	possible, but 3-5.5 tok/s is a patience test
4x used RTX 3090 plus host RAM	usually the lowest-cost GPU path in these reports	2-bit full GLM-5.2 through offload	about 7.3 tok/s decode; interesting, not luxurious
2x RTX 5090 plus 512GB RAM	high-end consumer workstation pricing, very market-dependent	5-bit full GLM-5.2 with host RAM in the story	about 12 tok/s; capacity is doing a lot of work
2x RTX PRO 6000 Blackwell plus 1TB RAM	NVIDIA marketplace listed the RTX PRO 6000 Blackwell Workstation Edition at $13,250 per GPU when we checked	4-bit full GLM-5.2	13-15 tok/s decode, 64 tok/s prefill
4x DGX Spark / GB10	NVIDIA marketplace listed DGX Spark at $4,699 when we checked, so four nodes are about $18.8k before anything else	full GLM-5.2 4-bit-class paths, depending recipe	6.28 tok/s on one reproducible llama.cpp path; higher with more custom recipes
M3 Ultra 512GB systems	a reported run, but not a clean current-buying recipe	full GLM-5.2 NVFP4	18.8 tok/s on one, 23.4 on two in a basic decode test
DGX Station	about $100k from our inquiry; some public partner listings have been in the same neighborhood	GLM-5.2-REAP 504B shown at AI Engineer; larger local frontier workloads are the question	about 60 tok/s for the reported GLM-5.2-REAP run, but missing exact quant, context, prefill, and concurrency

That table is why the DGX Station question is not just "is it expensive?" It is whether the machine can turn a much larger coherent-memory box into a better local-agent experience than the cheaper paths above. We split the DGX Station memory profile into its own post because that question depends on the 252GB HBM tier, not just the 748GB headline.

Artifact size and capacity matrix

This is the closest thing to a GLM-5.2 requirements table, but it still needs caveats.

Artifact class	Size / memory clue	Reported hardware that ran it	Capacity read
Full GLM-5.2 1-bit class	about 202-223GB in reports	2x DGX Spark, 2x M5 Max	256GB+ memory can be enough, but quality is low-bit
Full GLM-5.2 2-bit class	about 223-245GB in reports	4x3090 plus RAM, 5090+3090 plus 256GB RAM, CPU-only 768GB RAM	this is where local runs become common, but not necessarily pleasant
Full GLM-5.2 4-bit / NVFP4 / AWQ class	roughly 365-467GB depending artifact and compression	4x DGX Spark, 2x M3 Ultra, 2x RTX PRO 6000 plus 1TB RAM	this is the serious local hardware tier
Full GLM-5.2 5-bit class	about 492-570GB	2x RTX 5090 plus 512GB RAM for one report	capacity is possible; speed depends on how much leaves fast memory
Full GLM-5.2 8-bit class	about 810GB	no clean local serving report in this pass	outside most single-workstation setups
REAP 504B `Q4_K_XL`	about 325-331GB	DGX Station report may be this class, but exact quant is not confirmed	bigger than 252GB of DGX Station HBM before KV cache/runtime overhead
REAP 504B `Q3_K_XL`	about 259GB	no clean measured source in this pass	barely above 252GB before overhead, so raw size math is not enough
REAP 504B `Q2_K_XL`	about 111GB	LocalMaxxing tracks REAP GGUF and reports 7.9 tok/s top speed, but we still need the exact run table	much easier to fit, much less useful as evidence for 4-bit-quality behavior

The important part is not that the table has big numbers. The important part is that "GLM-5.2 local requirements" changes depending on which row you mean.

We also have two adoption and benchmark context points that are worth keeping separate from hardware requirements: 0xSero said the REAP work reached 15,000 downloads in 10 days and that Zai highlighted it at AI Engineer, and the REAP 504B model card reports 70.5% on Terminal-Bench 2.1 full-89. Those are not throughput numbers.

Sentdex has a practical segment starting around 23:16, and we are treating it as one case study rather than the anchor for the article. The parts worth pulling forward are narrow: he walks through the 8-bit memory math around 26:23, says Q4_K_XL is the tier he would want if it fits around 27:24, and says the Q3_K_XL tier he tried did not feel like the same class of model around 27:57.

What GLM 5.2 requirements actually mean

There are two separate requirements.

First: enough memory to load the weights.

Second: enough fast memory and runtime headroom to serve the workload.

The second requirement is the one that gets lost. A 245GB 2-bit full-model GGUF and a 325GB Q4_K_XL REAP GGUF are both "local GLM-5.2" to a search engine. They are not the same thing to a machine.

For coding agents, we will measure at least:

Metric	Why we care
model artifact	full 753B, REAP 504B, NVFP4, GGUF, 1-bit, 2-bit, 4-bit
memory placement	HBM/VRAM vs LPDDR/system RAM
prefill tokens/sec	repo-sized prompts spend time here
decode tokens/sec	visible output speed
context length	4K, 64K, 256K, and 1M are different tests
concurrency	one user and five users can expose different bottlenecks
KV cache policy	long sessions depend on this
quality	benchmark score, loop rate, tool reliability, coding pass rate

We will bring more of these numbers as we collect them. The hard part is not running one prompt. The hard part is getting comparable runs with the exact model artifact, runtime, quant, context, cache settings, and hardware disclosed.

How to run GLM 5.2 locally

If you want the full model name, start with a full-model GGUF from the Unsloth path and choose the quant tier that fits your memory budget. The public memory table says 223GB for 1-bit, 245GB for 2-bit, 372-475GB for 4-bit, and 570GB for 5-bit.

If you want the 0xSero REAP 504B GGUF path, the Hugging Face model card gives two llama.cpp shapes.

If you install through the llama.app installer shown on the model card, the command uses the unified llama launcher:

llama serve -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL

or:

llama cli -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL

If you install llama.cpp from a prebuilt binary, source build, or Homebrew, the executable names are usually explicit:

llama-server -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL

or:

llama-cli -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL

We verified locally that llama-cli and llama-server both accept -hf as the Hugging Face repo argument. The command does not make Q4_K_XL fit. It only names the artifact. The listed size is about 325GB before KV cache and runtime overhead.

What we would not claim yet

We would not claim that REAP 504B equals full GLM-5.2. The 70.5% Terminal-Bench 2.1 full-89 number is lower than the full-model numbers we have seen, and the harnesses are not identical.

We would not claim that a 2-bit run tells you what a 4-bit run will feel like.

We would not claim that total memory is enough. That is why the DGX Station memory question matters: DGX Station has 748GB coherent memory, but only 252GB is HBM3e.

Glossary

Term	Meaning
GGUF	A model file format commonly used by llama.cpp for local inference.
REAP	Router-weighted Expert Activation Pruning. It scores MoE experts by saliency, roughly gate weight times expert-output norm, over a calibration set. 0xSero's GLM-5.2 504B report says it keeps 168 of 256 routed experts per layer.
Quant	A quantized model artifact. Lower-bit quants use less memory, but quality and runtime behavior can change.
BPW	Bits per weight. A rough way to describe how compressed the model weights are.
BF16	16-bit brain floating point weights. Larger than 4-bit, 3-bit, or 2-bit quantized artifacts.
NVFP4	NVIDIA 4-bit floating point format used in some GLM-5.2 local reports.
AWQ	Activation-aware weight quantization, a quantization approach used in some optimized serving recipes.
MTP	Multi-token prediction. A speculative decoding technique that can improve output speed when it works well.
MLA	Multi-head latent attention. It changes memory and cache behavior compared with more conventional attention layouts.
DSA	DeepSeek sparse attention, the attention architecture referenced in several GLM-5.2 local serving recipes.
KV cache	Runtime memory used to store attention keys and values as context grows. It is separate from model weights.
Prefill	Prompt-processing speed before the model starts generating output tokens.
Decode	Output-token generation speed after prefill.
tok/s	Tokens per second. Check whether a source means decode speed, prefill speed, or aggregate output across users.
TTFT	Time to first token. Long prompts can make this matter as much as decode speed.
VRAM	GPU memory. For large local models, capacity and bandwidth both matter.
HBM	High Bandwidth Memory. This is the fast memory tier people care about on DGX Station and datacenter GPUs.
HBM3e	A generation of High Bandwidth Memory. DGX Station's fast memory tier is HBM3e.
LPDDR	Lower-power system memory. It can add capacity, but it is not the same thing as HBM or GPU VRAM.
Unified memory	Memory shared by CPU and GPU, as on Apple Silicon systems and some NVIDIA systems. It is not automatically equivalent to high-bandwidth VRAM.
Host RAM	System memory outside the GPU. Offloading model weights here can make a run possible, but often changes speed.
Concurrency	How many requests or users are served at once. A single-user tok/s number does not predict multi-user behavior.

Team of engineers preparing a rocket for launch - Ship your next feature with Cyrus

Trusted by product teams at Retool, Gamma, TinyFish, and more

Break free from the terminal

As your Claude Code powered Linear agent, Cyrus is capable of accomplishing whatever large or small issues you throw at it. Get PMs, designers and the CEO shipping product.

Start shipping 20x faster