GLM 5.2 local hardware requirements: reported paths by quant

Most "GLM-5.2 local" discussion mixes together several different artifacts.
We should separate them before talking about requirements:
- the full zai-org/GLM-5.2 model
- low-bit GGUF builds of the full model
- pruned derivatives like 0xSero/GLM-5.2-504B
- GGUF builds of that pruned derivative, like 0xSero/GLM-5.2-REAP-504B-GGUF
Those are not the same hardware problem.
At Cyrus, we care about this because coding agents are not short chat demos. They keep repository context around, call tools, recover from mistakes, and spend a lot of time in long prompts. For that workload, "the model loaded" is not the same as "the model is usable."
Reported hardware paths
If someone says "GLM-5.2 runs locally," the next question is: which artifact, which quant, and where are the weights actually sitting?
These are the GLM-5.2-family reports we have been able to tie to specific hardware. They are not all the same benchmark.
| Hardware | Model | Quantization | Technical identifier | Size / memory clue | Reported feel | Source |
|---|---|---|---|---|---|---|
| DGX Station | GLM-5.2-REAP 504B | unknown | GLM-5.2-REAP-504B-GGUF; exact quant not confirmed | if Q4_K_XL, card lists about 325-331GB; if smaller, conclusion changes | about 60 tok/s; no prefill, context, or concurrency yet | 0xSero on X |
| 4x DGX Spark / GB10 | full GLM-5.2 | 4-bit | UD-IQ4_XS | about 365GB across nodes | 6.28 tok/s decode at C=1 | NVIDIA forum recipe |
| 4x GB10 | full GLM-5.2 with routed-expert prune | 4-bit | AWQ INT4 + pruning + MTP | custom optimized recipe | about 20-22 tok/s decode | NVIDIA forum recipe |
| 4x DGX Spark | full GLM-5.2, not REAP | 4-bit | NVFP4 hybrid | patched vLLM, tight memory assumptions | 14.5-15.2 tok/s decode, 450-512 tok/s prefill | NVIDIA forum recipe |
| one or two M3 Ultra 512GB systems | full GLM-5.2 | 4-bit | NVFP4 | 512GB unified memory system(s) | 18.8 tok/s on one, 23.4 tok/s on two; basic decode test | Ivan Fioravanti on X |
| 2x RTX PRO 6000 Blackwell plus 1TB DDR5 | full GLM-5.2 | 4-bit | UD-Q4_K_XL | 1TB DDR5 in the system; not all weights live in VRAM | 13-15 tok/s decode, 64 tok/s prefill | Samuel Cardillo on X |
| 4x RTX 3090 plus 192GB DDR5 | full GLM-5.2 | 2-bit | UD-IQ2_M | 223GB on disk; 96GB VRAM plus host RAM | about 7.3 tok/s decode, about 135 tok/s prefill | Reddit report |
| 2x RTX 5090 plus 512GB DDR5 ECC | full GLM-5.2 | 5-bit | UD-Q5_K_S | 492GB weights; high-RAM workstation | about 12 tok/s | Reddit report |
| Dell PowerEdge R740, dual Xeon, 768GB RAM | full GLM-5.2 | 2-bit | UD-Q2_K_XL | CPU-only, 768GB RAM | 4-5.5 tok/s basic chat, about 3 tok/s in opencode | Reddit report |
The pattern is clear. Low-bit artifacts can run on varied hardware. Higher-quality artifacts require either a lot of GPU memory, a lot of unified memory, or a willingness to put host RAM in the hot path.
Cost read
The practical questions are: what can we run, how does it feel, and what does it cost to get there?
This is not a shopping guide. Prices move, used hardware is messy, and several of these runs depend on custom recipes. But the cost shape matters because "GLM-5.2 locally" can mean anything from a slow CPU experiment to a six-figure DGX Station.
| Path | Rough hardware cost signal | What can run | How it feels from reports |
|---|---|---|---|
| CPU-only server with 768GB RAM | cheap if already owned; not a great reason to buy a server from scratch | low-bit full GLM-5.2 | possible, but 3-5.5 tok/s is a patience test |
| 4x used RTX 3090 plus host RAM | usually the lowest-cost GPU path in these reports | 2-bit full GLM-5.2 through offload | about 7.3 tok/s decode; interesting, not luxurious |
| 2x RTX 5090 plus 512GB RAM | high-end consumer workstation pricing, very market-dependent | 5-bit full GLM-5.2 with host RAM in the story | about 12 tok/s; capacity is doing a lot of work |
| 2x RTX PRO 6000 Blackwell plus 1TB RAM | NVIDIA marketplace listed the RTX PRO 6000 Blackwell Workstation Edition at $13,250 per GPU when we checked | 4-bit full GLM-5.2 | 13-15 tok/s decode, 64 tok/s prefill |
| 4x DGX Spark / GB10 | NVIDIA marketplace listed DGX Spark at $4,699 when we checked, so four nodes are about $18.8k before anything else | full GLM-5.2 4-bit-class paths, depending on the recipe | 6.28 tok/s on one reproducible llama.cpp path; higher with more custom recipes |
| M3 Ultra 512GB systems | a reported run, but not a clean current-buying recipe | full GLM-5.2 NVFP4 | 18.8 tok/s on one, 23.4 on two in a basic decode test |
| DGX Station | about $100k from our inquiry; some public partner listings have been in the same neighborhood | GLM-5.2-REAP 504B shown at AI Engineer; larger local frontier workloads are the question | about 60 tok/s for the reported GLM-5.2-REAP run, but missing exact quant, context, prefill, and concurrency |
That table is why the DGX Station question is not just "is it expensive?" It is whether the machine can turn a much larger coherent-memory box into a better local-agent experience than the cheaper paths above. We split the DGX Station memory profile into its own post because that question depends on the 252GB HBM tier, not just the 748GB headline.
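The cost shape in the table can be put into rough numbers. The sketch below only uses price signals and decode speeds quoted above; the function name and the choice of "dollars per decode tok/s" as a ratio are our own illustration, not a benchmark. The runs use different artifacts, quants, and runtimes, so treat the ratios as a way to see the shape, not as a ranking.

```python
# Price signals quoted in the cost table (point-in-time checks, not current quotes).
DGX_SPARK_USD = 4_699        # per node
RTX_PRO_6000_USD = 13_250    # per GPU, Workstation Edition
DGX_STATION_USD = 100_000    # "about $100k" from an inquiry


def dollars_per_decode_tok_s(hardware_usd: float, decode_tok_s: float) -> float:
    """Hardware cost divided by a reported single-stream decode speed."""
    if decode_tok_s <= 0:
        raise ValueError("decode speed must be positive")
    return hardware_usd / decode_tok_s


spark_total = 4 * DGX_SPARK_USD        # four nodes, nothing else included
rtx_gpu_total = 2 * RTX_PRO_6000_USD   # GPUs only; excludes the 1TB DDR5 host

# (cost signal in USD, reported decode tok/s) taken from the tables above.
paths = {
    "4x DGX Spark, llama.cpp path, full model": (spark_total, 6.28),
    "2x RTX PRO 6000, GPUs only, full model": (rtx_gpu_total, 13.0),  # low end of 13-15
    "DGX Station, REAP 504B, quant unconfirmed": (DGX_STATION_USD, 60.0),
}

for name, (usd, tok_s) in paths.items():
    ratio = dollars_per_decode_tok_s(usd, tok_s)
    print(f"{name}: ${usd:,} total, about ${ratio:,.0f} per decode tok/s")
```

The DGX Station row is the least comparable of the three: it is a pruned model with an unconfirmed quant, and the other two are full-model runs.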
Artifact size and capacity matrix
This is the closest thing to a GLM-5.2 requirements table, but it still needs caveats.
| Artifact class | Size / memory clue | Reported hardware that ran it | Capacity read |
|---|---|---|---|
| Full GLM-5.2 1-bit class | about 202-223GB in reports | 2x DGX Spark, 2x M5 Max | 256GB+ memory can be enough, but quality is low-bit |
| Full GLM-5.2 2-bit class | about 223-245GB in reports | 4x3090 plus RAM, 5090+3090 plus 256GB RAM, CPU-only 768GB RAM | this is where local runs become common, but not necessarily pleasant |
| Full GLM-5.2 4-bit / NVFP4 / AWQ class | roughly 365-467GB depending on artifact and compression | 4x DGX Spark, 2x M3 Ultra, 2x RTX PRO 6000 plus 1TB RAM | this is the serious local hardware tier |
| Full GLM-5.2 5-bit class | about 492-570GB | 2x RTX 5090 plus 512GB RAM for one report | capacity is possible; speed depends on how much leaves fast memory |
| Full GLM-5.2 8-bit class | about 810GB | no clean local serving report in this pass | outside most single-workstation setups |
| REAP 504B Q4_K_XL | about 325-331GB | DGX Station report may be this class, but exact quant is not confirmed | bigger than 252GB of DGX Station HBM before KV cache/runtime overhead |
| REAP 504B Q3_K_XL | about 259GB | no clean measured source in this pass | barely above 252GB before overhead, so raw size math is not enough |
| REAP 504B Q2_K_XL | about 111GB | LocalMaxxing tracks REAP GGUF and reports 7.9 tok/s top speed, but we still need the exact run table | much easier to fit, much less useful as evidence for 4-bit-quality behavior |
The important part is not that the table has big numbers. The important part is that "GLM-5.2 local requirements" changes depending on which row you mean.
We also have two adoption and benchmark context points that are worth keeping separate from hardware requirements: 0xSero said the REAP work reached 15,000 downloads in 10 days and that Zai highlighted it at AI Engineer, and the REAP 504B model card reports 70.5% on Terminal-Bench 2.1 full-89. Those are not throughput numbers.
Sentdex has a practical segment starting around 23:16, and we are treating it as one case study rather than the anchor for the article. The parts worth pulling forward are narrow: he walks through the 8-bit memory math around 26:23, says Q4_K_XL is the tier he would want if it fits around 27:24, and says the Q3_K_XL tier he tried did not feel like the same class of model around 27:57.
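The memory math behind these sizes is simple enough to sketch. The snippet below assumes 1 GB = 1e9 bytes and uses the 753B and 504B parameter counts from the artifact names in this post; the function names are ours. A uniform bit width gives a floor, and the reported file sizes sit above it, which is consistent with quantized artifacts keeping some tensors at higher precision and storing scale metadata. That explanation is our assumption, not something the model cards state here.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes) at one uniform bit width."""
    return params_billion * bits_per_weight / 8


def effective_bpw(size_gb: float, params_billion: float) -> float:
    """Average bits per weight implied by a reported artifact size."""
    return size_gb * 8 / params_billion


# Floor for a uniform 8-bit copy of the full 753B model.
print(weights_gb(753, 8))                  # 753.0 GB, versus about 810GB reported

# Implied average bits per weight for sizes reported in the matrix above.
print(round(effective_bpw(810, 753), 2))   # full model, 8-bit class: about 8.61
print(round(effective_bpw(325, 504), 2))   # REAP 504B Q4_K_XL: about 5.16
print(round(effective_bpw(259, 504), 2))   # REAP 504B Q3_K_XL: about 4.11
```

The takeaway is that a "4-bit" label does not mean 4.0 bits per weight on disk, so sizing from the label alone will undershoot.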
What GLM 5.2 requirements actually mean
There are two separate requirements.
First: enough memory to load the weights.
Second: enough fast memory and runtime headroom to serve the workload.
The second requirement is the one that gets lost. A 245GB 2-bit full-model GGUF and a 325GB Q4_K_XL REAP GGUF are both "local GLM-5.2" to a search engine. They are not the same thing to a machine.
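A minimal sketch of the two requirements as a check, using the DGX Station figures from this post (252GB HBM, 748GB coherent memory). The class, function, and `overhead_gb` parameter are hypothetical names for illustration; `overhead_gb` stands in for KV cache and runtime overhead, which we have not measured, so it defaults to zero and the result is an optimistic bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryTiers:
    fast_gb: float    # HBM or VRAM
    total_gb: float   # fast tier plus slower coherent or host memory


def placement(weights_gb: float, tiers: MemoryTiers, overhead_gb: float = 0.0) -> str:
    """Classify a run by where the weights (plus assumed overhead) can live."""
    need = weights_gb + overhead_gb
    if need <= tiers.fast_gb:
        return "fits in fast memory"
    if need <= tiers.total_gb:
        return "loads, but spills into slower memory"
    return "does not load"


dgx_station = MemoryTiers(fast_gb=252, total_gb=748)

# Artifact sizes from the capacity matrix above.
for label, size_gb in [
    ("REAP 504B Q2_K_XL", 111),
    ("REAP 504B Q3_K_XL", 259),
    ("REAP 504B Q4_K_XL", 325),
    ("Full GLM-5.2 8-bit class", 810),
]:
    print(f"{label} ({size_gb}GB): {placement(size_gb, dgx_station)}")
```

Only the first outcome satisfies both requirements. The middle outcome is the "model loaded" case that still needs prefill, decode, and context measurements before anyone can call it usable.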
For coding agents, we will measure at least:
| Metric | Why we care |
|---|---|
| model artifact | full 753B, REAP 504B, NVFP4, GGUF, 1-bit, 2-bit, 4-bit |
| memory placement | HBM/VRAM vs LPDDR/system RAM |
| prefill tokens/sec | repo-sized prompts spend time here |
| decode tokens/sec | visible output speed |
| context length | 4K, 64K, 256K, and 1M are different tests |
| concurrency | one user and five users can expose different bottlenecks |
| KV cache policy | long sessions depend on this |
| quality | benchmark score, loop rate, tool reliability, coding pass rate |
We will bring more of these numbers as we collect them. The hard part is not running one prompt. The hard part is getting comparable runs with the exact model artifact, runtime, quant, context, cache settings, and hardware disclosed.
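One way to keep runs comparable is to record every report in the same shape and list what is missing. The record below is a hypothetical sketch: the field names mirror the metrics table and the disclosure list above, and are not a published schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RunReport:
    hardware: str
    model_artifact: str                      # e.g. full 753B or REAP 504B
    runtime: Optional[str] = None            # e.g. llama.cpp, vLLM
    quant: Optional[str] = None
    memory_placement: Optional[str] = None   # HBM/VRAM vs LPDDR/system RAM
    prefill_tok_s: Optional[float] = None
    decode_tok_s: Optional[float] = None
    context_tokens: Optional[int] = None
    concurrency: Optional[int] = None
    kv_cache_policy: Optional[str] = None
    quality_note: Optional[str] = None


def missing_fields(report: RunReport) -> List[str]:
    """Names of the fields a report has not disclosed."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# The DGX Station report as described in this post: only decode speed is known.
station = RunReport(
    hardware="DGX Station",
    model_artifact="GLM-5.2-REAP 504B",
    decode_tok_s=60.0,
)
print(missing_fields(station))
```

A report with a long missing list is still useful as a signal, but it should not be averaged with or ranked against fully disclosed runs.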
How to run GLM 5.2 locally
If you want the full model, start with a full-model GGUF from the Unsloth path and choose the quant tier that fits your memory budget. The public memory table says 223GB for 1-bit, 245GB for 2-bit, 372-475GB for 4-bit, and 570GB for 5-bit.
If you want the 0xSero REAP 504B GGUF path, the Hugging Face model card gives two llama.cpp shapes.
If you install through the llama.app installer shown on the model card, the command uses the unified llama launcher:
llama serve -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL
or:
llama cli -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL
If you install llama.cpp from a prebuilt binary, source build, or Homebrew, the executable names are usually explicit:
llama-server -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL
or:
llama-cli -hf 0xSero/GLM-5.2-REAP-504B-GGUF:Q4_K_XL
We verified locally that llama-cli and llama-server both accept -hf as the Hugging Face repo argument. The command does not make Q4_K_XL fit. It only names the artifact. The listed size is about 325GB before KV cache and runtime overhead.
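If you script these launches, the two command shapes differ only in the executable. The helper below is a hypothetical convenience wrapper that reproduces the four commands above and nothing more; it adds no flags beyond -hf, and `llama_command` is our own name, not part of llama.cpp.

```python
import shlex
from typing import List

REPO = "0xSero/GLM-5.2-REAP-504B-GGUF"


def llama_command(mode: str, quant: str = "Q4_K_XL", unified_launcher: bool = False) -> List[str]:
    """Build the argument list for a serve or cli launch of the REAP GGUF."""
    if mode not in ("serve", "cli"):
        raise ValueError("mode must be 'serve' or 'cli'")
    if unified_launcher:
        # llama.app installer form: one launcher with a subcommand.
        executable = ["llama", mode]
    else:
        # Prebuilt binary, source build, or Homebrew form: explicit names.
        executable = ["llama-server" if mode == "serve" else "llama-cli"]
    return executable + ["-hf", f"{REPO}:{quant}"]


print(shlex.join(llama_command("serve")))
print(shlex.join(llama_command("cli", unified_launcher=True)))
```

Returning a list rather than a string lets you pass the result straight to `subprocess.run` without shell quoting. As with the raw commands, building the command says nothing about whether the artifact fits.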
What we would not claim yet
We would not claim that REAP 504B equals full GLM-5.2. The 70.5% Terminal-Bench 2.1 full-89 number is lower than the full-model numbers we have seen, and the harnesses are not identical.
We would not claim that a 2-bit run tells you what a 4-bit run will feel like.
We would not claim that total memory is enough. That is why the DGX Station memory question matters: DGX Station has 748GB coherent memory, but only 252GB is HBM3e.
Glossary
| Term | Meaning |
|---|---|
| GGUF | A model file format commonly used by llama.cpp for local inference. |
| REAP | Router-weighted Expert Activation Pruning. It scores MoE experts by saliency, roughly gate weight times expert-output norm, over a calibration set. 0xSero's GLM-5.2 504B report says it keeps 168 of 256 routed experts per layer. |
| Quant | A quantized model artifact. Lower-bit quants use less memory, but quality and runtime behavior can change. |
| BPW | Bits per weight. A rough way to describe how compressed the model weights are. |
| BF16 | 16-bit brain floating point weights. Larger than 4-bit, 3-bit, or 2-bit quantized artifacts. |
| NVFP4 | NVIDIA 4-bit floating point format used in some GLM-5.2 local reports. |
| AWQ | Activation-aware weight quantization, a quantization approach used in some optimized serving recipes. |
| MTP | Multi-token prediction. A speculative decoding technique that can improve output speed when it works well. |
| MLA | Multi-head latent attention. It changes memory and cache behavior compared with more conventional attention layouts. |
| DSA | DeepSeek sparse attention, the attention architecture referenced in several GLM-5.2 local serving recipes. |
| KV cache | Runtime memory used to store attention keys and values as context grows. It is separate from model weights. |
| Prefill | Prompt-processing speed before the model starts generating output tokens. |
| Decode | Output-token generation speed after prefill. |
| tok/s | Tokens per second. Check whether a source means decode speed, prefill speed, or aggregate output across users. |
| TTFT | Time to first token. Long prompts can make this matter as much as decode speed. |
| VRAM | GPU memory. For large local models, capacity and bandwidth both matter. |
| HBM | High Bandwidth Memory. This is the fast memory tier people care about on DGX Station and datacenter GPUs. |
| HBM3e | A generation of High Bandwidth Memory. DGX Station's fast memory tier is HBM3e. |
| LPDDR | Low-Power Double Data Rate memory, a lower-power class of system memory. It can add capacity, but it is not the same thing as HBM or GPU VRAM. |
| Unified memory | Memory shared by CPU and GPU, as on Apple Silicon systems and some NVIDIA systems. It is not automatically equivalent to high-bandwidth VRAM. |
| Host RAM | System memory outside the GPU. Offloading model weights here can make a run possible, but often changes speed. |
| Concurrency | How many requests or users are served at once. A single-user tok/s number does not predict multi-user behavior. |


