Hardware Architecture¶
Deep dive into the AMD Strix Point APU architecture and why it's well-suited for local AI inference.
APU Overview¶
An APU (Accelerated Processing Unit) combines CPU and GPU on a single die, sharing a unified memory pool. Unlike discrete GPU systems where the GPU has dedicated VRAM accessed over PCIe, the APU's integrated graphics directly accesses system RAM.
Traditional Discrete GPU Setup:

```
+-------------+     PCIe x16      +------------------+
|     CPU     |<----------------->|       GPU        |
|  (System)   |     ~32 GB/s      |    (Discrete)    |
+------+------+                   +--------+---------+
       |                                   |
       v                                   v
+-------------+                   +------------------+
|  System RAM |                   |  VRAM (GDDR6X)   |
|   64-128GB  |                   |     24GB max     |
|   ~90 GB/s  |                   |     ~1 TB/s      |
+-------------+                   +------------------+
```
APU Architecture (Strix Point):

```
+--------------------------------------------------+
|              AMD Ryzen AI Max+ 395               |
|  +-------------+          +------------------+   |
|  |    Zen 5    |          |     RDNA 3.5     |   |
|  |     CPU     |          |       GPU        |   |
|  |   16 cores  |          |      40 CUs      |   |
|  +------+------+          +--------+---------+   |
|         |  On-die Infinity Fabric  |             |
|         +------------+-------------+             |
|                      |                           |
|         +------------v-------------+             |
|         |    Memory Controller     |             |
|         +--------------------------+             |
+--------------------------------------------------+
                       |
                       v
           +------------------------+
           |   DDR5-5600 (128GB)    |
           |      Dual-channel      |
           |   ~90 GB/s bandwidth   |
           +------------------------+
```
Strix Point Architecture¶
The Ryzen AI Max+ 395 is built on the Strix Point platform:
CPU Complex¶
| Specification | Detail |
|---|---|
| Architecture | Zen 5 |
| Cores | 16 |
| Threads | 32 |
| L2 Cache | 16MB (1MB/core) |
| L3 Cache | 64MB shared |
| Base Clock | 3.0 GHz |
| Boost Clock | Up to 5.1 GHz |
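For inference tuning, the core/thread split above matters: llama.cpp-style engines generally run best with one thread per physical core rather than per SMT thread. A minimal sketch assuming the llama-cpp-python binding and a hypothetical model path:

```python
# Sketch: pin inference threads to physical Zen 5 cores (16), not SMT threads (32).
# llama-cpp-python and the model path are assumptions, not part of this platform.
from llama_cpp import Llama

PHYSICAL_CORES = 16  # os.cpu_count() reports 32 because of SMT

llm = Llama(
    model_path="models/llama-3-70b-q4_k_m.gguf",  # hypothetical path
    n_threads=PHYSICAL_CORES,        # token-generation threads
    n_threads_batch=PHYSICAL_CORES,  # prompt-processing threads
)
```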
GPU Complex¶
| Specification | Detail |
|---|---|
| Architecture | RDNA 3.5 |
| Compute Units | 40 CUs |
| Stream Processors | 2560 |
| GPU ID | gfx1151 |
| ROCm Support | Experimental (requires HSA_OVERRIDE_GFX_VERSION) |
| Ray Accelerators | 40 |
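Because gfx1151 support is experimental, the override has to be in the environment before the HIP runtime initializes. A minimal sketch assuming a ROCm build of PyTorch; the override value shown (11.0.0) is an assumption, so use the value documented in GPU Setup for this system:

```python
# Sketch: set the gfx override before anything initializes HIP, then confirm the
# iGPU is visible to a ROCm build of PyTorch. The override value is an assumption.
import os
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")

import torch  # import after the environment variable is set

if torch.cuda.is_available():  # ROCm builds expose devices through the cuda API
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}, {props.total_memory / 2**30:.1f} GiB visible")
else:
    print("No ROCm-visible GPU; check the override value and the ROCm install")
```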
Memory Subsystem¶
| Specification | Detail |
|---|---|
| Type | DDR5-5600 |
| Channels | Dual-channel |
| Maximum Capacity | 128GB (2x64GB) |
| Theoretical Bandwidth | 89.6 GB/s |
| Practical Bandwidth | ~70-80 GB/s |
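The theoretical figure follows directly from the transfer rate and channel width; a quick check of the arithmetic:

```python
# Theoretical peak for dual-channel DDR5-5600: transfers/sec x bytes/transfer x channels.
transfer_rate = 5600e6  # 5600 MT/s
bus_bytes = 8           # 64-bit channel = 8 bytes per transfer
channels = 2

print(f"{transfer_rate * bus_bytes * channels / 1e9:.1f} GB/s")  # 89.6 GB/s
```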
Bandwidth Analysis¶
Memory bandwidth directly affects LLM inference speed. During single-stream decoding, each generated token requires a pass over the model's weights, so bandwidth puts a ceiling on the naive token rate:

Token generation rate ≈ Memory Bandwidth / Model Size

Example with a 70B Q4 model (~35GB):
- DDR5-5600: 80 GB/s / 35GB ≈ 2.3 full reads/sec ≈ 2.3 tokens/sec (theoretical single-stream ceiling)
- Actual: 8-15 tokens/sec (parallelism, caching, and batching raise aggregate throughput well past the naive single-stream figure)

Example with a 70B Q8 model (~70GB):
- DDR5-5600: 80 GB/s / 70GB ≈ 1.1 full reads/sec base
- Actual: 5-10 tokens/sec
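The same back-of-the-envelope calculation as a small helper, using the practical ~80 GB/s figure and the approximate model sizes above:

```python
# Single-stream decode ceiling: one full pass over the weights per generated token.
def decode_ceiling_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

for label, size_gb in [("70B Q4 (~35GB)", 35), ("70B Q6 (~52GB)", 52), ("70B Q8 (~70GB)", 70)]:
    print(f"{label}: ~{decode_ceiling_tps(80, size_gb):.1f} tokens/sec ceiling")
```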
Bandwidth Comparison¶
| Memory Type | Theoretical | Practical | Use Case |
|---|---|---|---|
| DDR5-5600 (MS-S1 MAX) | 89.6 GB/s | ~75 GB/s | Large models, slow inference |
| GDDR6X (RTX 4090) | 1008 GB/s | ~900 GB/s | Small models, fast inference |
| HBM3 (H100) | 3350 GB/s | ~3000 GB/s | Enterprise inference |
| Unified (M4 Max) | 546 GB/s | ~400 GB/s | Balanced approach |
The MS-S1 MAX trades speed for capacity. A 70B model at Q6 (~52GB) runs entirely in memory, which is impossible on a 24GB discrete GPU without CPU offloading, and offloading creates its own bandwidth bottleneck over the ~32 GB/s PCIe link.
Platform Comparison¶
| Aspect | MS-S1 MAX | Mac Studio M4 Max | GPU Workstation |
|---|---|---|---|
| Memory | 128GB DDR5 | 128GB Unified | 64GB + 24GB VRAM |
| GPU Memory | Shared 128GB | Shared 128GB | 24GB dedicated |
| Memory Bandwidth | ~90 GB/s | ~546 GB/s | ~90 + 1000 GB/s |
| Max Model (Q4) | 200B+ | 200B+ | 45B (GPU only) |
| Max Model (Q8) | 100B+ | 100B+ | 22B (GPU only) |
| Power Draw | 55-120W | 40-120W | 200-600W |
| OS | Linux (ROCm) | macOS (Metal) | Linux (CUDA) |
| Price (2024) | ~$2000 | ~$4000 | ~$5000+ |
Unified Memory Explained¶
In a discrete GPU system, data must be copied between system RAM and VRAM:
- Load model into system RAM
- Copy relevant portions to VRAM (limited by VRAM size)
- Run inference on GPU
- Copy results back to system RAM
This creates the "VRAM wall" - models larger than VRAM must be split across CPU and GPU, with PCIe becoming the bottleneck.
With unified memory:
- Load model into RAM
- Both CPU and GPU access the same memory directly
- No copying, no PCIe bottleneck
- Entire 128GB available for models
The trade-off is bandwidth - DDR5 is slower than GDDR6X. But for large models, having the model fit entirely in GPU-accessible memory is more important than raw bandwidth.
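As an illustration with a llama.cpp-based runtime (llama-cpp-python here; the paths and layer counts are hypothetical), a unified-memory system can offload every layer, while a 24GB card forces a CPU/GPU split:

```python
# Sketch: layer offload on unified memory vs. a 24GB discrete card.
# Paths and layer counts are illustrative, not measured values.
from llama_cpp import Llama

# MS-S1 MAX: the whole model sits in the shared 128GB pool, so every layer
# can be offloaded to the GPU with no per-token PCIe copies.
llm_unified = Llama(model_path="models/llama-3-70b-q6_k.gguf", n_gpu_layers=-1)

# 24GB discrete GPU: only part of an 80-layer 70B model fits in VRAM; the rest
# runs on the CPU, and activations cross the PCIe link on every token.
llm_split = Llama(model_path="models/llama-3-70b-q6_k.gguf", n_gpu_layers=30)
```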
Thermal Considerations¶
The MS-S1 MAX manages thermals through:
- Active cooling: Dual-fan system with vapor chamber
- TDP configuration: BIOS-adjustable from 55W to 120W
- Throttling behavior: CPU/GPU reduce clocks under thermal pressure
For sustained AI workloads:
- Ambient temperature matters significantly
- 55W TDP provides stable, quiet operation
- 120W TDP for maximum performance with higher noise
- GPU-heavy workloads (inference) stress cooling more than CPU-heavy ones (see the temperature-polling sketch below)
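For keeping an eye on temperatures during long runs, the amdgpu driver exposes standard hwmon sensors in sysfs. A rough polling sketch; sensor names follow the kernel's hwmon convention and may differ between driver versions:

```python
# Sketch: poll amdgpu temperature sensors via sysfs hwmon during sustained load.
import time
from pathlib import Path

def amdgpu_temps_c() -> dict[str, float]:
    temps = {}
    for hwmon in Path("/sys/class/hwmon").glob("hwmon*"):
        if (hwmon / "name").read_text().strip() == "amdgpu":
            for sensor in sorted(hwmon.glob("temp*_input")):
                temps[sensor.name] = int(sensor.read_text()) / 1000  # millidegrees -> Celsius
    return temps

if __name__ == "__main__":
    while True:
        print(amdgpu_temps_c())
        time.sleep(5)
```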
Practical Implications¶
What Works Well¶
- 70B models at Q4-Q6: Core use case, fits comfortably
- 405B models at Q2-Q3: Fits in memory, slow but functional
- Multiple smaller models: Can keep several 7B-13B models loaded
- Long context: Memory capacity allows large context windows (see the KV-cache estimate below)
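The long-context point can be made concrete with a KV-cache estimate. The sketch below assumes Llama-3-70B-style dimensions (80 layers, 8 KV heads of dimension 128, fp16 cache); the exact numbers depend on the model and cache quantization:

```python
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes per value.
def kv_cache_gib(tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_val: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    return tokens * per_token / 2**30

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gib(ctx):.1f} GiB of KV cache")
```

Under these assumptions a 128K context costs roughly 40 GiB of cache on top of the model weights, which still fits in 128GB of unified memory but is far out of reach for a 24GB card.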
Limitations¶
- Speed: Expect 5-15 tokens/sec for 70B models
- ROCm support: Strix Point requires environment variable overrides
- No tensor cores: RDNA 3.5 lacks dedicated AI accelerators
- Single GPU: Cannot scale with additional GPUs
Related Documentation¶
- Hardware - System specifications
- BIOS Setup - Optimizing BIOS settings for AI workloads
- GPU Setup - ROCm installation and configuration
- Memory Configuration - UMA frame buffer settings