PyTorch¶
PyTorch on the two supported targets:
- MS-S1 MAX: AMD Strix Halo iGPU (
gfx1151), ROCm 7.x. - Apple Silicon laptop: M-series GPU via the
mpsbackend.
Both paths give you a Python session that does GPU-accelerated tensor ops; the install commands and a few API names differ.
MS-S1 MAX — PyTorch on ROCm¶
Prerequisites¶
- ROCm 7.x installed and working on the host (see ROCm Installation)
- Your user is in the
videoandrendergroups - Python 3.10+ in a virtual environment (do not install ROCm wheels into the system Python)
Quick sanity check before installing:
You should see gfx1151 listed as the agent.
Install the ROCm wheels¶
PyTorch ships official ROCm wheels via its own wheel index. Pick the index URL that matches your ROCm version (check pytorch.org/get-started for the current ROCm wheel URL — it changes with each ROCm major release):
python3 -m venv ~/.venvs/torch-rocm
source ~/.venvs/torch-rocm/bin/activate
pip install --upgrade pip
# ROCm 7.1 wheels (example — verify the index URL on pytorch.org)
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/rocm7.1
Verify the GPU is engaged¶
import torch
# On ROCm, the `torch.cuda` API namespace is reused for HIP devices.
# This is intentional — code written for CUDA generally runs unchanged.
print("CUDA-API available:", torch.cuda.is_available())
print("Device count: ", torch.cuda.device_count())
print("Device 0 name: ", torch.cuda.get_device_name(0))
# Force a real allocation + matmul to make sure HIP is actually wired up
x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")
z = (x @ y).sum().item()
print("Result OK:", z)
Expected output on the MS-S1 MAX: device 0 shows the AMD GPU (name varies with ROCm version, e.g. AMD Radeon Graphics (gfx1151)). While the matmul runs, rocm-smi on the host should show GPU utilisation > 0%.
Environment knobs that matter¶
| Variable | Use |
|---|---|
HSA_OVERRIDE_GFX_VERSION | Set to 11.5.1 if you are stuck on an older ROCm that doesn't recognise gfx1151. Not needed on ROCm 7.x. |
HIP_VISIBLE_DEVICES | Restrict which GPUs PyTorch sees. With one iGPU this is rarely useful, but it lets you sanity-check device 0. |
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True | Helps with fragmentation under long-running workloads. |
HSA_ENABLE_SDMA=0 | Workaround for occasional SDMA-related hangs on gfx1151. Try this if you see GPU hangs under heavy IO. |
Common gotchas¶
torch.cuda.*everywhere: PyTorch's ROCm wheels reuse thecudaPython module name.torch.cuda.is_available()returningTruedoes not mean CUDA is installed — it means the HIP backend found your AMD GPU.- AMP / bf16: Strix Halo supports bf16 in HIP. Use
torch.autocast("cuda", dtype=torch.bfloat16)rather than fp16 when you can; fp16 throughput on RDNA 3.5 is comparable but bf16 has better numerical headroom. - Memory pressure: the iGPU's "VRAM" is a slice of the unified memory pool, configured via the BIOS UMA frame buffer. If you get
out of memoryerrors at sizes that "should" fit, raise the UMA setting first — see Memory Configuration. - Wheel/runtime mismatch: ROCm wheels are built against a specific ROCm version. If you upgrade ROCm, re-install matching torch wheels on the same day. Mixed versions present as silent kernel launch failures.
Minimal training loop¶
import torch
from torch import nn
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
nn.Linear(1024, 4096),
nn.GELU(),
nn.Linear(4096, 10),
).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
for step in range(100):
x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)
with torch.autocast(device, dtype=torch.bfloat16):
out = model(x)
loss = loss_fn(out, y)
opt.zero_grad(set_to_none=True)
loss.backward()
opt.step()
if step % 10 == 0:
print(step, loss.item())
Watch rocm-smi in another shell while this runs — utilisation should be well above zero.
Apple Silicon — PyTorch with MPS¶
Install¶
The standard CPU wheels include the MPS backend on macOS:
python3 -m venv ~/.venvs/torch-mps
source ~/.venvs/torch-mps/bin/activate
pip install --upgrade pip
pip install torch torchvision torchaudio
Verify¶
import torch
print("MPS available:", torch.backends.mps.is_available())
print("MPS built: ", torch.backends.mps.is_built())
x = torch.randn(4096, 4096, device="mps")
y = torch.randn(4096, 4096, device="mps")
z = (x @ y).sum().item()
print("Result OK:", z)
While the matmul runs, sudo powermetrics --samplers gpu_power -i 1000 should show GPU power increasing.
MPS gotchas¶
- Some ops still fall back to CPU. Set
PYTORCH_ENABLE_MPS_FALLBACK=1if you want PyTorch to fall back to CPU silently instead of raising; otherwise the error tells you which op is missing. float64is not supported on MPS. Cast tofloat32before sending tensors to the device.bf16support has improved across recent PyTorch releases — verify withtorch.tensor([1.0], dtype=torch.bfloat16, device="mps")if you intend to use it.
Portable device-selection pattern¶
A single block that works on both targets and on CPU:
import torch
def pick_device() -> str:
if torch.cuda.is_available(): # ROCm (MS-S1 MAX) or CUDA
return "cuda"
if torch.backends.mps.is_available(): # Apple Silicon
return "mps"
return "cpu"
device = pick_device()
print("Using:", device)
See also¶
- PyTorch get-started page
- ROCm Installation
- Memory Configuration
- Inference Engines - if you only need inference, llama.cpp/Ollama are simpler than running PyTorch directly