Fig. 6.0 · Compatibility
Every GPU. Every CPU.
Every edge box.
Runtime probes the hardware on first start. Picks the engine. Serves the API. You do not configure anything.
[ Support levels ]
Three levels of support.
Production: running for customers today. Supported: works, benchmarked. Experimental: boots, validating. Talk to us before committing to experimental hardware.
Production
NVIDIA, CPU, Apple Silicon
H100, H200, A100, L4, L40S, A10, RTX 30/40 series, Jetson Orin family, x86/ARM CPUs, M-series Macs.
Supported
AMD, Intel accelerators
MI300X, MI250X via ROCm. Intel Gaudi 2/3 via Habana. Intel Arc/Xe via OpenVINO.
Experimental
TPU, niche accelerators
Google TPU v4/v5e, Hailo, AWS Trainium/Inferentia, Qualcomm Cloud AI 100. Validation ongoing.
[ NVIDIA ]
NVIDIA
The most tested path. CUDA is the primary target across vLLM, TensorRT-LLM, and llama.cpp. The majority of Sector88 deployments run on NVIDIA hardware.
| Generation | Cards | VRAM | Default engine | Status |
|---|---|---|---|---|
| Hopper | H100, H200 | 80-141 GB | vLLM / TRT-LLM | Production |
| Blackwell | B100, B200, GB200 | 192+ GB | vLLM / TRT-LLM | Supported |
| Ampere | A100, A40, A30, A10, A2 | 24-80 GB | vLLM | Production |
| Ada Lovelace | L4, L40, L40S, RTX 4090/4080 | 24-48 GB | vLLM (FP8) | Production |
| Turing / consumer Ampere | RTX 30/20 series, T4 | 8-24 GB | vLLM / llama.cpp | Production |
| Jetson (edge) | Orin AGX, Orin NX, Orin Nano, Xavier | 8-64 GB unified | llama.cpp / TRT-LLM | Production |
| DGX / HGX systems | DGX H100, HGX H200, DGX B200 | 640+ GB aggregate | TRT-LLM (TP/PP) | Production |
Minimum: CUDA 12.1+. Older Pascal/Volta hardware (P100, V100) runs via llama.cpp but throughput is limited.
[ AMD ]
AMD
ROCm is a first-class target for vLLM. MI300X packs 192GB HBM3 per card, often better price-per-token than H100. We absorb the ROCm complexity.
| Family | Cards | VRAM | Engine | Status |
|---|---|---|---|---|
| Instinct (datacentre) | MI300X, MI300A, MI250X, MI210 | 64-192 GB | vLLM (ROCm) | Supported |
| Radeon (workstation) | RX 7900 XTX, W7900 | 24-48 GB | llama.cpp / vLLM | Supported |
| Ryzen AI (laptop/mini-PC) | 8000-series APU + NPU | System RAM | llama.cpp | Experimental |
ROCm 6.0+ on Linux. Tensor parallelism and FP8/FP16 on MI300X.
[ Intel ]
Intel
Gaudi for high-throughput inference. Xeon AMX for CPU-only. Arc/Xe for workstation. The right call when supply chain or compliance requires Intel.
| Family | Hardware | Engine | Best for | Status |
|---|---|---|---|---|
| Gaudi | Gaudi 2, Gaudi 3 | vLLM (HPU plugin) | High-throughput datacentre, NVIDIA alternative | Supported |
| Xeon (CPU) | Sapphire Rapids, Emerald Rapids, Granite Rapids | llama.cpp (AMX) | CPU-only inference, AMX accelerated | Production |
| Arc / Xe | Arc A770, Xe-LP/HPG | llama.cpp / OpenVINO | Workstation, secondary fleet | Supported |
| Core Ultra (NPU) | Meteor Lake, Lunar Lake, Arrow Lake | OpenVINO | Laptop, light edge | Experimental |
[ Everything else ]
Apple, ARM, CPU-only, niche.
llama.cpp runs on almost anything with a C compiler. Runtime selects it automatically when no GPU is present.
Apple Silicon
M1 / M2 / M3 / M4
Unified memory architecture is excellent for inference. A Mac Studio with 192GB unified memory runs 70B models comfortably via Metal-accelerated llama.cpp.
Engine: llama.cpp · Status: Production
ARM servers
Ampere Altra, AWS Graviton, NVIDIA Grace
SVE/SVE2 vector extensions supported in llama.cpp. NVIDIA Grace pairs ARM cores with Hopper GPUs for unified-memory inference.
Engine: llama.cpp · Status: Supported
x86 CPU-only
EPYC, Xeon, Ryzen
For sites with no GPU at all. Runs 3B to 8B quantized models at usable interactive rates. Often the right fit for compliance-constrained environments.
Engine: llama.cpp (AVX-512/AMX) · Status: Production
Niche / under validation
TPU, Trainium, Hailo, Cloud AI 100
Supported via vendor-specific runtimes (vLLM-TPU, Neuron SDK, Hailo SDK). Talk to us if your stack requires one of these.
Status: Experimental
[ Sizing ]
Model size vs. memory.
First-pass sizing guide. Use the calculator for precise answers on your specific workload.
| Model size | FP16 weights | INT8 | INT4 (Q4_K_M) | Minimum card |
|---|---|---|---|---|
| 1B | ~2 GB | ~1 GB | ~0.7 GB | Jetson Nano, Apple M-series, any modern laptop |
| 7-8B | ~14-16 GB | ~7-8 GB | ~4-5 GB | L4 24GB, RTX 3090/4090, A10, M2/M3 Pro+ |
| 13B | ~26 GB | ~13 GB | ~7-8 GB | L40S 48GB, A100 40GB, RTX 4090 (quantized) |
| 70B | ~140 GB | ~70 GB | ~40 GB | 2x A100 80GB, H100, MI300X, Mac Studio 192GB |
| 405B | ~810 GB | ~405 GB | ~230 GB | 8x H100, DGX H200, MI300X cluster |
Add 10-25% on top of weights for KV-cache at your target context length. Sector88 memory tiering recovers most of this and lets you exceed the naive VRAM limit, especially for batched serving.
Tell us your hardware.
Send us your spec and the model you want to run. We will tell you honestly whether it fits, how fast it will run, and which tier applies to your configuration.