Fig. 6.0 · Compatibility

Every GPU. Every CPU. Every edge box.

Runtime probes the hardware on first start. Picks the engine. Serves the API. You do not configure anything.

[ Support levels ]

Three levels of support.

Production: running for customers today. Supported: works, benchmarked. Experimental: boots, validating. Talk to us before committing to experimental hardware.

Deployment environments

Production

NVIDIA, CPU, Apple Silicon

H100, H200, A100, L4, L40S, A10, RTX 30/40 series, Jetson Orin family, x86/ARM CPUs, M-series Macs.

Supported

AMD, Intel accelerators

MI300X, MI250X via ROCm. Intel Gaudi 2/3 via Habana. Intel Arc/Xe via OpenVINO.

Experimental

TPU, niche accelerators

Google TPU v4/v5e, Hailo, AWS Trainium/Inferentia, Qualcomm Cloud AI 100. Validation ongoing.

[ NVIDIA ]

NVIDIA

The most tested path. CUDA is the primary target across vLLM, TensorRT-LLM, and llama.cpp. The majority of Sector88 deployments run on NVIDIA hardware.

Runtime architecture
Generation Cards VRAM Default engine Status
Hopper H100, H200 80-141 GB vLLM / TRT-LLM Production
Blackwell B100, B200, GB200 192+ GB vLLM / TRT-LLM Supported
Ampere A100, A40, A30, A10, A2 24-80 GB vLLM Production
Ada Lovelace L4, L40, L40S, RTX 4090/4080 24-48 GB vLLM (FP8) Production
Turing / consumer Ampere RTX 30/20 series, T4 8-24 GB vLLM / llama.cpp Production
Jetson (edge) Orin AGX, Orin NX, Orin Nano, Xavier 8-64 GB unified llama.cpp / TRT-LLM Production
DGX / HGX systems DGX H100, HGX H200, DGX B200 640+ GB aggregate TRT-LLM (TP/PP) Production

Minimum: CUDA 12.1+. Older Pascal/Volta hardware (P100, V100) runs via llama.cpp but throughput is limited.

[ AMD ]

AMD

ROCm is a first-class target for vLLM. MI300X packs 192GB HBM3 per card, often better price-per-token than H100. We absorb the ROCm complexity.

Memory tier diagram
Family Cards VRAM Engine Status
Instinct (datacentre) MI300X, MI300A, MI250X, MI210 64-192 GB vLLM (ROCm) Supported
Radeon (workstation) RX 7900 XTX, W7900 24-48 GB llama.cpp / vLLM Supported
Ryzen AI (laptop/mini-PC) 8000-series APU + NPU System RAM llama.cpp Experimental

ROCm 6.0+ on Linux. Tensor parallelism and FP8/FP16 on MI300X.

[ Intel ]

Intel

Gaudi for high-throughput inference. Xeon AMX for CPU-only. Arc/Xe for workstation. The right call when supply chain or compliance requires Intel.

Family Hardware Engine Best for Status
Gaudi Gaudi 2, Gaudi 3 vLLM (HPU plugin) High-throughput datacentre, NVIDIA alternative Supported
Xeon (CPU) Sapphire Rapids, Emerald Rapids, Granite Rapids llama.cpp (AMX) CPU-only inference, AMX accelerated Production
Arc / Xe Arc A770, Xe-LP/HPG llama.cpp / OpenVINO Workstation, secondary fleet Supported
Core Ultra (NPU) Meteor Lake, Lunar Lake, Arrow Lake OpenVINO Laptop, light edge Experimental

[ Everything else ]

Apple, ARM, CPU-only, niche.

llama.cpp runs on almost anything with a C compiler. Runtime selects it automatically when no GPU is present.

Memory orchestration

Apple Silicon

M1 / M2 / M3 / M4

Unified memory architecture is excellent for inference. A Mac Studio with 192GB unified memory runs 70B models comfortably via Metal-accelerated llama.cpp.

Engine: llama.cpp · Status: Production

ARM servers

Ampere Altra, AWS Graviton, NVIDIA Grace

SVE/SVE2 vector extensions supported in llama.cpp. NVIDIA Grace pairs ARM cores with Hopper GPUs for unified-memory inference.

Engine: llama.cpp · Status: Supported

x86 CPU-only

EPYC, Xeon, Ryzen

For sites with no GPU at all. Runs 3B to 8B quantized models at usable interactive rates. Often the right fit for compliance-constrained environments.

Engine: llama.cpp (AVX-512/AMX) · Status: Production

Niche / under validation

TPU, Trainium, Hailo, Cloud AI 100

Supported via vendor-specific runtimes (vLLM-TPU, Neuron SDK, Hailo SDK). Talk to us if your stack requires one of these.

Status: Experimental

[ Sizing ]

Model size vs. memory.

First-pass sizing guide. Use the calculator for precise answers on your specific workload.

API and engine selection
Model size FP16 weights INT8 INT4 (Q4_K_M) Minimum card
1B ~2 GB ~1 GB ~0.7 GB Jetson Nano, Apple M-series, any modern laptop
7-8B ~14-16 GB ~7-8 GB ~4-5 GB L4 24GB, RTX 3090/4090, A10, M2/M3 Pro+
13B ~26 GB ~13 GB ~7-8 GB L40S 48GB, A100 40GB, RTX 4090 (quantized)
70B ~140 GB ~70 GB ~40 GB 2x A100 80GB, H100, MI300X, Mac Studio 192GB
405B ~810 GB ~405 GB ~230 GB 8x H100, DGX H200, MI300X cluster

Add 10-25% on top of weights for KV-cache at your target context length. Sector88 memory tiering recovers most of this and lets you exceed the naive VRAM limit, especially for batched serving.

Tell us your hardware.

Send us your spec and the model you want to run. We will tell you honestly whether it fits, how fast it will run, and which tier applies to your configuration.