Fig. 6.0 · Compatibility

Every GPU. Every CPU.
Every edge box.

Runtime probes the hardware on first start. Picks the engine. Serves the API. You do not configure anything.

Check your hardware Ask about your config

[ Support levels ]

Three levels of support.

Production: running for customers today. Supported: works, benchmarked. Experimental: boots, validating. Talk to us before committing to experimental hardware.

Production

NVIDIA, CPU, Apple Silicon

H100, H200, A100, L4, L40S, A10, RTX 30/40 series, Jetson Orin family, x86/ARM CPUs, M-series Macs.

Supported

AMD, Intel accelerators

MI300X, MI250X via ROCm. Intel Gaudi 2/3 via Habana. Intel Arc/Xe via OpenVINO.

Experimental

TPU, niche accelerators

Google TPU v4/v5e, Hailo, AWS Trainium/Inferentia, Qualcomm Cloud AI 100. Validation ongoing.

[ NVIDIA ]

NVIDIA

The most tested path. CUDA is the primary target across vLLM, TensorRT-LLM, and llama.cpp. The majority of Sector88 deployments run on NVIDIA hardware.

Generation	Cards	VRAM	Default engine	Status
Hopper	H100, H200	80-141 GB	vLLM / TRT-LLM	Production
Blackwell	B100, B200, GB200	192+ GB	vLLM / TRT-LLM	Supported
Ampere	A100, A40, A30, A10, A2	24-80 GB	vLLM	Production
Ada Lovelace	L4, L40, L40S, RTX 4090/4080	24-48 GB	vLLM (FP8)	Production
Turing / consumer Ampere	RTX 30/20 series, T4	8-24 GB	vLLM / llama.cpp	Production
Jetson (edge)	Orin AGX, Orin NX, Orin Nano, Xavier	8-64 GB unified	llama.cpp / TRT-LLM	Production
DGX / HGX systems	DGX H100, HGX H200, DGX B200	640+ GB aggregate	TRT-LLM (TP/PP)	Production

Minimum: CUDA 12.1+. Older Pascal/Volta hardware (P100, V100) runs via llama.cpp but throughput is limited.

[ AMD ]

AMD

ROCm is a first-class target for vLLM. MI300X packs 192GB HBM3 per card, often better price-per-token than H100. We absorb the ROCm complexity.

Family	Cards	VRAM	Engine	Status
Instinct (datacentre)	MI300X, MI300A, MI250X, MI210	64-192 GB	vLLM (ROCm)	Supported
Radeon (workstation)	RX 7900 XTX, W7900	24-48 GB	llama.cpp / vLLM	Supported
Ryzen AI (laptop/mini-PC)	8000-series APU + NPU	System RAM	llama.cpp	Experimental

ROCm 6.0+ on Linux. Tensor parallelism and FP8/FP16 on MI300X.

[ Intel ]

Intel

Gaudi for high-throughput inference. Xeon AMX for CPU-only. Arc/Xe for workstation. The right call when supply chain or compliance requires Intel.

Family	Hardware	Engine	Best for	Status
Gaudi	Gaudi 2, Gaudi 3	vLLM (HPU plugin)	High-throughput datacentre, NVIDIA alternative	Supported
Xeon (CPU)	Sapphire Rapids, Emerald Rapids, Granite Rapids	llama.cpp (AMX)	CPU-only inference, AMX accelerated	Production
Arc / Xe	Arc A770, Xe-LP/HPG	llama.cpp / OpenVINO	Workstation, secondary fleet	Supported
Core Ultra (NPU)	Meteor Lake, Lunar Lake, Arrow Lake	OpenVINO	Laptop, light edge	Experimental

[ Everything else ]

Apple, ARM, CPU-only, niche.

llama.cpp runs on almost anything with a C compiler. Runtime selects it automatically when no GPU is present.

Apple Silicon

M1 / M2 / M3 / M4

Unified memory architecture is excellent for inference. A Mac Studio with 192GB unified memory runs 70B models comfortably via Metal-accelerated llama.cpp.

Engine: llama.cpp · Status: Production

ARM servers

Ampere Altra, AWS Graviton, NVIDIA Grace

SVE/SVE2 vector extensions supported in llama.cpp. NVIDIA Grace pairs ARM cores with Hopper GPUs for unified-memory inference.

Engine: llama.cpp · Status: Supported

x86 CPU-only

EPYC, Xeon, Ryzen

For sites with no GPU at all. Runs 3B to 8B quantized models at usable interactive rates. Often the right fit for compliance-constrained environments.

Engine: llama.cpp (AVX-512/AMX) · Status: Production

Niche / under validation

TPU, Trainium, Hailo, Cloud AI 100

Supported via vendor-specific runtimes (vLLM-TPU, Neuron SDK, Hailo SDK). Talk to us if your stack requires one of these.

Status: Experimental

[ Sizing ]

Model size vs. memory.

First-pass sizing guide. Use the calculator for precise answers on your specific workload.

Model size	FP16 weights	INT8	INT4 (Q4_K_M)	Minimum card
1B	~2 GB	~1 GB	~0.7 GB	Jetson Nano, Apple M-series, any modern laptop
7-8B	~14-16 GB	~7-8 GB	~4-5 GB	L4 24GB, RTX 3090/4090, A10, M2/M3 Pro+
13B	~26 GB	~13 GB	~7-8 GB	L40S 48GB, A100 40GB, RTX 4090 (quantized)
70B	~140 GB	~70 GB	~40 GB	2x A100 80GB, H100, MI300X, Mac Studio 192GB
405B	~810 GB	~405 GB	~230 GB	8x H100, DGX H200, MI300X cluster

Add 10-25% on top of weights for KV-cache at your target context length. Sector88 memory tiering recovers most of this and lets you exceed the naive VRAM limit, especially for batched serving.

Tell us your hardware.

Send us your spec and the model you want to run. We will tell you honestly whether it fits, how fast it will run, and which tier applies to your configuration.

Talk to the team See benchmarks

Every GPU. Every CPU. Every edge box.