Fig. 5.0 · Performance

Measured throughput
on real hardware.

Sector88 delivers production inference from datacentre GPUs down to 25W edge accelerators.

Estimate for your hardware Benchmark on yours

[ Independently verified ]

43 tok/s on a $3K card.

Deployed at AI Sweden (the Swedish national centre for applied AI). A 20B parameter model running production inference on a single NVIDIA L4 with no tensor parallelism.

Model

gpt-oss-20b

20 billion parameters. Served via OpenAI-compatible API. Runtime selects the optimal engine automatically.

Hardware

1x NVIDIA L4

24GB VRAM. Single card, no NVLink, no rack-scale infrastructure. The kind of hardware you actually have.

Throughput

43 tok/s

Single-stream generation. Interactive speed. Measured and verified independently by the customer.

“It brings real value, especially for data scientists. Decreased setup time, a unified API, automatic GPU tuning. The runtime did what it claimed on our hardware without us rewriting anything.”

Laurian Lamba

Systems Architect · AI Sweden

[ Datacentre ]

Throughput at scale.

On datacentre GPUs, Sector88 Runtime delivers full-speed inference with continuous batching, automatic precision selection, and memory paging. These numbers reflect what you get out of the box.

vLLM engine

High-throughput continuous batching. Default for datacentre and modern NVIDIA GPUs.

Llama 3.1 8B

Hardware	Workload	Generation tok/s	Prefill tok/s
H100 80GB	Chat, concurrent	6,067	2,667
A100 80GB	Chat, concurrent	2,622	1,177
A100 40GB	Chat, concurrent	2,459	1,069
L4 24GB (FP8)	Chat, concurrent	1,800

Source: Microsoft Azure HPC benchmarks, Llama 3.1 8B (2025). Aggregate throughput across concurrent requests.

TensorRT-LLM engine

NVIDIA's optimized runtime. Selected automatically on Hopper and Blackwell when it outperforms vLLM for the workload.

Llama 3.3 70B FP8

Hardware	Sequence length	Generation tok/s
2x H200	128 in, 2048 out	7,467
2x H200	128 in, 128 out	6,328
2x H100	128 in, 128 out	6,092
2x H200	2048 in, 2048 out	3,776
2x H100	2048 in, 2048 out	2,786

Source: NVIDIA TensorRT-LLM performance overview (current release). FP8, tensor parallelism = 2.

[ Edge and constrained ]

Usable inference on 25 watts.

Most platforms require a datacentre. Sector88 delivers interactive-speed AI on Jetson modules, CPU-only servers, and hardware that was never designed for inference.

llama.cpp engine

Quantized models on edge hardware and CPUs. Selected automatically when the box has no high-end GPU.

Hardware	Model / Quantization	Generation tok/s
Jetson Orin AGX 64GB	Llama 3 8B, INT4	36
Jetson Orin NX 16GB (25W)	Llama 3.2 1B, Q8_0	17-20
Jetson Orin Nano 8GB	Llama 3.1 8B, Q4	14
Jetson Orin NX 16GB (25W)	Llama 3.2 3B, Q4_K_M	10-12
CPU server (x86, 32 cores)	8B, Q4_0	41-122

Sources: NVIDIA Jetson AI Lab, Advantech edge benchmarks, llama.cpp community (PowerPC FP16 MMA path).

[ Memory orchestration ]

Serve models larger than your GPU.

Traditional KV-cache management wastes 60 to 80% of GPU memory through fragmentation. Sector88 pages that memory and tiers it across VRAM, RAM, and NVMe, so you can serve models that exceed your VRAM without crashing.

Throughput vs naive serving

24x

Compared to HuggingFace Transformers. 2 to 3.5x improvement over HuggingFace TGI.

Memory recovered

60-80%

GPU memory previously lost to KV-cache fragmentation, recovered through paging and reused for serving.

Memory hierarchy

3 tiers

VRAM for hot pages, system RAM for warm, NVMe for cold. Graceful spill, no crashes, no manual configuration.

Based on PagedAttention (Kwon et al., SOSP 2023, arXiv:2309.06180). Sector88 implements and extends this across the full storage hierarchy.

[ Hardware ]

Where we deliver these numbers.

Runtime probes the hardware on first start, selects the optimal engine, and configures memory tiering automatically. You do not choose the engine. You get the fastest one for your box.

Tier	Hardware	Engine selected	Model capacity
Datacentre	H100, H200, A100 80GB, MI300X	vLLM or TensorRT-LLM	70B+ full precision, 405B with tensor parallelism
Inference card	L4, L40S, A10, RTX 4090	vLLM	Up to 70B quantized with paging, 8-20B full precision
Workstation	RTX 30/40 series, single-card	vLLM or llama.cpp	8-13B comfortably, 70B quantized with tiering
Edge	Jetson Orin AGX, NX, Nano	llama.cpp or TRT-LLM	1B to 8B quantized, multimodal where supported
CPU-only	x86 Xeon, ARM, PowerPC	llama.cpp	3-8B quantized at interactive speed

Full hardware compatibility matrix

[ What we add ]

What makes it production-ready.

Throughput is the floor. These are the things that keep inference stable under real-world load.

Automatic engine selection

Probes hardware on boot. Picks the fastest engine for your card. No configuration required.

Memory tiering

Pages KV-cache across VRAM, RAM, and NVMe. Models that exceed your VRAM still serve without crashing.

OOM protection

Refuses configurations that would crash. Degrades gracefully under load instead of falling over.

OpenAI-compatible API

One endpoint regardless of engine or hardware. Your application code never changes.

[ Models ]

Supported models.

If it runs in vLLM, llama.cpp, or TensorRT-LLM, it runs in Sector88. Bring your own fine-tuned weights. We handle the rest.

Llama 3.1 / 3.2 / 3.3

1B to 405B. Most deployed model family across our customer base.

Mixtral 8x22B / 8x7B

Mixture-of-experts. High quality at efficient compute cost.

Qwen 2.5

Strong on coding and mathematics. Multiple size variants.

Phi-4

14B. Efficient for structured tasks on smaller hardware.

DeepSeek V3 / R1

Reasoning-focused. Permissive licence.

Gemma 2 / 3

Google. Efficient open-weights with broad support.

Custom fine-tunes

GGUF, SafeTensors, HuggingFace format. Bring your own weights.

Embeddings and classifiers

BGE, E5, Jina, MiniLM, or your private encoder. Same install path.

Something else?

If the engine supports it, we run it. Talk to us about your specific model.

Benchmark it on your hardware.

Tell us your hardware, your model, and your workload. We will run Sector88 in your environment and show you exactly what it delivers.

Talk to the team Try the calculator

Measured throughput on real hardware.