Fig. 5.0 · Performance

Measured throughput on real hardware.

Sector88 delivers production inference from datacentre GPUs down to 25W edge accelerators.

[ Independently verified ]

43 tok/s on a $3K card.

Deployed at AI Sweden (the Swedish national centre for applied AI). A 20B parameter model running production inference on a single NVIDIA L4 with no tensor parallelism.

Model

gpt-oss-20b

20 billion parameters. Served via OpenAI-compatible API. Runtime selects the optimal engine automatically.

Hardware

1x NVIDIA L4

24GB VRAM. Single card, no NVLink, no rack-scale infrastructure. The kind of hardware you actually have.

Throughput

43 tok/s

Single-stream generation. Interactive speed. Measured and verified independently by the customer.

“It brings real value, especially for data scientists. Decreased setup time, a unified API, automatic GPU tuning. The runtime did what it claimed on our hardware without us rewriting anything.”

Laurian Lamba

Systems Architect · AI Sweden

[ Datacentre ]

Throughput at scale.

On datacentre GPUs, Sector88 Runtime delivers full-speed inference with continuous batching, automatic precision selection, and memory paging. These numbers reflect what you get out of the box.

Runtime architecture diagram

vLLM engine

High-throughput continuous batching. Default for datacentre and modern NVIDIA GPUs.

Llama 3.1 8B

Hardware Workload Generation tok/s
H100 80GBChat, concurrent6,067
A100 80GBChat, concurrent2,622
A100 40GBChat, concurrent2,459
L4 24GB (FP8)Chat, concurrent1,800

Source: Microsoft Azure HPC benchmarks, Llama 3.1 8B (2025). Aggregate throughput across concurrent requests.

TensorRT-LLM engine

NVIDIA's optimized runtime. Selected automatically on Hopper and Blackwell when it outperforms vLLM for the workload.

Llama 3.3 70B FP8

Hardware Sequence length Generation tok/s
2x H200128 in, 2048 out7,467
2x H200128 in, 128 out6,328
2x H100128 in, 128 out6,092
2x H2002048 in, 2048 out3,776
2x H1002048 in, 2048 out2,786

Source: NVIDIA TensorRT-LLM performance overview (current release). FP8, tensor parallelism = 2.

[ Edge and constrained ]

Usable inference on 25 watts.

Most platforms require a datacentre. Sector88 delivers interactive-speed AI on Jetson modules, CPU-only servers, and hardware that was never designed for inference.

Memory orchestration diagram

llama.cpp engine

Quantized models on edge hardware and CPUs. Selected automatically when the box has no high-end GPU.

Hardware Model / Quantization Generation tok/s
Jetson Orin AGX 64GBLlama 3 8B, INT436
Jetson Orin NX 16GB (25W)Llama 3.2 1B, Q8_017-20
Jetson Orin Nano 8GBLlama 3.1 8B, Q414
Jetson Orin NX 16GB (25W)Llama 3.2 3B, Q4_K_M10-12
CPU server (x86, 32 cores)8B, Q4_041-122

Sources: NVIDIA Jetson AI Lab, Advantech edge benchmarks, llama.cpp community (PowerPC FP16 MMA path).

[ Memory orchestration ]

Serve models larger than your GPU.

Traditional KV-cache management wastes 60 to 80% of GPU memory through fragmentation. Sector88 pages that memory and tiers it across VRAM, RAM, and NVMe, so you can serve models that exceed your VRAM without crashing.

Memory tier diagram

Throughput vs naive serving

24x

Compared to HuggingFace Transformers. 2 to 3.5x improvement over HuggingFace TGI.

Memory recovered

60-80%

GPU memory previously lost to KV-cache fragmentation, recovered through paging and reused for serving.

Memory hierarchy

3 tiers

VRAM for hot pages, system RAM for warm, NVMe for cold. Graceful spill, no crashes, no manual configuration.

Based on PagedAttention (Kwon et al., SOSP 2023, arXiv:2309.06180). Sector88 implements and extends this across the full storage hierarchy.

[ Hardware ]

Where we deliver these numbers.

Runtime probes the hardware on first start, selects the optimal engine, and configures memory tiering automatically. You do not choose the engine. You get the fastest one for your box.

Deployment environments diagram
Tier Hardware Engine selected Model capacity
Datacentre H100, H200, A100 80GB, MI300X vLLM or TensorRT-LLM 70B+ full precision, 405B with tensor parallelism
Inference card L4, L40S, A10, RTX 4090 vLLM Up to 70B quantized with paging, 8-20B full precision
Workstation RTX 30/40 series, single-card vLLM or llama.cpp 8-13B comfortably, 70B quantized with tiering
Edge Jetson Orin AGX, NX, Nano llama.cpp or TRT-LLM 1B to 8B quantized, multimodal where supported
CPU-only x86 Xeon, ARM, PowerPC llama.cpp 3-8B quantized at interactive speed

[ What we add ]

What makes it production-ready.

Throughput is the floor. These are the things that keep inference stable under real-world load.

API compatibility diagram

Automatic engine selection

Probes hardware on boot. Picks the fastest engine for your card. No configuration required.

Memory tiering

Pages KV-cache across VRAM, RAM, and NVMe. Models that exceed your VRAM still serve without crashing.

OOM protection

Refuses configurations that would crash. Degrades gracefully under load instead of falling over.

OpenAI-compatible API

One endpoint regardless of engine or hardware. Your application code never changes.

[ Models ]

Supported models.

If it runs in vLLM, llama.cpp, or TensorRT-LLM, it runs in Sector88. Bring your own fine-tuned weights. We handle the rest.

Llama 3.1 / 3.2 / 3.3

1B to 405B. Most deployed model family across our customer base.

Mixtral 8x22B / 8x7B

Mixture-of-experts. High quality at efficient compute cost.

Qwen 2.5

Strong on coding and mathematics. Multiple size variants.

Phi-4

14B. Efficient for structured tasks on smaller hardware.

DeepSeek V3 / R1

Reasoning-focused. Permissive licence.

Gemma 2 / 3

Google. Efficient open-weights with broad support.

Custom fine-tunes

GGUF, SafeTensors, HuggingFace format. Bring your own weights.

Embeddings and classifiers

BGE, E5, Jina, MiniLM, or your private encoder. Same install path.

Something else?

If the engine supports it, we run it. Talk to us about your specific model.

Benchmark it on your hardware.

Tell us your hardware, your model, and your workload. We will run Sector88 in your environment and show you exactly what it delivers.