Fig. 5.0 · Performance
Measured throughput
on real hardware.
Sector88 delivers production inference from datacentre GPUs down to 25W edge accelerators.
[ Independently verified ]
43 tok/s on a $3K card.
Deployed at AI Sweden (the Swedish national centre for applied AI). A 20B parameter model running production inference on a single NVIDIA L4 with no tensor parallelism.
Model
gpt-oss-20b
20 billion parameters. Served via OpenAI-compatible API. Runtime selects the optimal engine automatically.
Hardware
1x NVIDIA L4
24GB VRAM. Single card, no NVLink, no rack-scale infrastructure. The kind of hardware you actually have.
Throughput
43 tok/s
Single-stream generation. Interactive speed. Measured and verified independently by the customer.
“It brings real value, especially for data scientists. Decreased setup time, a unified API, automatic GPU tuning. The runtime did what it claimed on our hardware without us rewriting anything.”
Laurian Lamba
Systems Architect · AI Sweden
[ Datacentre ]
Throughput at scale.
On datacentre GPUs, Sector88 Runtime delivers full-speed inference with continuous batching, automatic precision selection, and memory paging. These numbers reflect what you get out of the box.
vLLM engine
High-throughput continuous batching. Default for datacentre and modern NVIDIA GPUs.
Llama 3.1 8B
| Hardware | Workload | Generation tok/s |
|---|---|---|
| H100 80GB | Chat, concurrent | 6,067 |
| A100 80GB | Chat, concurrent | 2,622 |
| A100 40GB | Chat, concurrent | 2,459 |
| L4 24GB (FP8) | Chat, concurrent | 1,800 |
Source: Microsoft Azure HPC benchmarks, Llama 3.1 8B (2025). Aggregate throughput across concurrent requests.
TensorRT-LLM engine
NVIDIA's optimized runtime. Selected automatically on Hopper and Blackwell when it outperforms vLLM for the workload.
Llama 3.3 70B FP8
| Hardware | Sequence length | Generation tok/s |
|---|---|---|
| 2x H200 | 128 in, 2048 out | 7,467 |
| 2x H200 | 128 in, 128 out | 6,328 |
| 2x H100 | 128 in, 128 out | 6,092 |
| 2x H200 | 2048 in, 2048 out | 3,776 |
| 2x H100 | 2048 in, 2048 out | 2,786 |
Source: NVIDIA TensorRT-LLM performance overview (current release). FP8, tensor parallelism = 2.
[ Edge and constrained ]
Usable inference on 25 watts.
Most platforms require a datacentre. Sector88 delivers interactive-speed AI on Jetson modules, CPU-only servers, and hardware that was never designed for inference.
llama.cpp engine
Quantized models on edge hardware and CPUs. Selected automatically when the box has no high-end GPU.
| Hardware | Model / Quantization | Generation tok/s |
|---|---|---|
| Jetson Orin AGX 64GB | Llama 3 8B, INT4 | 36 |
| Jetson Orin NX 16GB (25W) | Llama 3.2 1B, Q8_0 | 17-20 |
| Jetson Orin Nano 8GB | Llama 3.1 8B, Q4 | 14 |
| Jetson Orin NX 16GB (25W) | Llama 3.2 3B, Q4_K_M | 10-12 |
| CPU server (x86, 32 cores) | 8B, Q4_0 | 41-122 |
Sources: NVIDIA Jetson AI Lab, Advantech edge benchmarks, llama.cpp community (PowerPC FP16 MMA path).
[ Memory orchestration ]
Serve models larger than your GPU.
Traditional KV-cache management wastes 60 to 80% of GPU memory through fragmentation. Sector88 pages that memory and tiers it across VRAM, RAM, and NVMe, so you can serve models that exceed your VRAM without crashing.
Throughput vs naive serving
24x
Compared to HuggingFace Transformers. 2 to 3.5x improvement over HuggingFace TGI.
Memory recovered
60-80%
GPU memory previously lost to KV-cache fragmentation, recovered through paging and reused for serving.
Memory hierarchy
3 tiers
VRAM for hot pages, system RAM for warm, NVMe for cold. Graceful spill, no crashes, no manual configuration.
Based on PagedAttention (Kwon et al., SOSP 2023, arXiv:2309.06180). Sector88 implements and extends this across the full storage hierarchy.
[ Hardware ]
Where we deliver these numbers.
Runtime probes the hardware on first start, selects the optimal engine, and configures memory tiering automatically. You do not choose the engine. You get the fastest one for your box.
| Tier | Hardware | Engine selected | Model capacity |
|---|---|---|---|
| Datacentre | H100, H200, A100 80GB, MI300X | vLLM or TensorRT-LLM | 70B+ full precision, 405B with tensor parallelism |
| Inference card | L4, L40S, A10, RTX 4090 | vLLM | Up to 70B quantized with paging, 8-20B full precision |
| Workstation | RTX 30/40 series, single-card | vLLM or llama.cpp | 8-13B comfortably, 70B quantized with tiering |
| Edge | Jetson Orin AGX, NX, Nano | llama.cpp or TRT-LLM | 1B to 8B quantized, multimodal where supported |
| CPU-only | x86 Xeon, ARM, PowerPC | llama.cpp | 3-8B quantized at interactive speed |
[ What we add ]
What makes it production-ready.
Throughput is the floor. These are the things that keep inference stable under real-world load.
Automatic engine selection
Probes hardware on boot. Picks the fastest engine for your card. No configuration required.
Memory tiering
Pages KV-cache across VRAM, RAM, and NVMe. Models that exceed your VRAM still serve without crashing.
OOM protection
Refuses configurations that would crash. Degrades gracefully under load instead of falling over.
OpenAI-compatible API
One endpoint regardless of engine or hardware. Your application code never changes.
[ Models ]
Supported models.
If it runs in vLLM, llama.cpp, or TensorRT-LLM, it runs in Sector88. Bring your own fine-tuned weights. We handle the rest.
Llama 3.1 / 3.2 / 3.3
1B to 405B. Most deployed model family across our customer base.
Mixtral 8x22B / 8x7B
Mixture-of-experts. High quality at efficient compute cost.
Qwen 2.5
Strong on coding and mathematics. Multiple size variants.
Phi-4
14B. Efficient for structured tasks on smaller hardware.
DeepSeek V3 / R1
Reasoning-focused. Permissive licence.
Gemma 2 / 3
Google. Efficient open-weights with broad support.
Custom fine-tunes
GGUF, SafeTensors, HuggingFace format. Bring your own weights.
Embeddings and classifiers
BGE, E5, Jina, MiniLM, or your private encoder. Same install path.
Something else?
If the engine supports it, we run it. Talk to us about your specific model.
Benchmark it on your hardware.
Tell us your hardware, your model, and your workload. We will run Sector88 in your environment and show you exactly what it delivers.