Fig. 3.0 · Technology
Memory orchestration
across every tier.
The orchestration layer the open-source ecosystem doesn't ship. We build the API, runtime, and memory tier. We run any inference engine underneath.
[ Memory ]
Tiered memory orchestration.
The problem is not that models are too big. The problem is that inference tools assume everything fits in one place.
We track every tensor's access pattern. Hot weights stay in GPU VRAM. Warm weights move to system RAM. Cold weights sit on NVMe until called. The cache is tiered the same way.
This is not quantization. Quantization reduces precision. We reduce residency. A 70B parameter model at FP16 needs 140GB of VRAM. With tiered orchestration, it runs on 24GB.
The trade-off is latency, not accuracy. First-token latency increases when a layer needs to be pulled from disk. But once warm, throughput is governed by the hardware you have, not the hardware you wish you had.
[ Stack ]
The whole stack. One vendor.
Everything from the API your application calls to how weights move through memory. The only thing you bring is the metal.
Fig 3.2 · Stack
API
openai-compatibleDrop-in for the OpenAI endpoint.
Runtime
orchestrationProbes the box, picks the engine, serves traffic.
Memory orchestration
Weights tiered across VRAM, RAM, and disk.
Engine
swappableOpen-source engines, picked per workload.
Compute
your metalGPU, CPU, TPU, or mixed.
One vendor
API, runtime, memory tier, engine integration. All ours.
Open underneath
We integrate the best engines. You inherit upgrades.
Bring your metal
The only piece we do not own is the hardware.
[ Comparison ]
Why not just use open-source tools?
You can. But each solves one problem well and leaves the rest to you.
llama.cpp alone
Excellent for single-node CPU inference. No fleet management, no memory tiering beyond mmap, no hot-swap. You build the orchestration layer yourself.
vLLM alone
Best throughput on NVIDIA GPUs with PagedAttention. No CPU fallback, no disk tiering, no offline operation, no fleet control plane. You run one model per server.
TensorRT-LLM alone
Fastest on NVIDIA when you have time to compile. No dynamic engine switching, no mixed hardware, no edge deployment. You optimize once per GPU type.
See it run on your hardware.
Bring your stack. We will benchmark Sector88 against it on the boxes you already own.