[ Runtime ]

Runtime manages the model on your hardware.

Probes the box, picks the engine, tiers memory across what you have, and serves an OpenAI-compatible API. One install. Any hardware.

s88 serve --model Llama-3-70B-Q4_K_M
INITIALIZING

Llama-3-70B-Q4_K_M

GGUF Q4_K_M 70B params
Detected

Backend Selection

Auto
llama.cpp vLLM TensorRT-LLM Triton

Memory Hierarchy

PASS

VRAM (Tier 1)

16.8 / 24 GB

RAM (Tier 2)

42.3 / 64 GB

SSD Cache (Tier 3)

128 / 512 GB

Serving

localhost:8088/v1/chat/completions

Throughput

7.8 tok/s

Latency

118 ms

OOM Events

0

Uptime

0s

[ Capabilities ]

What Runtime does.

Fig 2.1

Tiered memory orchestration

Tiered memory orchestration

Model weights and cache move across GPU, CPU, and disk. Large models run on hardware that would not normally hold them.

Fig 2.2

Engine auto-selection

Engine auto-selection

Runtime wraps llama.cpp, vLLM, TensorRT-LLM. You pick the model. Runtime picks the engine. When a faster one arrives, you inherit it.

Fig 2.3

OpenAI-compatible API

OpenAI-compatible API

Drop-in for the OpenAI endpoint. Point existing software at a local URL. Embeddings, classification, tool-calls, and chat.

Fig 2.4

Offline by default

Offline by default

Zero outbound calls. No license pings. No phone-home. Install over any medium, run on an empty network.

Fig 2.5

Hardware probing

Hardware probing

Runtime inspects the box on first start and configures itself. A CPU-only field server, a single Jetson, or a rack of H100s. Same install.

Fig 2.6

Preflight validation

Preflight validation

Runtime validates the model fits before loading. No OOM crashes. No manual calculation. It checks your hardware and tells you what works.

[ Comparison ]

Runtime vs open-source tools.

The open-source ecosystem ships great inference engines, not platforms. Runtime is the orchestration layer around them: probing, picking, tiering memory, and serving an OpenAI-compatible API.

Feature

Sector88

Typical OSS

Memory orchestration

GPU → CPU → disk, automatic

Single tier only

Engine support

llama.cpp, vLLM, TensorRT-LLM

One engine per tool

Hardware config

Auto-probes and adapts

Manual tuning required

API compatibility

OpenAI-compatible

Varies by engine

Offline operation

Zero outbound calls

Often requires cloud

Hardware Agnostic

Any GPU, any backend, any model, anywhere.

Hardware Platforms

NVIDIA CUDA
Popular
AMD ROCm
Intel Gaudi / Xeon
Google TPU
Qualcomm AI
Apple Silicon
CPU Servers

Inference Backends

PyTorch Supported
Native inference
vLLM Supported
PagedAttention optimization
llama.cpp Supported
GGUF models, CPU/GPU
TensorRT-LLM Roadmap
NVIDIA optimization
Triton Roadmap
NVIDIA inference server
Ollama Roadmap
Developer tooling

Run models on hardware you already have.

One install. GPU, CPU, TPU, or mixed. We meet the box where it is.