[ Runtime ]

Runtime manages the model on your hardware.

Probes the box, picks the engine, tiers memory across what you have, and serves an OpenAI-compatible API. One install. Any hardware.

Talk to the team How it works

s88 serve --model Llama-3-70B-Q4_K_M

INITIALIZING

Llama-3-70B-Q4_K_M

GGUF Q4_K_M 70B params

Detected

Backend Selection

Auto

llama.cpp vLLM TensorRT-LLM Triton

Memory Hierarchy

PASS

VRAM (Tier 1)

16.8 / 24 GB

RAM (Tier 2)

42.3 / 64 GB

SSD Cache (Tier 3)

128 / 512 GB

Serving

localhost:8088/v1/chat/completions

Throughput

7.8 tok/s

Latency

118 ms

OOM Events

Uptime

[ Capabilities ]

What Runtime does.

Fig 2.1

Tiered memory orchestration

Model weights and cache move across GPU, CPU, and disk. Large models run on hardware that would not normally hold them.

Fig 2.2

Engine auto-selection

Runtime wraps llama.cpp, vLLM, TensorRT-LLM. You pick the model. Runtime picks the engine. When a faster one arrives, you inherit it.

Fig 2.3

OpenAI-compatible API

Drop-in for the OpenAI endpoint. Point existing software at a local URL. Embeddings, classification, tool-calls, and chat.

Fig 2.4

Offline by default

Zero outbound calls. No license pings. No phone-home. Install over any medium, run on an empty network.

Fig 2.5

Hardware probing

Runtime inspects the box on first start and configures itself. A CPU-only field server, a single Jetson, or a rack of H100s. Same install.

Fig 2.6

Preflight validation

Runtime validates the model fits before loading. No OOM crashes. No manual calculation. It checks your hardware and tells you what works.

[ Comparison ]

Runtime vs open-source tools.

The open-source ecosystem ships great inference engines, not platforms. Runtime is the orchestration layer around them: probing, picking, tiering memory, and serving an OpenAI-compatible API.

Feature

Sector88

Typical OSS

Memory orchestration

GPU → CPU → disk, automatic

Single tier only

Engine support

llama.cpp, vLLM, TensorRT-LLM

One engine per tool

Hardware config

Auto-probes and adapts

Manual tuning required

API compatibility

OpenAI-compatible

Varies by engine

Offline operation

Zero outbound calls

Often requires cloud

Hardware Agnostic

Any GPU, any backend, any model, anywhere.

Hardware Platforms

NVIDIA CUDA

Popular

AMD ROCm

Intel Gaudi / Xeon

Google TPU

Qualcomm AI

Apple Silicon

CPU Servers

View all supported hardware →

Inference Backends

PyTorch Supported

Native inference

vLLM Supported

PagedAttention optimization

llama.cpp Supported

GGUF models, CPU/GPU

TensorRT-LLM Roadmap

NVIDIA optimization

Triton Roadmap

NVIDIA inference server

Ollama Roadmap

Developer tooling

View all supported backends →

Run models on hardware you already have.

One install. GPU, CPU, TPU, or mixed. We meet the box where it is.

Talk to the team