[ Runtime ]
Runtime manages the model on your hardware.
Probes the box, picks the engine, tiers memory across what you have, and serves an OpenAI-compatible API. One install. Any hardware.
Llama-3-70B-Q4_K_M
Backend Selection
AutoMemory Hierarchy
PASSVRAM (Tier 1)
16.8 / 24 GB
RAM (Tier 2)
42.3 / 64 GB
SSD Cache (Tier 3)
128 / 512 GB
Serving
localhost:8088/v1/chat/completions Throughput
7.8 tok/s
Latency
118 ms
OOM Events
0
Uptime
0s
[ Capabilities ]
What Runtime does.
Fig 2.1
Tiered memory orchestration
Model weights and cache move across GPU, CPU, and disk. Large models run on hardware that would not normally hold them.
Fig 2.2
Engine auto-selection
Runtime wraps llama.cpp, vLLM, TensorRT-LLM. You pick the model. Runtime picks the engine. When a faster one arrives, you inherit it.
Fig 2.3
OpenAI-compatible API
Drop-in for the OpenAI endpoint. Point existing software at a local URL. Embeddings, classification, tool-calls, and chat.
Fig 2.4
Offline by default
Zero outbound calls. No license pings. No phone-home. Install over any medium, run on an empty network.
Fig 2.5
Hardware probing
Runtime inspects the box on first start and configures itself. A CPU-only field server, a single Jetson, or a rack of H100s. Same install.
Fig 2.6
Preflight validation
Runtime validates the model fits before loading. No OOM crashes. No manual calculation. It checks your hardware and tells you what works.
[ Comparison ]
Runtime vs open-source tools.
The open-source ecosystem ships great inference engines, not platforms. Runtime is the orchestration layer around them: probing, picking, tiering memory, and serving an OpenAI-compatible API.
Feature
Sector88
Typical OSS
Memory orchestration
GPU → CPU → disk, automatic
Single tier only
Engine support
llama.cpp, vLLM, TensorRT-LLM
One engine per tool
Hardware config
Auto-probes and adapts
Manual tuning required
API compatibility
OpenAI-compatible
Varies by engine
Offline operation
Zero outbound calls
Often requires cloud
Hardware Agnostic
Any GPU, any backend, any model, anywhere.
Hardware Platforms
Inference Backends
Run models on hardware you already have.
One install. GPU, CPU, TPU, or mixed. We meet the box where it is.