Memory orchestration.
How Sector88 Runtime runs 70B parameter models on 24GB VRAM without crashing — and why quantization alone is not enough.
[ Overview ]
A 70B parameter model at FP16 requires ~140GB of memory. Most edge GPUs have 24GB or less. The standard industry response is quantization: compress the model to INT8 or INT4 and hope the quality loss is acceptable. We take a different view.
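That arithmetic, made explicit in a quick sketch (parameter count from above; the byte widths are the standard ones for each precision):

```python
# Weight footprint of a 70B parameter model at common precisions.
# Weights only: KV cache and activations come on top of this.
PARAMS = 70e9
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, width in BYTES_PER_PARAM.items():
    gb = PARAMS * width / 1e9
    verdict = "fits" if gb <= 24 else "exceeds"
    print(f"{precision}: {gb:.0f} GB of weights ({verdict} a 24 GB card)")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```

Note that even INT4 weights alone exceed a 24GB card before a single KV cache entry is accounted for.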
Sector88 Runtime treats memory as a hierarchy — VRAM, system RAM, and NVMe storage — and orchestrates model layers across all three tiers dynamically. The result is that a 70B model can serve inference on a 24GB GPU with latency that is usable for real workloads, not just demos.
[ Key Numbers ]
- 70B model on 24GB VRAM
- <150ms per token (typical)
- 3-tier memory orchestration
- Zero quantization required
[ Architecture ]
Layer-aware scheduling
Most inference engines load all of a model's weights into GPU memory and fail when they do not fit. Runtime splits the model into its transformer layers and loads only the active layer into VRAM. The next layer is prefetched from RAM or NVMe while the current layer computes. This is not paging: it is predictable, deterministic scheduling based on the model's known forward-pass graph.
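A minimal sketch of the idea in PyTorch terms, assuming a double-buffered pair of GPU layer slots and pinned host weights (the function and variable names are illustrative, not the Runtime API):

```python
import torch

def stream_forward(cpu_layers, slots, x):
    """Layer-streamed forward pass (illustrative sketch; these names
    are assumptions, not the Sector88 Runtime API).

    cpu_layers: per-layer state dicts held in pinned host memory.
    slots: two identical GPU layer modules used as a double buffer.
    """
    copy_stream = torch.cuda.Stream()
    ready = [torch.cuda.Event(), torch.cuda.Event()]  # H2D copy finished
    freed = [torch.cuda.Event(), torch.cuda.Event()]  # slot's compute finished

    def prefetch(i):
        slot = slots[i % 2]
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_event(freed[i % 2])  # don't clobber a busy slot
            for name, p in slot.named_parameters():
                # Async H2D copy; pinned source memory lets it overlap compute.
                p.data.copy_(cpu_layers[i][name], non_blocking=True)
            ready[i % 2].record(copy_stream)

    prefetch(0)
    for i in range(len(cpu_layers)):
        if i + 1 < len(cpu_layers):
            prefetch(i + 1)  # next layer's transfer overlaps this compute
        torch.cuda.current_stream().wait_event(ready[i % 2])
        x = slots[i % 2](x)    # run the active layer from VRAM
        freed[i % 2].record()  # slot is reusable once this compute finishes
    return x
```

Because the forward-pass order is known ahead of time, every transfer can be issued exactly one layer early; the compute stream never stalls on storage unless a transfer genuinely overruns the compute window.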
Tiered memory pool
VRAM holds the current layer and the hot KV cache. System RAM holds the next N layers and KV cache entries offloaded from VRAM. NVMe holds the full model weights and cold-start state. Runtime monitors transfer bandwidth between tiers and adjusts prefetch depth based on observed throughput. On a PCIe 4.0 NVMe, layer transfer from disk to VRAM takes ~80ms, well within the inter-token window for many workloads.
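One way the adaptive prefetch depth could fall out of those measurements, as a minimal sketch (`prefetch_depth` and its clamp are assumptions; the ~80ms and <150ms figures come from this page):

```python
def prefetch_depth(transfer_ms, token_budget_ms, max_depth=8):
    """Layers to stage ahead so each transfer hides behind compute:
    ceil(transfer time / inter-token window), clamped to a maximum."""
    depth = -(-transfer_ms // token_budget_ms)  # ceil division
    return int(max(1, min(depth, max_depth)))

# ~80ms per layer from NVMe against a <150ms inter-token window:
print(prefetch_depth(transfer_ms=80, token_budget_ms=150))   # -> 1
# A slower tier (say 300ms per layer) needs a deeper pipeline:
print(prefetch_depth(transfer_ms=300, token_budget_ms=150))  # -> 2
```

In a sketch like this, `transfer_ms` would be a moving average of observed transfers rather than a constant, which is what lets the depth track real throughput instead of a spec sheet.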
KV cache offloading
For long-context inference, the KV cache grows linearly with sequence length and can exceed VRAM capacity before model weights do. Runtime offloads the oldest KV cache entries to system RAM using a ring buffer. Cache eviction is based on attention entropy — low-entropy heads (those that have stabilized) are evicted first, preserving the most information-dense state in fast memory.
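One way entropy-guided eviction into a host-side ring buffer could look, sketched in PyTorch (tensor layouts and names are assumptions, not Runtime's actual cache structures):

```python
import torch

def head_entropy(attn_probs):
    """Mean Shannon entropy per head from softmaxed attention weights
    (attn_probs: [heads, q_len, k_len]). Low entropy suggests a head
    has stabilized, so its cold KV entries are cheapest to demote."""
    p = attn_probs.clamp_min(1e-9)
    return -(p * p.log()).sum(dim=-1).mean(dim=-1)  # [heads]

class KVRing:
    """Ring buffer in pinned host RAM for KV entries evicted from VRAM."""
    def __init__(self, slots, head_dim):
        self.buf = torch.empty(slots, 2, head_dim, pin_memory=True)
        self.write = 0  # oldest slot is overwritten first

    def push(self, kv_entry):  # kv_entry: [2, head_dim] on GPU
        self.buf[self.write].copy_(kv_entry, non_blocking=True)
        self.write = (self.write + 1) % self.buf.shape[0]

def evict_oldest(kv_cache, attn_probs, ring, n_heads_to_evict):
    """kv_cache: [heads, 2, seq, head_dim] on GPU. Demote the oldest
    entry of the lowest-entropy heads to the host-side ring."""
    order = head_entropy(attn_probs).argsort()  # ascending entropy
    for h in order[:n_heads_to_evict].tolist():
        ring.push(kv_cache[h, :, 0])
```

A real cache would track the "oldest entry" with per-head offsets rather than tensor slicing, but the eviction order is the point: stabilized heads give up fast memory first.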
Why not just quantize?
Quantization is valid and we support it. But it is a quality trade-off, not an architectural solution. Even at INT4, a 70B model's weights occupy ~35GB; fitting into 24GB takes still more aggressive compression, and reasoning quality degrades measurably on coding and mathematical tasks well before that point. Our approach preserves full FP16 precision for the active layer while using quantization only for cold tiers if the operator chooses. The default is unquantized, tiered orchestration.
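As an illustration of that operator choice, a hypothetical configuration sketch (every key and value name here is invented; this is not Runtime's actual config schema):

```python
# Hypothetical per-tier precision config; all keys are invented for
# illustration and do not reflect Runtime's real schema.
runtime_config = {
    "model": "llama-70b",
    "tiers": {
        "vram": {"precision": "fp16"},                     # active layer: full precision
        "ram":  {"precision": "fp16", "layers_ahead": 4},  # prefetch window
        "nvme": {"precision": "int8"},                     # optional cold-tier compression
    },
    "kv_cache": {"offload": "ram", "eviction": "attention_entropy"},
}
```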
See it on your hardware.
We will benchmark your target model on your actual GPU and show you the latency numbers before you commit.