Every founder has a moment where they realize the thing they need doesn’t exist. For me, it happened at 3 AM, debugging GPU memory crashes in production ML systems. (The glamorous founder life you read about in TechCrunch.)
I wasn’t building Sector88 yet. I was just trying to keep inference running.
The Problem Kept Repeating
Before Sector88, I spent over a decade building and scaling technical infrastructure across biotech, spacetech, medtech, and deep tech startups. I’ve designed data systems that process petabytes, built ML pipelines that run 24/7, and scaled infrastructure from prototype to global deployment.
But every time I deployed LLMs in production, the same problem emerged: GPU memory management breaks in ways traditional infrastructure never does.
The specific issue that crystallized everything was watching Llama 2 crash repeatedly after running fine for hours. We’d tune n_gpu_layers conservatively. Monitor VRAM usage. Set allocation parameters carefully. Still crashed.
The root cause: GPU memory isn’t just model weights. It’s KV cache that grows with context. Attention buffers. Activation tensors. Memory fragmentation from repeated allocations. The static n_gpu_layers parameter assumes memory usage is predictable. It’s not.
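The KV-cache part of that growth is easy to quantify. A minimal back-of-the-envelope sketch (assuming a standard transformer layout and fp16 storage; the shape numbers below are a Llama-2-7B-style example, not measurements from S88):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Rough KV-cache footprint: a key and a value vector per token,
    per KV head, per layer, stored in fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Llama-2-7B-style shape: 32 layers, 32 KV heads, head dim 128.
per_4k = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)
print(f"{per_4k / 2**30:.1f} GiB")  # 2.0 GiB at 4k context -- on top of the weights
```

Two gigabytes of cache at full context, before counting activation tensors or fragmentation, is exactly the headroom a static n_gpu_layers setting silently assumes away.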
This wasn’t a configuration problem. This was a fundamental infrastructure problem.
Why This Matters
The organizations that hit this problem aren’t edge cases. They’re massive sectors where cloud APIs aren’t viable. (Yes, sectors. The name writes itself.)
Defense and government facilities where data physically cannot leave the perimeter. Air-gapped environments where internet connectivity is prohibited by design. These aren't theoretical constraints; they're operational requirements.
Healthcare and life sciences where regulatory compliance means patient data sovereignty is non-negotiable. HIPAA, GDPR, data residency laws. Cloud APIs introduce legal risk many organizations won’t accept.
Industrial operations and energy at remote sites with no reliable connectivity. Offshore platforms, mining operations, research stations. Places where cloud infrastructure is physically impossible.
High-security research where pre-publication data, proprietary work, or competitive intelligence requires complete isolation. Air-gapped operation ensures zero leakage.
I kept seeing the same pattern: organizations with serious AI applications, constrained hardware, compliance requirements, and infrastructure that kept breaking. They needed LLM inference to work reliably, but the existing tools assumed unlimited resources or cloud deployment.
The Infrastructure Gap
From years building large-scale data systems and ML infrastructure, I knew this was a solvable problem. Operating systems do dynamic memory management. Databases do buffer pool optimization. Kubernetes does resource orchestration.
Why were we still manually calculating GPU layer allocations like it’s 2018?
The answer: nobody was building for real constraints. The infrastructure assumed elastic capacity, unlimited resources, or “just add more GPUs” solutions. (NVIDIA’s sales team loves this answer, but it doesn’t help when your procurement cycle is six months and your budget was approved last year.)
Nobody was solving the problem we actually had: making fixed hardware work reliably with unpredictable memory patterns.
I spent weeks looking for existing solutions. There weren’t any that worked for production deployments with hard constraints.
Building the Solution
The first prototype was hacky but revealing. Binary search on GPU layer configurations, real-time memory monitoring, dynamic adjustment. It was 200 lines of Python that shouldn’t have worked as well as it did.
But it proved the concept: treat GPU memory as a resource management problem, not a configuration problem. Let the system probe available memory, test configurations, monitor usage, and adapt dynamically.
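The core of that prototype can be sketched in a few lines. This is an illustrative reconstruction, not the shipped code: `fits` stands in for a hypothetical probe that loads the model with `n` layers on the GPU, runs a worst-case prompt, and reports whether it stayed under VRAM, and the search assumes the probe is monotonic (if n layers fit, fewer layers also fit):

```python
def max_gpu_layers(total_layers, fits):
    """Binary-search the largest n_gpu_layers for which fits(n) is True."""
    lo, hi = 0, total_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if fits(mid):
            lo = mid        # mid layers fit; try offloading more
        else:
            hi = mid - 1    # OOM at mid; back off
    return lo

# Simulated probe: pretend the GPU can hold at most 27 of 33 layers.
print(max_gpu_layers(33, lambda n: n <= 27))  # 27, found in ~6 probes
```

Six load attempts instead of dozens of manual guesses; the rest of the prototype was the monitoring loop that re-ran this search when runtime memory pressure changed.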
We built a small team of exceptional engineers. People who’ve worked deep in GPU optimization, inference engines (llama.cpp, vLLM), and systems reliability. Everyone writes code. Everyone owns systems. Small teams that ship fast.
We deployed early versions in production. Real workloads. Hard constraints. Air-gapped facilities. The kind of deployments where “just restart it” isn’t an option because the system needs to run for months without intervention.
What Production Taught Us
Running S88 in constrained environments revealed things you only learn from sustained production deployments:
Memory patterns are wildly unpredictable. The same model with identical context size uses different VRAM depending on content. Repeated tokens behave differently than diverse sequences. Static allocation fundamentally cannot work.
Edge cases reveal architecture issues. One deployment crashed every few days at seemingly random intervals. Memory fragmentation from specific model switch sequences. You only find these with real workloads running continuously.
Reliability matters more than peak performance. A system that’s 10% slower but never crashes is infinitely more valuable than one that’s fast but OOMs randomly. From scaling infrastructure globally, I learned that boring reliability wins.
Air-gapped deployment is its own discipline. Getting model weights into facilities without internet requires operational infrastructure. Physical transfer processes. Checksum validation. Version control for 100GB+ files. These details matter as much as inference performance.
The Validation
The moment we knew this was bigger than our internal problem: a researcher at a European institution reached out. They had GPUs. They had models. They had use cases. Nothing stayed running.
“Memory leaks, OOM crashes, restarts every few hours. We’ve spent weeks tuning parameters manually.”
We gave them an early build of S88. A week later: “We haven’t had a single crash. This is the first time we’ve run inference for more than 48 hours straight.”
That’s when I realized we weren’t just solving our problem. We were solving a fundamental infrastructure gap that affects every organization deploying LLMs with real constraints.
What Sector88 Is
S88 isn’t competing on raw speed. vLLM is faster when you have unlimited resources. TGI has better batching for cloud.
Sector88 solves a different problem: make LLM deployment work in constrained environments where nothing else does.
That means:
Auto-offload by default. The system figures out optimal GPU/CPU splits automatically. You shouldn’t need deep expertise in GPU architecture to deploy a model.
Air-gapped operation. No internet required. No telemetry. No phone-home licensing. Everything works completely offline because that’s what production deployments in secure environments require.
Enterprise hardening. License validation, audit trails, security controls. The operational features that matter when deploying in regulated industries.
Predictable stability. No memory leaks. No surprise crashes. No “worked yesterday, broken today.” Infrastructure should be boring enough that you forget it exists.
We’re building for deployments that need to run for months on a single GPU at a remote facility with no internet. That’s not a benchmark case. That’s production reality.
The Philosophy
There’s a principle I’ve learned from building infrastructure that scales: good infrastructure makes hard problems invisible.
You don’t think about your database’s page cache management. You don’t manually optimize your OS’s memory allocation. You don’t hand-tune Kubernetes resource limits for every pod.
Those systems handle complexity automatically so you can build on top of them. That’s what infrastructure should do.
That’s the goal for S88: inference infrastructure you don’t have to think about. Point at a model. It runs. It stays running. You build your application on top. The memory management, the optimization, the stability fade into the background.
What’s Next
We’re running production deployments with organizations across defense, healthcare, energy, research, and other regulated industries. Every deployment teaches us something new about constraints we didn’t anticipate.
The roadmap is driven by what real production needs:
Multi-GPU sharding for models that exceed single-GPU capacity. Llama-405B and other frontier models require this.
Intelligent quantization selection that balances memory constraints against quality requirements automatically.
Better tooling for air-gapped distribution. Secure model packaging, integrity validation, operational infrastructure for isolated facilities.
Deeper observability into memory patterns, inference behavior, and system health. The kind of monitoring that lets you sleep at night when critical systems are running.
Why We’re Building This
I’ve spent over 15 years building infrastructure that scales. I’ve worked across biotech, spacetech, medtech, and deep tech. I’ve designed systems that handle massive data, built ML pipelines that run continuously, and solved infrastructure problems from prototype to global deployment.
The organizations we’re building for aren’t niche. They span defense, healthcare, financial services, energy, research, and other regulated industries where data sovereignty, compliance, and reliability aren’t optional.
If you’re deploying LLMs in constrained environments, if you’re hitting memory crashes in production, if you’re tired of infrastructure that assumes unlimited resources, get in touch. Or connect with me on LinkedIn to discuss infrastructure challenges.
We’re building this for the deployments that matter: where data can’t leave, where hardware is fixed, where reliability is essential, and where “just use the cloud” isn’t an answer.
That’s why we built Sector88.