
The Cost of Being "Always On"

High availability is a technical requirement, but it shouldn't be a lifestyle. This is how I built Link11 to protect the internet without burning out the people who run it, and the philosophy of sustainable scale behind it.

The Uptime Paradox

In cybersecurity, "always on" isn't aspirational—it's contractual. Our customers depend on us to keep their services online 24/7/365. A DDoS attack doesn't wait for business hours. When Link11's mitigation infrastructure goes down, millions of users feel it.

So we chase nines. 99.9% uptime. 99.99%. The holy grail of 99.999%—five nines, translating to just 5.26 minutes of downtime per year.
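The arithmetic behind those nines is easy to check. Here is a minimal sketch; the function name is just for illustration, and it assumes a 365.25-day year, which is the convention that produces the 5.26-minute figure.

```python
# Downtime budget per year for a given availability target.
# Assumption: a 365.25-day year (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the allowed downtime, in minutes per year, for a target like 99.99."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

Each extra nine divides the budget by ten: roughly 526 minutes a year at three nines, about 53 at four, and 5.26 at five.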

But here's what nobody tells you: each additional nine cuts the downtime budget tenfold, and the cost of meeting it climbs steeply, not just in infrastructure but in human resilience.

When High Availability Becomes Low Sustainability

In my first decade running Link11, I wore the pager like a badge of honor. Middle-of-the-night alerts? Part of the job. Weekend escalations? That's what founders do. I thought I was modeling dedication.

What I was actually modeling was burnout as a feature, not a bug.

The warning signs were subtle at first:

The cruel irony? Exhausted engineers make more mistakes. The very culture designed to maximize uptime was introducing new failure modes through human error.

The Realization: Systems Can Be Resilient Without People Being Fragile

The turning point came during a particularly brutal incident in 2018. We had a multi-vector DDoS attack that lasted 72 hours. Our scrubbing nodes held. Our architecture was brilliant. But our people were collapsing.

I watched a senior engineer—one of our best—break down in tears after a week of 3am escalations. That's when I realized: we had optimized the wrong variable.

High availability isn't sustainable if the team maintaining it isn't.

Redesigning for Human-Scale Operations

Here's what changed:

1. We Automated the Hell Out of Everything

Not just deployment pipelines—decision-making. If a known attack pattern appeared, the system didn't page a human. It executed the playbook autonomously and merely notified the team.

Humans became supervisors, not executors. We went from 40 pages/week to 4.
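To make the "notify, don't page" idea concrete, here is a minimal sketch. It is not Link11's actual code: the `Responder` class, the signature strings, and the playbook function are all hypothetical. It also assumes that unknown patterns and failed playbooks still page a human, which the text above does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Responder:
    # Hypothetical mapping: known attack signature -> playbook that mitigates it.
    playbooks: Dict[str, Callable[[], str]]
    notifications: List[str] = field(default_factory=list)  # informational only
    pages: List[str] = field(default_factory=list)          # these wake someone up

    def handle(self, signature: str) -> str:
        playbook = self.playbooks.get(signature)
        if playbook is None:
            # Assumption: an unknown pattern is the case that still needs a person.
            self.pages.append(f"unknown pattern: {signature}")
            return "paged"
        try:
            result = playbook()
        except Exception as exc:
            # Assumption: a failed playbook escalates instead of failing silently.
            self.pages.append(f"playbook failed for {signature}: {exc}")
            return "paged"
        # Known pattern: mitigate first, then tell the team after the fact.
        self.notifications.append(f"{signature}: {result}")
        return "auto-mitigated"


responder = Responder(playbooks={"syn_flood": lambda: "rate limit applied"})
print(responder.handle("syn_flood"))     # known pattern, handled without a page
print(responder.handle("novel_vector"))  # unknown pattern, escalated to a human
```

The design choice that matters is the default: a page is the exception path, not the first step.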

2. We Built "Graceful Degradation" Into Every Service

Instead of aiming for "never goes down," we designed for "fails elegantly." When a component hit capacity, it didn't crash—it shed non-critical load and continued serving core functions.

We accepted that 100% uptime was impossible. What was possible: ensuring that when things broke, they broke in ways that didn't require emergency human intervention at 3am.
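"Fails elegantly" can be sketched as a simple admission check. This is an illustration under assumptions, not a description of our production design: the `LoadShedder` class, the `critical` flag, and the 80% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    critical: bool  # hypothetical flag: True for core functions


class LoadShedder:
    def __init__(self, capacity: int, shed_threshold: float = 0.8):
        self.capacity = capacity              # max concurrent requests
        self.shed_threshold = shed_threshold  # utilization at which shedding starts
        self.in_flight = 0

    def admit(self, request: Request) -> bool:
        """Return True if the request is accepted, False if it is shed."""
        if self.in_flight >= self.capacity:
            return False  # hard limit: reject rather than crash
        utilization = self.in_flight / self.capacity
        if utilization >= self.shed_threshold and not request.critical:
            return False  # near capacity: drop non-critical work first
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when a request finishes, freeing one slot."""
        self.in_flight = max(0, self.in_flight - 1)
```

With a capacity of 10, the first eight requests of any kind are admitted. After that, only critical requests get the remaining slots, and everything is rejected cleanly once the component is full.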

3. We Enforced True "Off" Time

We implemented mandatory rotation policies:

At first, engineers resisted. "What if something breaks and the person on-call doesn't know the system?" Fair concern. The answer: better documentation and better systems.

If only one person can fix it, you don't have an operations problem—you have a knowledge-transfer problem.

4. We Measured "Stress Debt" Like Technical Debt

Every quarter, we tracked:

If stress debt was accumulating, we treated it like a production outage. It got executive attention and a remediation plan.

The Results: Better Uptime, Healthier Team

The outcome surprised even me:

The Philosophy: Sustainable Scale

Here's my thesis: true resilience is systemic, not heroic.

If your platform's reliability depends on someone's willingness to sacrifice sleep, relationships, and mental health—you don't have a resilient system. You have a house of cards held together by human suffering.

The best infrastructure I've ever built isn't the one that never goes down. It's the one that handles failure gracefully and protects the people maintaining it as carefully as it protects the customers using it.

What This Means for You

If you're leading an engineering org, ask yourself:

If the answers make you uncomfortable, you're not alone. Most of us built our systems in an era when "always on" meant "someone is always on-call."

But we're in a new era now. AI agents, autonomous remediation, intelligent observability—the tools exist to build systems that don't require human sacrifice.

High availability should be an architectural property, not a lifestyle demand.

The Mandate: Build Systems That Let People Sleep

At Link11, we protect the internet. But we also protect the people who protect the internet.

That's not altruism—it's pragmatism. Sustainable systems require sustainable teams.

If your uptime depends on burning people out, you're one resignation away from an outage.

Build systems that let your people sleep. Your uptime—and your team—will thank you.

