The Uptime Paradox
In cybersecurity, "always on" isn't aspirational—it's contractual. Our customers depend on us to keep their services online 24/7/365. A DDoS attack doesn't wait for business hours. When Link11's mitigation infrastructure goes down, millions of users feel it.
So we chase nines. 99.9% uptime. 99.99%. The holy grail of 99.999%—five nines, translating to just 5.26 minutes of downtime per year.
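To make the arithmetic behind the nines concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the use of an average 365.25-day year are assumptions made for illustration; a 365-day year gives nearly the same figures.

```python
# Minutes in an average year (365.25 days accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

Each extra nine shrinks the budget by a factor of ten: roughly 526 minutes at three nines, 53 at four, and a little over 5 at five.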
But here's what nobody tells you: every additional nine costs exponentially more—not just in infrastructure, but in human resilience.
When High Availability Becomes Low Sustainability
In my first decade running Link11, I wore the pager like a badge of honor. Middle-of-the-night alerts? Part of the job. Weekend escalations? That's what founders do. I thought I was modeling dedication.
What I was actually modeling was burnout as a feature, not a bug.
The warning signs were subtle at first:
- Engineers sleeping with laptops next to the bed, checking dashboards before brushing their teeth
- Vacations interrupted by production incidents, leading people to just… stop taking vacations
- Relationships strained by the constant low-grade anxiety of being on call
- Talent attrition disguised as "seeking new challenges"—but really just escaping the grind
The cruel irony? Exhausted engineers make more mistakes. The very culture designed to maximize uptime was introducing new failure modes through human error.
The Realization: Systems Can Be Resilient Without People Being Fragile
The turning point came during a particularly brutal incident in 2018. We had a multi-vector DDoS attack that lasted 72 hours. Our scrubbing nodes held. Our architecture was brilliant. But our people were collapsing.
I watched a senior engineer—one of our best—break down in tears after a week of 3am escalations. That's when I realized: we had optimized the wrong variable.
High availability isn't sustainable if the team maintaining it isn't.
Redesigning for Human-Scale Operations
Here's what changed:
1. We Automated the Hell Out of Everything
Not just deployment pipelines—decision-making. If a known attack pattern appeared, the system didn't page a human. It executed the playbook autonomously and merely notified the team.
Humans became supervisors, not executors. We went from 40 pages/week to 4.
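The "execute the playbook, notify the team" idea can be sketched in a few lines. This is an illustration under stated assumptions, not our production code: the Responder class, the pattern names, and the playbook callables are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Responder:
    """Routes a detected attack pattern to a playbook or, failing that, to a human."""

    # Hypothetical mapping: pattern name -> callable that runs the mitigation.
    playbooks: dict[str, Callable[[], str]]
    notifications: list[str] = field(default_factory=list)  # informational, no wake-up
    pages: list[str] = field(default_factory=list)          # wakes someone up

    def handle(self, attack_pattern: str) -> str:
        playbook = self.playbooks.get(attack_pattern)
        if playbook is None:
            # Unknown pattern: this is the case that still needs human judgment.
            self.pages.append(f"unknown pattern: {attack_pattern}")
            return "paged"
        # Known pattern: run the playbook autonomously and merely notify the team.
        result = playbook()
        self.notifications.append(f"{attack_pattern}: {result}")
        return "auto-mitigated"


if __name__ == "__main__":
    responder = Responder(playbooks={"syn_flood": lambda: "rate limit applied"})
    print(responder.handle("syn_flood"))       # handled without a page
    print(responder.handle("novel_pattern"))   # escalated to the on-call engineer
```

The design choice that matters is the default: a page is the exception path, reserved for patterns that have no playbook yet.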
2. We Built "Graceful Degradation" Into Every Service
Instead of aiming for "never goes down," we designed for "fails elegantly." When a component hit capacity, it didn't crash—it shed non-critical load and continued serving core functions.
We accepted that 100% uptime was impossible. What was possible: ensuring that when things broke, they broke in ways that didn't require emergency human intervention at 3am.
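One common way to "fail elegantly" is load shedding. The sketch below assumes a simple in-flight request counter and a hypothetical 80% shed threshold; the class and parameter names are invented for illustration and real services would tune this per workload.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    critical: bool  # True for core functions that must keep working


class LoadShedder:
    """Admits requests, dropping non-critical ones first as capacity fills up."""

    def __init__(self, capacity: int, shed_threshold: float = 0.8):
        self.capacity = capacity              # max concurrent requests
        self.shed_threshold = shed_threshold  # utilization at which shedding starts
        self.in_flight = 0

    def admit(self, request: Request) -> bool:
        if self.in_flight >= self.capacity:
            # Completely full: reject cleanly instead of crashing.
            return False
        utilization = self.in_flight / self.capacity
        if utilization >= self.shed_threshold and not request.critical:
            # Under pressure: shed non-critical load, keep room for core traffic.
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when a request finishes."""
        self.in_flight = max(0, self.in_flight - 1)


if __name__ == "__main__":
    shedder = LoadShedder(capacity=10)
    for i in range(9):
        admitted = shedder.admit(Request(f"report-{i}", critical=False))
        print(f"report-{i}: {'admitted' if admitted else 'shed'}")
    print("mitigation:", shedder.admit(Request("mitigation", critical=True)))
```

With these assumed values, non-critical requests start being refused once eight of ten slots are busy, while critical requests are still admitted until the component is genuinely full.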
3. We Enforced True "Off" Time
We implemented mandatory rotation policies:
- No one on call for more than one week per month
- After a major incident, automatic 48-hour "incident leave"—no guilt, no exceptions
- Geographic distribution of on-call responsibility, so one person's sleep was never repeatedly sacrificed
At first, engineers resisted. "What if something breaks and the person on call doesn't know the system?" Fair concern. The answer: better documentation and better systems.
If only one person can fix it, you don't have an operations problem—you have a knowledge-transfer problem.
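A rule like "no more than one on-call week per month" is easy to check mechanically. The following is a minimal sketch, assuming the schedule is a list of (week start date, engineer) pairs and that a week counts toward the month in which it starts; the function name and the sample names are hypothetical.

```python
from collections import Counter
from datetime import date


def rotation_violations(
    schedule: list[tuple[date, str]],
    max_weeks_per_month: int = 1,
) -> list[tuple[str, int, int]]:
    """Return (engineer, year, month) entries that exceed the on-call limit.

    Assumption: each tuple is one on-call week, attributed to the month
    in which the week starts.
    """
    weeks = Counter((engineer, start.year, start.month) for start, engineer in schedule)
    return sorted(key for key, count in weeks.items() if count > max_weeks_per_month)


if __name__ == "__main__":
    march = [
        (date(2024, 3, 4), "ana"),
        (date(2024, 3, 11), "ben"),
        (date(2024, 3, 18), "ana"),  # second week in the same month
        (date(2024, 3, 25), "chi"),
    ]
    for engineer, year, month in rotation_violations(march):
        print(f"{engineer} is on call more than once in {year}-{month:02d}")
```

Running a check like this whenever the schedule changes turns the policy into something enforced by tooling rather than by whoever remembers to object.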
4. We Measured "Stress Debt" Like Technical Debt
Every quarter, we tracked:
- Number of after-hours pages
- Average incident response time
- Employee self-reported stress levels (anonymized)
- Vacation days actually taken vs. accrued
If stress debt was accumulating, we treated it like a production outage. It got executive attention and a remediation plan.
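As a rough illustration of tracking these four signals quarter over quarter, here is a sketch that flags every metric moving in the wrong direction. The field names, the stress survey scale, and the "any worsening counts" rule are assumptions made for the example, not a description of our actual tooling.

```python
from dataclasses import dataclass


@dataclass
class QuarterMetrics:
    after_hours_pages: int
    avg_response_minutes: float
    self_reported_stress: float   # assumed: mean of an anonymized survey, higher is worse
    vacation_days_taken: float
    vacation_days_accrued: float

    @property
    def vacation_ratio(self) -> float:
        """Share of accrued vacation actually taken (1.0 if nothing was accrued)."""
        if self.vacation_days_accrued == 0:
            return 1.0
        return self.vacation_days_taken / self.vacation_days_accrued


def worsening_metrics(previous: QuarterMetrics, current: QuarterMetrics) -> list[str]:
    """Name the metrics that got worse since the previous quarter."""
    flags = []
    if current.after_hours_pages > previous.after_hours_pages:
        flags.append("after_hours_pages")
    if current.avg_response_minutes > previous.avg_response_minutes:
        flags.append("avg_response_minutes")
    if current.self_reported_stress > previous.self_reported_stress:
        flags.append("self_reported_stress")
    if current.vacation_ratio < previous.vacation_ratio:
        flags.append("vacation_ratio")
    return flags


if __name__ == "__main__":
    q1 = QuarterMetrics(30, 12.0, 2.5, 40, 60)
    q2 = QuarterMetrics(45, 11.0, 3.1, 20, 60)
    print("Stress debt is accumulating in:", worsening_metrics(q1, q2))
```

A non-empty result is the trigger for the remediation plan; how many flags justify executive attention is a judgment call each organization has to make for itself.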
The Results: Better Uptime, Healthier Team
The outcome surprised even me:
- Our actual uptime improved. Fewer human errors, better architecture, smarter automation.
- Incident recovery time dropped because well-rested engineers think more clearly under pressure.
- Retention skyrocketed. People stopped leaving for "work-life balance"—because they already had it.
- Innovation increased. When you're not perpetually firefighting, you have bandwidth for strategic improvements.
The Philosophy: Sustainable Scale
Here's my thesis: true resilience is systemic, not heroic.
If your platform's reliability depends on someone's willingness to sacrifice sleep, relationships, and mental health—you don't have a resilient system. You have a house of cards held together by human suffering.
The best infrastructure I've ever built isn't the one that never goes down. It's the one that handles failure gracefully and protects the people maintaining it as carefully as it protects the customers using it.
What This Means for You
If you're leading an engineering org, ask yourself:
- Can your system survive a key engineer being unreachable for 24 hours?
- Do your people feel guilty taking vacation?
- Is "hero culture" celebrated or quietly expected?
- When was the last incident in which automation, rather than a human, saved the day?
If the answers make you uncomfortable, you're not alone. Most of us built our systems in an era when "always on" meant "someone is always on call."
But we're in a new era now. AI agents, autonomous remediation, intelligent observability—the tools exist to build systems that don't require human sacrifice.
High availability should be an architectural property, not a lifestyle demand.
The Mandate: Build Systems That Let People Sleep
At Link11, we protect the internet. But we also protect the people who protect the internet.
That's not altruism—it's pragmatism. Sustainable systems require sustainable teams.
If your uptime depends on burning people out, you're one resignation away from an outage.
Build systems that let your people sleep. Your uptime—and your team—will thank you.