The Fallacy of "Good Enough" in Infrastructure
Most companies build for the 99th percentile. They stress-test for peak holiday traffic. They rehearse the quarterly outage drill. They have a runbook for the database going down.
Then a fiber line gets cut by construction equipment in three cities simultaneously. Or a nation-state actor launches a coordinated multi-vector DDoS attack at 3am on a Sunday. Or, my personal favorite from 2019, a BGP route leak sends a large share of European mobile traffic through China Telecom for roughly two hours.
These aren't hypotheticals. They're Tuesday mornings in the cybersecurity world.
After 20 years defending critical infrastructure at Link11, I've learned one hard truth: if you only build for probable failure, the improbable failure will destroy you.
The 1-in-1000 Year Event Is Happening Every Quarter
Financial engineers love to talk about "tail risk"—the low-probability, high-impact events that sit in the far end of the distribution curve. Black swans. Fat tails. The 2008 financial crisis was supposed to be a once-in-a-century event.
It happened again in 2020 with COVID-19.
In cybersecurity and infrastructure, we don't have the luxury of treating catastrophic failure as a statistical outlier. The attack surface is too large. The adversaries are too sophisticated. The dependencies are too complex.
Here's what I mean:
- DDoS attacks exceeding 1 Tbps were "impossible" until 2016. Now we see them monthly.
- Critical vulnerabilities in core infrastructure (Log4j, Heartbleed, Spectre) were rare. Now they're annual events.
- Supply chain compromises (SolarWinds, 3CX) were theoretical. Now they're a primary attack vector.
The 1-in-1000 year event is happening every quarter somewhere in the tech ecosystem. The only question is whether you're ready when it hits you.
How We Design for Catastrophic Failure at Link11
Most infrastructure is built with redundancy in mind: two data centers, failover databases, load-balanced web servers. That covers the 99th percentile.
Strategic resilience goes further. It assumes that multiple systems fail simultaneously in ways you didn't anticipate.
Here's how we think about it:
1. Multi-Region, Multi-Cloud, Multi-Provider
We don't just replicate across availability zones. We replicate across geographically independent regions with independent network backbones.
When a BGP routing attack hits Frankfurt, our scrubbing nodes in Amsterdam and Paris keep running. When AWS has an outage in eu-central-1, our hybrid cloud architecture (AWS + Hetzner + on-prem) keeps us online.
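To put a rough number on why independent regions matter, here is a minimal sketch. It assumes region failures are statistically independent, which only holds if the regions really do not share a backbone, provider, or control plane. The 99.9% per-region figure and the function name are illustrative assumptions, not Link11 measurements.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one of `regions` independent regions is up."""
    if not 0.0 <= per_region <= 1.0:
        raise ValueError("per_region must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # All regions must be down at once for the service to be down.
    return 1.0 - (1.0 - per_region) ** regions


HOURS_PER_YEAR = 24 * 365

for n in (1, 2, 3):
    availability = combined_availability(0.999, n)  # 0.999 is an assumed figure
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{n} region(s): {availability:.9f} available, "
          f"about {downtime_hours:.5f} h expected downtime per year")
```

Any correlation between regions (a shared provider, a shared routing policy, a shared deploy pipeline) erodes this benefit quickly, which is the argument for spreading across providers and not just across zones.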
This isn't paranoia. This is pattern recognition from watching the same failure modes repeat across two decades.
2. Assume Every Dependency Will Fail
Your payment processor will go down. Your DNS provider will have an outage. Your logging service will drop events.
Most companies treat these as "acceptable risks." We treat them as design constraints.
- Every critical API call has a timeout and a fallback.
- Every external dependency has a circuit breaker.
- Every service can degrade gracefully instead of triggering a cascading failure.
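The three rules above can be sketched as a small circuit breaker. This is a minimal illustration, not our production code: the class name, thresholds, and injectable clock are assumptions, and the wrapped call is expected to enforce its own timeout and raise when it expires.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down period has passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # cool-down in seconds
        self.clock = clock                 # injectable so the breaker is testable
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cool-down, let one trial call through (half-open state).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()              # do not touch the failing dependency
        try:
            result = primary()             # primary should raise on timeout
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, `primary` would wrap the external API call with an explicit timeout, and `fallback` would return a cached or degraded answer, so callers never wait on a dependency that is already known to be down.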
The best example: our DDoS scrubbing pipeline. If the machine-learning classification layer fails, we fall back to signature-based detection. If that fails, we fall back to rate limiting by IP. If that fails, we trigger manual override.
There is always a fallback.
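A fallback chain like the one just described might look roughly like this. The layer functions below are hypothetical stand-ins with made-up thresholds, and the ML layer is hard-coded to fail so the fallback path is visible; only the ordering (ML, then signatures, then rate limiting, then manual override) comes from the description above.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

Packet = Dict[str, Any]                       # hypothetical traffic metadata
Layer = Tuple[str, Callable[[Packet], bool]]  # (name, returns True to drop)


class ManualOverrideRequired(Exception):
    """Raised when every automated detection layer has failed."""


def classify_with_fallback(packet: Packet, layers: Sequence[Layer]) -> Tuple[str, bool]:
    """Try each layer in order; return (layer name, drop decision) from the first that works."""
    for name, layer in layers:
        try:
            return name, layer(packet)
        except Exception:
            continue  # this layer is unavailable, fall through to the simpler one
    raise ManualOverrideRequired("all automated layers failed; page a human")


def ml_classifier(packet: Packet) -> bool:
    raise RuntimeError("model server unreachable")  # simulated outage


def signature_detection(packet: Packet) -> bool:
    return str(packet.get("payload", "")).startswith("BAD")  # toy signature


def rate_limit_by_ip(packet: Packet) -> bool:
    return int(packet.get("requests_per_second", 0)) > 1000  # assumed threshold


LAYERS = [
    ("ml", ml_classifier),
    ("signature", signature_detection),
    ("rate_limit", rate_limit_by_ip),
]

layer_used, drop = classify_with_fallback({"payload": "BAD-probe"}, LAYERS)
print(f"decision from {layer_used}: drop={drop}")
```

The design choice worth noting is that each layer is simpler and has fewer dependencies than the one before it, so the failure that takes out the ML layer is unlikely to take out the rate limiter too.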
3. The "Chaos Engineering" Rule: Break It Before Attackers Do
Netflix popularized the idea of deliberately breaking your own infrastructure to test resilience. We've been doing this since 2010—long before it had a name.
Every quarter, we run a "kill switch" exercise:
- What if our primary data center loses power?
- What if our CDN gets poisoned by a DNS attack?
- What if three key engineers are unreachable during an outage (sick, vacation, or—God forbid—targeted)?
The goal isn't to prove the system works. It's to discover where it doesn't—before attackers do.
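Before running a drill for real, it can help to enumerate on paper which combinations of losses would be fatal. The sketch below is a tabletop version under a simplifying assumption: the service stays up as long as every role has at least one surviving provider. The role and component names are invented for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def fatal_failure_sets(roles: Dict[str, Set[str]], max_failures: int) -> List[FrozenSet[str]]:
    """Return the minimal sets of components (up to max_failures in size)
    whose simultaneous loss leaves at least one role with no provider."""
    components = sorted(set().union(*roles.values()))
    fatal: List[FrozenSet[str]] = []
    for size in range(1, max_failures + 1):
        for combo in combinations(components, size):
            down = frozenset(combo)
            if any(smaller <= down for smaller in fatal):
                continue  # already explained by a smaller fatal set
            if any(providers <= down for providers in roles.values()):
                fatal.append(down)
    return fatal


# Hypothetical inventory: which components can fill which role.
ROLES = {
    "power": {"dc-primary", "dc-secondary"},
    "dns": {"dns-a"},
    "on_call": {"alice", "bob", "carol"},
}

for failure_set in fatal_failure_sets(ROLES, max_failures=3):
    print("fatal if lost together:", sorted(failure_set))
```

Any set of size one in the output is a single point of failure. Larger sets are the scenarios worth rehearsing in the quarterly exercise, such as losing all three on-call engineers at once.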
4. Horizontal Scale + Vertical Depth
Horizontal scaling (add more servers) is table stakes. But it only protects against volume problems.
Vertical depth—having low-level control over the network stack, the OS, the hardware—protects against existential problems.
Example: When we face a sophisticated DDoS attack that exploits Linux kernel behavior, we can't wait for an Ubuntu patch. We patch the kernel ourselves, test in staging, and deploy in under an hour.
Most SaaS companies can't do this. They're abstracted away from the metal. That abstraction is convenient—until it isn't.
The CEO Mindset: Insurance vs. Investment
Here's where most companies fail: they treat resilience as insurance—a cost center that hopefully never gets used.
I treat it as an investment—a competitive moat that compounds over time.
Why?
- Customer trust: When we stay online during a major attack and our competitors don't, we don't just retain customers—we win new ones.
- Regulatory advantage: As compliance regimes tighten (NIS2, DORA, GDPR), our resilience architecture becomes a requirement for enterprise deals.
- Operational leverage: When our infrastructure can survive worst-case scenarios, our engineers sleep better. Lower burnout = higher retention = institutional knowledge stays in-house.
The ROI isn't immediate. But over 10 years? It's the difference between a company that survives crises and one that gets acquired in a fire sale.
The Hardest Part: Convincing Your Board (And Yourself)
Let's be honest: building for the 1-in-1000 year event is expensive.
- Multi-region deployments cost 2-3x more than single-region.
- Chaos engineering requires dedicated team time.
- Vertical depth (owning the stack) means hiring expensive kernel engineers instead of cheap DevOps generalists.
Your CFO will ask: "Why are we spending $2M/year on something that might never happen?"
My answer:
"Because the one time it does happen, it will cost us $20M in revenue, $50M in market cap, and a decade of customer trust. And unlike a financial hedge, this investment also makes us faster and more reliable every single day."
This isn't fear-mongering. It's actuarial math combined with two decades of scar tissue.
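The arithmetic behind that answer can be made explicit with a simple expected-value sketch. It uses only the dollar figures quoted above and ignores the everyday reliability benefits and the loss of customer trust, which are harder to price. Adding the market-cap loss to the revenue loss is itself a simplifying assumption.

```python
def break_even_probability(annual_cost: float, loss_if_event: float) -> float:
    """Annual event probability above which the spend pays for itself in expectation."""
    if loss_if_event <= 0:
        raise ValueError("loss_if_event must be positive")
    return annual_cost / loss_if_event


ANNUAL_SPEND = 2_000_000       # the CFO's $2M/year
REVENUE_LOSS = 20_000_000      # $20M in revenue
MARKET_CAP_LOSS = 50_000_000   # $50M in market cap

p_revenue_only = break_even_probability(ANNUAL_SPEND, REVENUE_LOSS)
p_total = break_even_probability(ANNUAL_SPEND, REVENUE_LOSS + MARKET_CAP_LOSS)

print(f"break-even on revenue alone: {p_revenue_only:.1%} chance per year")
print(f"break-even on revenue plus market cap: {p_total:.1%} chance per year")
```

On these figures, the spend breaks even if the catastrophic event is expected about once a decade counting revenue alone, or about once every 35 years counting the market-cap hit as well. If catastrophic events really do arrive more often than that, the investment wins on expected value before any other benefit is considered.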
The European Advantage: Precision Engineering Culture
There's a reason Link11 is based in Frankfurt, not San Francisco.
Silicon Valley optimizes for growth at all costs. Move fast, break things, iterate in production.
That works brilliantly for social media apps. It's catastrophic for critical infrastructure.
German engineering culture—precision, thoroughness, planning for worst-case—aligns perfectly with the strategic resilience mindset. We don't ship features that haven't been stress-tested under failure conditions.
Is it slower? Yes.
Is it boring? Absolutely.
Does it keep our customers online when the world is on fire? Every single time.
What This Means for You
You probably don't need Link11-level resilience. Most companies don't.
But if you're in:
- Financial services (where downtime = regulatory fines + customer exodus)
- Healthcare (where downtime = literal life-and-death)
- Critical infrastructure (energy, logistics, telecom)
- High-stakes SaaS (where your uptime is your customers' uptime)
...then you need to start thinking like a paranoid systems architect.
Ask yourself:
- What's my single point of failure? (There's always one. Find it.)
- What happens if my primary cloud region disappears for 6 hours?
- Can I survive a coordinated attack on my weakest dependency?
- Do I have a manual override for every automated system?
If you can't answer these confidently, you're not building for resilience. You're building for hope.
Final Thought: The Lindy Effect of Infrastructure
There's a concept called the Lindy Effect: the longer something has survived, the longer it's likely to survive in the future.
Technologies that have been battle-tested for decades (TCP/IP, Postgres, Nginx) are more resilient than shiny new frameworks that promise the world.
The same applies to companies.
Link11 has survived:
- The 2008 financial crisis
- The 2016 Mirai botnet attacks (among the largest DDoS attacks on record at the time)
- The 2020 pandemic (overnight shift to remote operations with zero downtime)
- Countless BGP hijacks, fiber cuts, and nation-state attacks
We're still here—not because we're lucky, but because we designed for the worst and operated for the best.
That's strategic resilience.
And it's the only way to build something that lasts.