
The Network Is Not Reliable (But Your App Can Be)

Packet loss happens. BGP flaps happen. Fiber cuts happen. If your app assumes a perfect pipe, it's brittle. Lessons from 20 years of building DDoS defense: design for a hostile network.

The Fantasy of the Perfect Network

Every software engineer starts with a beautiful lie: the network is reliable.

We build applications assuming packets arrive in order, connections stay open, and latency is predictable. We write fetch() and expect a response. We open a WebSocket and assume it stays connected. We cache DNS answers and forget that they expire.

Then the real world happens.

A fiber cut in Kansas takes down half your API traffic. A BGP misconfiguration routes your requests to Pakistan. A DNS resolver flaps and your users see "Service Unavailable" for 90 seconds. Your app, so elegant in staging, becomes a brittle house of cards.

After 20 years of building DDoS defense systems at Link11, where we protect infrastructure against the most hostile network conditions imaginable, I've learned one truth: the network will fail you. Your job is to fail gracefully.

The Eight Fallacies (Still True in 2026)

In 1994, L Peter Deutsch and colleagues at Sun Microsystems set down the "Fallacies of Distributed Computing"; James Gosling added the eighth a few years later. Some 30 years on, they're still gospel:

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Every one of these assumptions is wrong. And yet, most applications are built as if they're true.

The modern cloud-native stack (microservices, API gateways, service meshes) has added more network hops, not fewer. Every call is a potential failure point. If you're not designing for network failure, you're designing for production incidents.

Lesson 1: Timeouts Are Not Optional

The fastest way to kill your application is to wait forever for a response that will never come.

I've seen production systems hang for minutes because a single database connection didn't have a timeout. The app kept waiting. Threads piled up. Memory leaked. Eventually, the entire service became unresponsive, not because the database was down, but because we didn't give up fast enough.

Every network operation needs three timeouts:

If you don't set them, the defaults are usually absurd. Some HTTP clients default to infinity. That's not defensive programming—that's a time bomb.
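As a rough illustration of explicit timeouts, here is a minimal Python sketch. One common way to split the three, assumed here rather than taken from the text, is a connect timeout, a per-read timeout, and an overall deadline that caps both. The `Deadline` class, the `fetch_banner` function, and the default values are hypothetical names and numbers chosen for the example.

```python
import socket
import time


class Deadline:
    """Tracks one overall time budget shared by several network steps."""

    def __init__(self, total_seconds: float) -> None:
        # monotonic() is immune to wall-clock adjustments.
        self._expires_at = time.monotonic() + total_seconds

    def remaining(self) -> float:
        """Seconds left in the budget; raises TimeoutError once it is spent."""
        left = self._expires_at - time.monotonic()
        if left <= 0:
            raise TimeoutError("overall deadline exceeded")
        return left


def fetch_banner(host: str, port: int,
                 connect_timeout: float = 2.0,
                 read_timeout: float = 5.0,
                 total_timeout: float = 8.0) -> bytes:
    """Connect and read up to 1 KiB, never waiting longer than the budget."""
    deadline = Deadline(total_timeout)
    # Connect timeout, capped by what is left of the overall budget.
    sock = socket.create_connection(
        (host, port), timeout=min(connect_timeout, deadline.remaining())
    )
    with sock:
        # Read timeout, again capped by the overall budget.
        sock.settimeout(min(read_timeout, deadline.remaining()))
        return sock.recv(1024)
```

The point of the shared deadline is that a slow connect eats into the time available for the read, so the caller's worst-case wait stays bounded by `total_timeout` rather than by the sum of the individual limits.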

Lesson 2: Retries Must Be Idempotent (And Exponential)

Retrying a failed request sounds simple. It's not.

If your API charges a credit card, retrying on failure might charge twice. If it sends an email, retrying might spam the user. If it triggers a deploy, retrying might cause a race condition that corrupts state.

Rule 1: Only retry idempotent operations. If you can't safely repeat the action, don't auto-retry; log it and alert a human.

Rule 2: Use exponential backoff with jitter. If 1,000 clients all fail at the same time and retry immediately, you've just created a self-inflicted DDoS attack. Spread out the retries:

    wait = min(base * (2 ** attempt) + random_jitter, max_wait)

Start at 100ms, double each time, add randomness, and cap at something reasonable (10 to 30 seconds). This prevents the "thundering herd" problem and gives the downstream service time to recover.
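The formula above can be turned into a small retry helper. This is a sketch under assumptions: the names `backoff_delay` and `retry_idempotent` are hypothetical, the jitter is drawn uniformly from zero to `base` (the text only says "add randomness"), and only connection and timeout errors are treated as retryable.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, max_wait: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    jitter = random.uniform(0, base)  # assumed jitter range: [0, base]
    return min(base * (2 ** attempt) + jitter, max_wait)


def retry_idempotent(operation, max_attempts: int = 5, sleep=time.sleep,
                     retry_on=(ConnectionError, TimeoutError)):
    """Call `operation` until it succeeds or attempts run out.

    Only pass operations that are safe to repeat (see Rule 1).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so the schedule can be inspected in tests without actually waiting.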

Lesson 3: Circuit Breakers Save Lives (And Uptime)

Imagine your app calls an external API. That API goes down. Without a circuit breaker, every request to your app will wait for the timeout, retry a few times, and eventually fail, slowly.

Your users see 30-second page loads. Your error rate spikes. Your monitoring dashboard lights up like a Christmas tree.

A circuit breaker stops this cascade. After a threshold of failures (say, 5 in 10 seconds), it "opens" and stops calling the broken service entirely. Instead of waiting, it fails fast, returning a cached response, a degraded experience, or a clear error.

After a cool-down period, it lets a trial request through ("half-open"). If the service has recovered, it closes and resumes normal operation. If not, it opens again for another cool-down.
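The closed, open, and half-open states described above can be sketched in a few lines of Python. This is an illustrative, single-threaded version, not a production implementation: the class and parameter names are hypothetical, the 30-second cool-down is an assumed default, and a real breaker would need locking and would usually limit the half-open state to one in-flight trial call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, window: float = 10.0,
                 cooldown: float = 30.0, clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.window = window          # seconds over which failures are counted
        self.cooldown = cooldown      # seconds to stay open before a trial call
        self._clock = clock
        self._failures = []           # timestamps of recent failures
        self._opened_at = None        # None means the breaker is closed

    def call(self, operation):
        now = self._clock()
        if self._opened_at is not None and now - self._opened_at < self.cooldown:
            # Open: fail fast instead of waiting on a broken dependency.
            raise CircuitOpenError("dependency marked unhealthy")
        # Closed, or half-open after the cool-down: let the call through.
        try:
            result = operation()
        except Exception:
            self._record_failure(now)
            raise
        # Success closes the breaker and clears the failure history.
        self._failures.clear()
        self._opened_at = None
        return result

    def _record_failure(self, now: float) -> None:
        if self._opened_at is not None:
            # The half-open trial failed: reopen and restart the cool-down.
            self._opened_at = now
            return
        # Keep only failures inside the counting window, then add this one.
        self._failures = [t for t in self._failures if now - t <= self.window]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now
```

Callers would catch `CircuitOpenError` and serve whatever degraded response makes sense for that dependency.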

This pattern, borrowed from electrical engineering, prevents one failing dependency from taking down your entire system. We use it everywhere at Link11: database calls, third-party APIs, internal microservices. It's non-negotiable.

Lesson 4: Fallbacks Are Your Safety Net

When the network fails, what does your app do?

Most apps: crash, hang, or show a generic error page.

Resilient apps: fall back to a degraded but functional mode.

This is the difference between an outage and an "unnoticed degradation." Your users might not get the perfect experience, but they can still accomplish their goal. That's what matters.
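One simple form of degraded mode is serving the last known good value when a live call fails. The sketch below assumes that approach; `CachedFallback` is a hypothetical helper, and which errors count as "network failure" is an assumption you would adjust for your own client library.

```python
from typing import Callable, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class CachedFallback(Generic[T]):
    """Serve the last good value when the live call fails."""

    def __init__(self, fetch: Callable[[], T], default: Optional[T] = None) -> None:
        self._fetch = fetch
        self._last_good: Optional[T] = default

    def get(self) -> Tuple[Optional[T], bool]:
        """Return (value, degraded); degraded is True when the value is stale."""
        try:
            self._last_good = self._fetch()
            return self._last_good, False
        except (ConnectionError, TimeoutError):
            # Network trouble: fall back to whatever we last saw.
            return self._last_good, True
```

Returning the `degraded` flag alongside the value lets the caller decide whether to show a "data may be out of date" hint instead of an error page.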

Lesson 5: Monitor What Actually Breaks

Most monitoring dashboards track the wrong things.

They show you CPU usage, memory consumption, and request counts. These are symptoms. What you actually need to know is: Can users complete their core workflows?

We track synthetic transactions: automated scripts that simulate real user behavior every 60 seconds from multiple regions. If the script can't log in, can't load the dashboard, or can't submit a form, we know immediately, not when the first angry support ticket arrives.
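A synthetic transaction can be modeled as an ordered list of named steps that stops at the first failure. The following is a minimal sketch, not a description of Link11's tooling: `run_synthetic_check` and `CheckResult` are hypothetical names, and the step functions (log in, load the dashboard, submit a form) are placeholders you would implement against your own application.

```python
import time
from typing import Callable, NamedTuple, Optional, Sequence, Tuple


class CheckResult(NamedTuple):
    ok: bool
    failed_step: Optional[str]   # name of the first step that failed, if any
    elapsed_seconds: float


def run_synthetic_check(
    steps: Sequence[Tuple[str, Callable[[], None]]]
) -> CheckResult:
    """Run each step in order; a step signals failure by raising."""
    start = time.monotonic()
    for name, step in steps:
        try:
            step()
        except Exception:
            # Report which part of the user workflow broke.
            return CheckResult(False, name, time.monotonic() - start)
    return CheckResult(True, None, time.monotonic() - start)
```

A scheduler would run this every 60 seconds from each region and alert on any result where `ok` is false, using `failed_step` to say which workflow broke.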

We also track error budgets: a quota of acceptable failures per month. If we burn through 50% of our budget in the first week, we freeze feature work and focus on reliability. It's not just a metric; it's a forcing function.
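To make the arithmetic concrete, here is a small worked example. The numbers are assumptions for illustration only: a 99.9% success target and 10 million expected requests per month give a budget of 10,000 failed requests, so 5,000 failures by the end of week one means half the budget is gone. The function and field names are hypothetical.

```python
from typing import NamedTuple


class BudgetStatus(NamedTuple):
    allowed_failures: int
    consumed_fraction: float
    freeze_features: bool


def error_budget_status(slo: float, expected_requests: int,
                        failed_so_far: int, freeze_at: float = 0.5) -> BudgetStatus:
    """Compare failures so far against the monthly failure quota."""
    # Round to a whole number of requests to avoid floating-point drift.
    allowed = round(expected_requests * (1.0 - slo))
    consumed = failed_so_far / allowed if allowed else float("inf")
    return BudgetStatus(allowed, consumed, consumed >= freeze_at)


# Assumed example: 99.9% target, 10M requests/month, 5,000 failures in week one.
status = error_budget_status(0.999, 10_000_000, 5_000)
print(status.allowed_failures, status.consumed_fraction, status.freeze_features)
```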

Lesson 6: DNS Is a Single Point of Failure (And You're Probably Ignoring It)

Your app might be perfect. Your infrastructure bulletproof. But if your DNS goes down, you're offline.

We've seen attacks that flood DNS resolvers with garbage queries, exhausting their capacity. We've seen typo-squatting domains that hijack traffic. We've seen cache poisoning attacks that redirect users to phishing sites.

Best practices:

If you're running your own authoritative nameservers, you're either extremely skilled or extremely reckless. Delegate this to specialists.

Lesson 7: BGP Hijacking Is Rare (But Devastating)

In April 2018, attackers hijacked IP prefixes belonging to Amazon's Route 53 DNS service by announcing them from another network. For roughly two hours, DNS queries for a popular cryptocurrency wallet site were answered by the attackers' servers, and wallets were drained.

BGP (Border Gateway Protocol) is the routing system of the internet. It's based on trust. Any network operator can announce "I have the best route to this IP," and routers around the world will believe them.

There's no built-in authentication. It's a gentlemen's agreement from the 1980s, running the modern internet.

What you can do:

If you're operating at scale, you should have someone on your team who understands BGP. If not, you're flying blind.

Lesson 8: Packet Loss Is Normal (So Handle It)

On a good day, the internet loses about 0.1% of packets. On a bad day, it can be 5-10%. Across the ocean? Even higher.

TCP handles this automatically with retransmissions, but that adds latency. If your app is sensitive to latency (real-time gaming, video calls, financial trading), you need to design around packet loss:

For most apps, TCP is fine, but you need to understand why it's slow sometimes. It's not your server. It's the hostile network between you and the user.
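A quick back-of-the-envelope calculation shows why small loss rates still matter. Assuming, as a simplification, that each packet is lost independently (real loss tends to come in bursts), the chance that a transfer of n packets hits at least one loss is 1 - (1 - p)^n. At 0.1% loss a 100-packet response has about a 9.5% chance of needing a retransmission; at 5% loss it is almost certain. The function name is hypothetical.

```python
def prob_any_loss(loss_rate: float, packets: int) -> float:
    """Chance that at least one of `packets` is lost, assuming independent losses."""
    return 1.0 - (1.0 - loss_rate) ** packets


for rate in (0.001, 0.05):
    print(f"loss rate {rate:.1%}: {prob_any_loss(rate, 100):.1%} of 100-packet transfers see a loss")
```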

The Mindset Shift: Design for Chaos

The best lesson I've learned in 20 years of cybersecurity and infrastructure is this: stop assuming stability.

The network is not a perfect pipe. It's a battlefield. Packets get lost. Routers get misconfigured. Cables get cut by backhoes. Malicious actors flood your servers with garbage traffic.

If your application can't handle this reality, it's not production-ready; it's a prototype running in a fantasyland.

The antidote is simple:

This isn't paranoia. It's pragmatism.

The network will betray you. The only question is whether you're ready when it does.

— Jens-Philipp Jung, CEO Link11
