The Network Is Not Reliable (But Your App Can Be)

The Fantasy of the Perfect Network

Every software engineer starts with a beautiful lie: the network is reliable.

We build applications assuming packets arrive in order, connections stay open, and latency is predictable. We write fetch() and expect a response. We open a WebSocket and assume it stays connected. We cache DNS and forget it expires.

Then the real world happens.

A fiber cut in Kansas takes down half your API traffic. A BGP misconfiguration routes your requests to Pakistan. A DNS resolver flaps and your users see "Service Unavailable" for 90 seconds. Your app, so elegant in staging, becomes a brittle house of cards.

After 20 years of building DDoS defense systems at Link11—where we protect infrastructure against the most hostile network conditions imaginable—I've learned one truth: the network will fail you. Your job is to fail gracefully.

The Eight Fallacies (Still True in 2026)

In 1994, L. Peter Deutsch and colleagues at Sun Microsystems wrote down the first seven "Fallacies of Distributed Computing"; James Gosling added the eighth a few years later. More than 30 years on, they're still gospel:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous

Every one of these assumptions is wrong. And yet, most applications are built as if they're true.

The modern cloud-native stack—microservices, API gateways, service meshes—has added more network hops, not fewer. Every call is a potential failure point. If you're not designing for network failure, you're designing for production incidents.

Lesson 1: Timeouts Are Not Optional

The fastest way to kill your application is to wait forever for a response that will never come.

I've seen production systems hang for minutes because a single database connection didn't have a timeout. The app kept waiting. Threads piled up. Memory use climbed with every blocked request. Eventually, the entire service became unresponsive, not because the database was down, but because we didn't give up fast enough.

Every network operation needs three timeouts:

Connection timeout: how long to wait for the initial handshake (usually 2-5 seconds)
Read timeout: how long to wait for the next chunk of data once connected, starting with the first byte of the response (5-10 seconds)
Total timeout: absolute deadline for the entire operation, however many reads it takes (30-60 seconds max)

If you don't set them, the defaults are usually absurd. Some HTTP clients, Python's requests library among them, wait indefinitely unless you pass a timeout. That's not defensive programming; it's a time bomb.
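
A minimal Python sketch of how the three timeouts can work together, using only the standard library. Deadline and fetch_raw are hypothetical names chosen for illustration, and the default values are simply picked from the ranges above. The idea is that every step gets its own limit, but never more than what is left of the total budget.

```python
import socket
import time


class Deadline:
    """Tracks the total timeout: an absolute deadline for the whole operation."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + total_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def budget(self, step_timeout):
        """Timeout for the next step: its own limit, capped by what is left overall."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("total deadline exceeded")
        return min(step_timeout, left)


def fetch_raw(host, port, request_bytes,
              connect_timeout=3.0, read_timeout=10.0, total_timeout=30.0):
    """Send request_bytes and read until the peer closes the connection."""
    deadline = Deadline(total_timeout)
    # Connection timeout: bounds the TCP handshake.
    sock = socket.create_connection((host, port),
                                    timeout=deadline.budget(connect_timeout))
    with sock:
        sock.sendall(request_bytes)
        chunks = []
        while True:
            # Read timeout: bounds each wait for more data, within the total budget.
            sock.settimeout(deadline.budget(read_timeout))
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

A slow server that trickles one byte every few seconds never trips the read timeout on its own, which is why the total deadline is checked before every read.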

Lesson 2: Retries Must Be Idempotent (And Exponential)

Retrying a failed request sounds simple. It's not.

If your API charges a credit card, retrying on failure might charge twice. If it sends an email, retrying might spam the user. If it triggers a deploy, retrying might cause a race condition that corrupts state.

Rule 1: Only retry idempotent operations. If you can't safely repeat the action, don't auto-retry—log it and alert a human.

Rule 2: Use exponential backoff with jitter. If 1,000 clients all fail at the same time and retry immediately, you've just created a self-inflicted DDoS attack. Spread out the retries:

wait = min(base * (2 ** attempt) + random_jitter, max_wait)

Start at 100ms, double each time, add randomness, and cap at something reasonable (10-30 seconds). This prevents the "thundering herd" problem and gives the downstream service time to recover.
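
The formula above, sketched in Python. The names backoff_delay and retry_with_backoff are hypothetical, and the jitter range (up to 100% of the exponential delay) is an assumption; the text only asks for "randomness". The sleep function is injectable so the helper can be tested without waiting.

```python
import random
import time


def backoff_delay(attempt, base=0.1, max_wait=30.0, rng=random.random):
    """wait = min(base * 2**attempt + jitter, max_wait), in seconds."""
    exponential = base * (2 ** attempt)
    jitter = rng() * exponential  # assumption: jitter is 0-100% of the delay
    return min(exponential + jitter, max_wait)


def retry_with_backoff(operation, max_attempts=5,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Only pass idempotent operations: the call may run more than once.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
```

Errors outside retry_on (a validation error, for example) propagate immediately, since repeating the request would not change the outcome.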

Lesson 3: Circuit Breakers Save Lives (And Uptime)

Imagine your app calls an external API. That API goes down. Without a circuit breaker, every request to your app will wait for the timeout, retry a few times, and eventually fail—slowly.

Your users see 30-second page loads. Your error rate spikes. Your monitoring dashboard lights up like a Christmas tree.

A circuit breaker stops this cascade. After a threshold of failures (say, 5 in 10 seconds), it "opens" and stops calling the broken service entirely. Instead of waiting, it fails fast—returning a cached response, a degraded experience, or a clear error.

After a cool-down period, it tries again ("half-open"). If the service has recovered, it closes and resumes normal operation. If not, it stays open.

This pattern—borrowed from electrical engineering—prevents one failing dependency from taking down your entire system. We use it everywhere at Link11: database calls, third-party APIs, internal microservices. It's non-negotiable.
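
A minimal, single-threaded sketch of the three states in Python. The class and its parameter names are hypothetical, the defaults mirror the "5 in 10 seconds" example, and the 30-second cool-down is an assumption. A production breaker would also need locking and a limit on concurrent half-open trials.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was given."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window=10.0, cooldown=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window        # seconds over which failures are counted
        self.cooldown = cooldown    # seconds to stay open before a trial call
        self._clock = clock
        self._failures = deque()    # timestamps of recent failures
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def call(self, operation, fallback=None):
        state = self.state
        if state == "open":
            # Fail fast: do not touch the broken dependency at all.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = operation()
        except Exception:
            self._record_failure(trial=(state == "half-open"))
            if fallback is not None:
                return fallback()
            raise
        if state == "half-open":
            # The trial call succeeded: close the circuit again.
            self._failures.clear()
            self._opened_at = None
        return result

    def _record_failure(self, trial):
        now = self._clock()
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        if trial or len(self._failures) >= self.failure_threshold:
            self._opened_at = now  # open, or re-open after a failed trial
```

Usage would look like breaker.call(fetch_recommendations, fallback=popular_items), where both callables are placeholders for your own functions.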

Lesson 4: Fallbacks Are Your Safety Net

When the network fails, what does your app do?

Most apps: crash, hang, or show a generic error page.

Resilient apps: fall back to a degraded but functional mode.

If the recommendation API is down, show popular items instead of personalized ones
If the real-time pricing service fails, use the last cached price with a warning
If the avatar CDN is unreachable, show a default placeholder
If the notification service is broken, queue the message locally and retry later

This is the difference between an outage and an "unnoticed degradation." Your users might not get the perfect experience, but they can still accomplish their goal. That's what matters.
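
As an illustration, here is the cached-price fallback sketched in Python. The function name, the dict-based cache, and the shape of the return value are all assumptions; the point is only that the caller gets a usable answer plus a flag it can turn into a warning.

```python
import time


def get_price(sku, fetch_live, cache, clock=time.time):
    """Return the live price, or the last cached one marked as stale.

    fetch_live is any callable that takes a SKU and returns a price;
    cache is a plain dict mapping SKU -> (price, fetched_at).
    """
    try:
        price = fetch_live(sku)
    except (ConnectionError, TimeoutError):
        if sku not in cache:
            raise  # nothing to fall back to: let the caller decide
        cached_price, fetched_at = cache[sku]
        return {"price": cached_price, "stale": True,
                "age_seconds": clock() - fetched_at}
    cache[sku] = (price, clock())
    return {"price": price, "stale": False, "age_seconds": 0.0}
```

The UI can then show "price as of 2 minutes ago" when stale is true instead of an error page.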

Lesson 5: Monitor What Actually Breaks

Most monitoring dashboards track the wrong things.

They show you CPU usage, memory consumption, and request counts. These are symptoms. What you actually need to know is: Can users complete their core workflows?

We track synthetic transactions—automated scripts that simulate real user behavior every 60 seconds from multiple regions. If the script can't log in, can't load the dashboard, or can't submit a form, we know immediately—not when the first angry support ticket arrives.
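
A synthetic transaction can be as simple as a list of named steps that run in order and stop at the first failure. The sketch below is a hypothetical Python runner, not our production tooling; the step callables (log in, load the dashboard, submit a form) are whatever your own workflow needs, and scheduling it every 60 seconds from several regions is left to your job runner.

```python
import time


def run_synthetic_check(steps, clock=time.monotonic):
    """Run (name, callable) steps in order; stop at the first one that raises."""
    results = []
    for name, step in steps:
        started = clock()
        try:
            step()
            ok, error = True, None
        except Exception as exc:  # any failure means the workflow is broken
            ok, error = False, repr(exc)
        results.append({"step": name, "ok": ok,
                        "seconds": clock() - started, "error": error})
        if not ok:
            break  # later steps depend on earlier ones
    return results


def check_passed(results):
    """True only if every step that ran succeeded."""
    return all(r["ok"] for r in results)
```

Recording the duration of each step also shows slow degradation before a step fails outright.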

We also track error budgets: a quota of acceptable failures per month. If we burn through 50% of our budget in the first week, we freeze feature work and focus on reliability. It's not just a metric—it's a forcing function.
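
The arithmetic behind an error budget is short enough to sketch in Python. The 99.9% target and the function names are assumptions for illustration; the freeze rule encodes the "50% in the first week" policy described above.

```python
def error_budget(expected_requests, slo=0.999):
    """Number of failed requests the target tolerates over the period."""
    return expected_requests * (1.0 - slo)


def budget_consumed(failed_requests, expected_requests, slo=0.999):
    """Fraction of the period's error budget already spent (1.0 = all of it)."""
    return failed_requests / error_budget(expected_requests, slo)


def should_freeze_features(failed_requests, expected_requests,
                           day_of_month, slo=0.999):
    """Freeze feature work if half the budget is gone within the first week."""
    spent = budget_consumed(failed_requests, expected_requests, slo)
    return day_of_month <= 7 and spent >= 0.5
```

With one million expected requests a month and a 99.9% target, the budget is about 1,000 failed requests, so 600 failures by day 5 would trigger the freeze.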

Lesson 6: DNS Is a Single Point of Failure (And You're Probably Ignoring It)

Your app might be perfect. Your infrastructure bulletproof. But if your DNS goes down, you're offline.

We've seen attacks that flood DNS resolvers with garbage queries, exhausting their capacity. We've seen typo-squatting domains that hijack traffic. We've seen cache poisoning attacks that redirect users to phishing sites.

Best practices:

Use multiple DNS providers (e.g., Cloudflare + AWS Route 53) with health-check-based failover
Set short TTLs (around 60 seconds) on critical records so a failover propagates quickly
Monitor DNS response times from multiple geographic regions
Use DNSSEC to prevent spoofing

If you're running your own authoritative nameservers, you're either extremely skilled or extremely reckless. Delegate this to specialists.
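
For the monitoring item above, a small Python probe can time a lookup and record what came back. This is a sketch under assumptions: time_dns_lookup is a hypothetical helper, it goes through the operating system's resolver (so local caches affect the timing), and it sees only one vantage point. Running it from several regions is a deployment question.

```python
import socket
import time


def time_dns_lookup(hostname, resolver=socket.getaddrinfo, clock=time.monotonic):
    """Resolve hostname and report how long it took and which addresses came back."""
    started = clock()
    try:
        answers = resolver(hostname, None)
    except socket.gaierror as exc:
        return {"hostname": hostname, "ok": False,
                "seconds": clock() - started, "addresses": [],
                "error": str(exc)}
    # Each answer is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    addresses = sorted({info[4][0] for info in answers})
    return {"hostname": hostname, "ok": True,
            "seconds": clock() - started, "addresses": addresses,
            "error": None}
```

Alerting on a change in the returned addresses, not only on slow or failed lookups, is one way to notice a hijacked or poisoned record.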

Lesson 7: BGP Hijacking Is Rare (But Devastating)

In April 2018, an attacker announced part of Amazon's IP space from another network. For roughly two hours, DNS queries meant for Route 53 were answered by the attacker's servers, which sent MyEtherWallet users to a phishing site where cryptocurrency wallets were drained.

BGP (Border Gateway Protocol) is the routing system of the internet. It's based on trust. Any network operator can announce "I have the best route to this IP," and routers around the world will believe them.

There's no built-in authentication. It's a gentlemen's agreement from the 1980s, running the modern internet.

What you can do:

Register your prefixes in IRR (Internet Routing Registry) databases
Use RPKI (Resource Public Key Infrastructure) to publish cryptographically signed Route Origin Authorizations (ROAs) that state which AS may originate each of your prefixes
Monitor for unexpected route changes using services like BGPmon or RIPE RIS
Work with your ISP/transit provider to filter invalid announcements
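
To make the RPKI item concrete: a router that does origin validation compares each announcement against the published authorizations (prefix, maximum prefix length, origin AS) and classifies it as valid, invalid, or not-found. The Python below is a simplified sketch of that comparison, with a hypothetical function name and documentation-only prefixes and AS numbers; it leaves out the cryptographic verification of the authorizations themselves.

```python
import ipaddress


def validate_origin(prefix, origin_asn, roas):
    """Classify a route announcement against a list of authorizations.

    roas is an iterable of (authorized_prefix, max_length, asn) tuples.
    Returns "valid", "invalid", or "not-found".
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue  # this authorization does not cover the announced prefix
        covered = True
        if asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    # Covered by some authorization but matched by none: likely a hijack or leak.
    return "invalid" if covered else "not-found"


# Example with documentation ranges (192.0.2.0/24, AS64496):
ROAS = [("192.0.2.0/24", 24, 64496)]
```

An announcement of 192.0.2.0/25 from the right AS is still invalid here, because it is more specific than the authorized maximum length. That is the same trick hijackers use to win route selection.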

If you're operating at scale, you should have someone on your team who understands BGP. If not, you're flying blind.

Lesson 8: Packet Loss Is Normal (So Handle It)

On a good day, the internet loses about 0.1% of packets. On a bad day, it can be 5-10%. Across the ocean? Even higher.

TCP handles this automatically with retransmissions, but that adds latency. If your app is sensitive to latency (real-time gaming, video calls, financial trading), you need to design around packet loss:

Use a UDP-based transport that handles retransmission in the application layer (as QUIC does), so you control what gets resent and when
Send redundant data (forward error correction) so lost packets can be reconstructed
Adapt your bitrate or quality dynamically based on observed loss
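
The simplest form of forward error correction is a single XOR parity packet per group: if any one packet in the group is lost, the receiver rebuilds it from the others. The Python sketch below assumes equal-length packets and hypothetical function names; real codecs use stronger codes that survive multiple losses.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Build one parity packet for a group of equal-length packets."""
    if len({len(p) for p in packets}) != 1:
        raise ValueError("all packets in a group must have the same length")
    parity = bytes(len(packets[0]))  # all zeros
    for packet in packets:
        parity = xor_bytes(parity, packet)
    return parity


def recover(received, parity):
    """Rebuild the single missing packet (marked None) in a group."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one lost packet")
    rebuilt = parity
    for packet in received:
        if packet is not None:
            rebuilt = xor_bytes(rebuilt, packet)
    restored = list(received)
    restored[missing[0]] = rebuilt
    return restored
```

The cost is one extra packet per group, paid up front, in exchange for skipping a round trip when a packet goes missing.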

For most apps, TCP is fine—but you need to understand why it's slow sometimes. It's not your server. It's the hostile network between you and the user.

The Mindset Shift: Design for Chaos

The best lesson I've learned in 20 years of cybersecurity and infrastructure is this: stop assuming stability.

The network is not a perfect pipe. It's a battlefield. Packets get lost. Routers get misconfigured. Cables get cut by backhoes. Malicious actors flood your servers with garbage traffic.

If your application can't handle this reality, it's not production-ready—it's a prototype running in a fantasyland.

The antidote is simple:

Assume every network call will fail
Set aggressive timeouts
Retry intelligently with backoff
Use circuit breakers to fail fast
Build fallbacks for degraded experiences
Monitor real user flows, not just server health
Treat DNS and BGP as critical dependencies
Embrace packet loss as a design constraint

This isn't paranoia. It's pragmatism.

The network will betray you. The only question is whether you're ready when it does.

— Jens-Philipp Jung, CEO Link11
