The Data Deluge
Your monitoring dashboard has 847 metrics. Your logging pipeline processes 2TB of data per day. Your traces capture every HTTP request, database query, and inter-service call.
And when your site goes down at 3am, you still can't figure out why.
This is the observability paradox: more instrumentation creates less visibility. The promise was clarity—the reality is noise at scale.
After 20 years building and operating critical infrastructure at Link11, I've watched observability evolve from "ssh into the box and tail the logs" to distributed tracing with OpenTelemetry spans. The tools got better. The problem got worse.
Here's what actually works.
Why More Data Doesn't Mean More Insight
The first-generation observability pitch was seductive: instrument everything. Logs, metrics, traces—the three pillars of observability. Capture it all, store it forever, query it when you need it.
The problem? Human attention doesn't scale with data volume.
Your engineer wakes up to a PagerDuty alert. They have:
- 847 Prometheus metrics to sift through
- 50,000 log lines per second streaming into Elasticsearch
- Distributed traces spanning 23 microservices
- 8 different dashboards across 3 different tools
They have 5 minutes to find the root cause before customers start churning.
They will fail.
Not because they're bad engineers. Because the signal-to-noise ratio is broken.
The Four Sins of Modern Observability
Sin #1: Instrumenting Everything
Not all metrics matter equally. Your cache hit rate? Critical. The temperature of your database server's CPU? Interesting, but irrelevant 99% of the time.
Every metric you emit costs money to store, CPU cycles to process, and cognitive load to interpret. Most teams never do the cost-benefit analysis.
At Link11, we learned this the hard way during a DDoS incident. Our monitoring stack was consuming so much bandwidth logging the attack traffic that it became part of the problem. We were DDoS-ing ourselves with observability data.
Sin #2: Alerting on Symptoms, Not Impact
"CPU above 80%" is a symptom. "API latency P95 above 500ms" is impact.
Too many teams alert on resource utilization instead of user experience. The result? Alert fatigue, false positives, and the dreaded "everything's red but the site works fine" scenario.
The fix: instrument your SLOs, not your infrastructure. Start with what the user experiences—response time, error rate, availability—and work backwards.
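Here's a sketch of the difference, with made-up thresholds and a hypothetical `Request` record. Note what's absent: CPU never appears. Only user-facing signals decide whether anyone gets paged.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float
    ok: bool

def should_page(window: list[Request]) -> bool:
    """Page on user impact (P95 latency, error rate), never on CPU."""
    if len(window) < 20:
        return False  # too few samples for a stable P95
    p95 = quantiles([r.latency_ms for r in window], n=20)[-1]  # 95th percentile
    error_rate = sum(not r.ok for r in window) / len(window)
    return p95 > 500 or error_rate > 0.01  # thresholds come from your SLO

# A box at 95% CPU with healthy latency stays silent; a slow API pages.
window = [Request(120, True)] * 95 + [Request(900, False)] * 5
print(should_page(window))  # True: both P95 and error rate breach the SLO
```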
Sin #3: Storing Logs Forever
Log retention is expensive. Elasticsearch at scale is really expensive.
Most logs are useless after 7 days. Some logs are useless after 7 minutes. Yet teams default to "keep everything for a year" because storage feels cheap and FOMO is expensive.
The better approach: tiered retention. Hot logs (last 24 hours) stay in fast storage. Warm logs (7 days) go to cheaper storage. Cold logs (compliance, auditing) get archived to S3.
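The routing logic is simple enough to sketch; in practice it lives in your log store's lifecycle policy (Elasticsearch ILM, S3 lifecycle rules, or similar). The boundaries here are examples; tune them to how far back your incident investigations actually reach.

```python
from datetime import timedelta

def retention_tier(age: timedelta, compliance: bool = False) -> str:
    """Route a log batch to a storage tier by age alone."""
    if age <= timedelta(hours=24):
        return "hot"     # fast, indexed, expensive: live debugging
    if age <= timedelta(days=7):
        return "warm"    # slower, cheaper: recent-incident forensics
    if compliance:
        return "cold"    # archived to S3: audits and regulators only
    return "delete"      # most logs are worthless past a week

print(retention_tier(timedelta(hours=3)))                   # hot
print(retention_tier(timedelta(days=30)))                   # delete
print(retention_tier(timedelta(days=400), compliance=True)) # cold
```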
We cut our logging bill by 70% with this one change.
Sin #4: Tool Sprawl
Datadog for metrics. Splunk for logs. Jaeger for traces. Sentry for errors. PagerDuty for alerting.
Each tool is best-in-class. Together, they're a context-switching nightmare.
When you're troubleshooting an incident, every tool transition costs you 30 seconds and a mental context switch. That adds up fast when you're trying to correlate an error spike with a deployment event across three different UIs.
Consolidation isn't sexy, but it saves lives at 3am.
The High-Signal Observability Stack
So what's the alternative? Here's the framework I use:
1. Start with SLOs, not raw metrics
Define what "working" means for your users. API uptime? Latency? Data freshness? Pick 3-5 critical indicators and build everything else around them.
The gap between those targets and 100% is your error budget. Everything else is commentary.
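The arithmetic behind an error budget fits in three lines. A sketch with hypothetical targets: a 99.9% availability SLO over 30 days allows roughly 43 minutes of downtime, and every extra nine shrinks that by 10x.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed failure per window: everything the SLO doesn't promise."""
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999))   # 43.2 minutes of downtime per 30 days
print(error_budget_minutes(0.9999))  # 4.32 minutes: each extra nine costs 10x
```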
2. Alert on rate-of-change, not thresholds
"Error rate above 1%" is a bad alert if your baseline is 0.8%. "Error rate increased 5x in the last 5 minutes" is a great alert.
Thresholds are brittle. Anomaly detection is adaptive.
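In a Prometheus shop this would be a PromQL expression comparing a short rate() window against a long one. The logic itself, with hypothetical window names, is just this:

```python
def should_alert(rate_5m: float, rate_1h: float,
                 multiplier: float = 5.0, floor: float = 0.001) -> bool:
    """Alert when the recent error rate jumps relative to its own baseline.
    The floor keeps near-zero baselines from triggering on harmless blips."""
    baseline = max(rate_1h, floor)
    return rate_5m > baseline * multiplier

print(should_alert(rate_5m=0.009, rate_1h=0.008))  # False: 0.9% is normal here
print(should_alert(rate_5m=0.050, rate_1h=0.008))  # True: roughly 6x baseline
```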
3. Sample aggressively
You don't need to trace every request. You don't need to log every event.
Sample 1% of successful requests. Log 100% of errors. Trace 10% of slow requests. This gives you the signal without the storage bill.
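The whole policy is a handful of lines. A sketch (since the decision depends on status and latency, it has to run after the request completes; in production that usually means tail-based sampling in a collector such as the OpenTelemetry Collector):

```python
import random

def keep_trace(status_code: int, latency_ms: float) -> bool:
    """Keep all errors, a slice of slow requests, a sliver of everything else."""
    if status_code >= 500:
        return True                    # 100% of errors
    if latency_ms > 500:
        return random.random() < 0.10  # 10% of slow requests
    return random.random() < 0.01      # 1% of healthy traffic
```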
We process 50 billion requests per day at Link11. If we traced every one, we'd spend more on observability than infrastructure. Sampling at 0.1% still gives us millions of data points—more than enough to debug any issue.
4. Build runbooks, not dashboards
Dashboards are for exploring. Runbooks are for responding.
When an alert fires, your engineer shouldn't have to guess which dashboard to open or which queries to run. The runbook should say: "Check X, then Y, then Z. If none of those show anomalies, check the database."
Automate the runbook where possible. If a script can check something faster than a human, let the script do it.
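A runbook that's half script might look like this sketch. The checks are hypothetical stand-ins for whatever your stack actually exposes; the point is the ordering and the early exit.

```python
def run_runbook(steps) -> str:
    """Run checks in order; stop at the first finding."""
    for name, check in steps:
        finding = check()  # each check returns a finding string, or None
        if finding:
            return f"{name}: {finding}"
    return "No anomalies in standard checks. Escalate: check the database."

# Hypothetical checks, ordered by how often each one is the culprit.
steps = [
    ("recent deploys",  lambda: "api v2.14 rolled out 6 minutes ago"),
    ("upstream errors", lambda: None),
    ("queue depth",     lambda: None),
]
print(run_runbook(steps))  # recent deploys: api v2.14 rolled out 6 minutes ago
```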
5. Instrument your observability stack
Meta, I know. But your monitoring can fail too.
We once had an incident where our logging pipeline fell behind by 20 minutes. Engineers were looking at stale data and making decisions based on outdated information. The site was fine—our visibility into it wasn't.
Now we alert if log ingestion latency exceeds 10 seconds. Observability observing itself closes the loop.
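The check itself is trivial; what matters is that it exists at all. A sketch, assuming you can read the timestamp of the newest ingested log line (the 10-second threshold is ours, tune yours):

```python
from datetime import datetime, timedelta, timezone

def ingestion_lag_seconds(newest_ingested: datetime) -> float:
    """How far behind reality the log pipeline is running."""
    return (datetime.now(timezone.utc) - newest_ingested).total_seconds()

def pipeline_is_stale(newest_ingested: datetime, max_lag_s: float = 10.0) -> bool:
    """Fire when your window into production goes stale."""
    return ingestion_lag_seconds(newest_ingested) > max_lag_s

# A pipeline 20 minutes behind: the site may be fine, your visibility isn't.
print(pipeline_is_stale(datetime.now(timezone.utc) - timedelta(minutes=20)))  # True
```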
The Real Bottleneck: Human Attention
All of this comes back to one constraint: human attention is finite.
You can scale your infrastructure. You can scale your data pipeline. You cannot scale your on-call engineer's ability to process information under pressure.
The best observability stack isn't the one with the most features. It's the one that delivers the right information to the right person at the right time—and hides everything else.
At Link11, we protect over a million IP addresses. When an attack hits, we have seconds to respond. Our observability philosophy is simple: show me what's broken and how to fix it. Everything else is distraction.
That clarity doesn't come from more data. It comes from ruthless prioritization.
The Way Forward
The observability industry is maturing. We're moving past the "instrument everything" phase into the "instrument what matters" phase.
Tools like Honeycomb and Lightstep are building smarter sampling. Grafana is consolidating logs, metrics, and traces into a single pane of glass. OpenTelemetry is standardizing instrumentation so we can stop rewriting integrations every 18 months.
But the tooling is only half the solution. The other half is discipline:
- Instrument your SLOs, not your vanity metrics
- Alert on impact, not symptoms
- Sample aggressively to control costs
- Build runbooks that guide response
- Consolidate tools to reduce context switching
Do this, and you'll spend less time drowning in dashboards and more time actually fixing things.
Because at the end of the day, observability isn't about how much data you collect. It's about how fast you can get from "something's wrong" to "here's the fix."
Everything else is noise.
Follow the journey
Subscribe to Lynk for daily insights on AI strategy, cybersecurity, and building in the age of AI.