The Hidden Economics of Data at Scale
In 2018, we made a decision that seemed smart at the time: we moved 40TB of security logs into AWS S3. The ingestion was free. The storage was cheap—$0.023 per GB/month. The analytics pipeline ran smoothly on EC2. Everything looked great on the spreadsheet.
Then we tried to move it.
Egress fees were the first surprise. At $0.09 per GB, moving data out costs $90 per terabyte, so 40TB comes to $3,600 just to reclaim our own data, and that is before any engineering time. The project we were migrating to? It had a $60,000 annual budget.
Welcome to the data gravity trap.
What Is Data Gravity?
Data gravity is a term coined by Dave McCrory in 2010, but it's more relevant today than ever. The idea is simple: as data accumulates, it exerts a gravitational pull on applications and services. The larger the dataset, the harder—and more expensive—it becomes to move.
This isn't just about egress fees (though those are brutal). It's about:
- Transfer time: Moving petabytes takes weeks, even on dedicated fiber.
- Operational risk: Every migration is a potential outage or data loss event.
- Opportunity cost: Your engineering team is stuck babysitting data pipelines instead of building features.
- Vendor lock-in: Once your data is "in the well," switching providers becomes economically irrational.
The result? Your cloud bill only goes up.
The Trap Is Deliberate
Cloud providers aren't stupid. They know the real money isn't in compute or storage—it's in keeping you locked in.
Here's the playbook:
- Make ingestion free: Get your data in. No friction. "Just upload it and see what happens!"
- Make storage cheap: At scale, S3 costs pennies per GB. You'll never notice the bill until it's massive.
- Make egress expensive: Want to leave? That'll be $0.09/GB. For a petabyte, that's $90,000.
The trap closes when your dataset reaches a certain threshold—usually around 10-50TB. Below that, migration is annoying but feasible. Above it, the cost and complexity make it economically irrational to leave.
At that point, you're not a customer anymore. You're a hostage.
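The egress arithmetic in the playbook is easy to check. The following is a minimal sketch, assuming a single flat rate of $0.09/GB with no volume tiers or free allowance, and decimal units (1 TB = 1,000 GB). The function and constant names are illustrative, not part of any provider's tooling.

```python
# Flat-rate egress estimate. Assumptions: a single $0.09/GB rate with no
# volume tiers or free allowance, and decimal units (1 TB = 1,000 GB).
EGRESS_CENTS_PER_GB = 9
GB_PER_TB = 1_000


def egress_cost_usd(terabytes: int) -> float:
    """Return the estimated egress cost in dollars for a dataset size in TB."""
    # Work in integer cents to avoid floating-point rounding noise.
    cents = terabytes * GB_PER_TB * EGRESS_CENTS_PER_GB
    return cents / 100


for size_tb in (10, 50, 1_000):
    print(f"{size_tb:>5} TB -> ${egress_cost_usd(size_tb):,.0f}")
```

Real invoices depend on the provider's tiering and any negotiated discounts, so treat the result as an upper-bound estimate at list price.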
Real-World Example: The Link11 Log Migration
At Link11, we handle billions of security events per day. DDoS traffic patterns, threat intelligence feeds, BGP route tables—it all gets logged for analysis and compliance.
In 2019, we hit a crossroads. Our primary analytics provider was raising prices. A competitor offered a better feature set at half the cost—but it required moving 200TB of historical data to their platform.
The math was brutal:
- Egress cost: $18,000 (200TB × $0.09/GB)
- Ingestion cost: $0 (competitor waived it)
- Transfer time: 6 weeks over a 10Gbps dedicated line
- Engineering cost: 3 engineers, full-time, for 2 months
We stayed. Not because the incumbent was better—but because the cost of leaving exceeded the savings. That's data gravity in action.
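The stay-or-leave decision above comes down to two numbers: how long the transfer takes and how long the move needs to pay for itself. Here is a rough sketch of both, assuming decimal units and treating every input as a parameter. The `efficiency` factor and the function names are hypothetical; the source does not say what throughput was actually sustained.

```python
# Back-of-the-envelope migration math. All inputs are parameters; the
# efficiency factor is a hypothetical knob, not a measured value.

def transfer_days(terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `terabytes` over a `link_gbps` link.

    `efficiency` is the fraction of line rate actually sustained (0 < e <= 1).
    Uses decimal units: 1 TB = 8e12 bits, 1 Gbps = 1e9 bits/s.
    """
    bits = terabytes * 8e12
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400


def payback_months(one_time_cost: float, monthly_savings: float) -> float:
    """Months until a one-time migration cost is repaid by monthly savings."""
    if monthly_savings <= 0:
        return float("inf")  # the move never pays for itself
    return one_time_cost / monthly_savings


# 200 TB over a 10 Gbps link at full line rate (a theoretical lower bound).
print(f"{transfer_days(200, 10):.2f} days")
```

At full line rate the raw transfer is a matter of days, so a multi-week estimate implies that sustained throughput, validation, and cutover work dominate the schedule rather than link capacity.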
The Compounding Problem
Here's what makes this worse: data gravity compounds.
Every service that reads from that dataset creates a new dependency. Your analytics pipeline, your machine learning models, your compliance reports—they all start to assume the data lives in one place.
Now it's not just about moving the data. It's about:
- Rewriting queries for a new database engine
- Updating API endpoints in 47 microservices
- Retraining models on a new data schema
- Reconfiguring IAM policies, VPC peering, and security groups
The technical debt of migration grows with every service that depends on the data, not just with the size of the dataset.
The Egress Fee Illusion
Everyone focuses on egress fees because they're visible. But they're only 30-40% of the total cost of migration.
The hidden costs are:
- Downtime risk: Can your business survive 6 hours of read-only mode during the cutover?
- Data validation: How do you prove that 500TB transferred correctly? Checksums at scale are non-trivial.
- Dual-write overhead: Most migrations require running both systems in parallel for weeks. Double the infrastructure cost.
- Rollback complexity: If the migration fails halfway, how do you roll back 100TB of writes?
These aren't line items on an AWS invoice, but they're real costs—paid in engineering time, system complexity, and risk.
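To make the data validation item concrete, one common approach is to build a checksum manifest on each side and compare them. The sketch below assumes both copies are reachable as local file trees, which is a simplification; object stores typically expose their own checksum metadata, which is not covered here. All function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to `root` to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(source: dict[str, str], target: dict[str, str]) -> dict[str, list[str]]:
    """Report files missing from the target, unexpected extras, and digest mismatches."""
    return {
        "missing": sorted(source.keys() - target.keys()),
        "extra": sorted(target.keys() - source.keys()),
        "mismatched": sorted(
            name for name in source.keys() & target.keys() if source[name] != target[name]
        ),
    }
```

Building the source manifest before the transfer starts and the target manifest after it finishes gives a simple pass/fail report. At hundreds of terabytes the hashing itself becomes a job that needs to be parallelized and budgeted for.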
How to Escape the Trap
The best time to avoid data gravity was before you started. The second-best time is now.
1. Design for Portability from Day One
Use open standards and avoid proprietary features. If you build on DynamoDB's proprietary API and data model, you're locked in. If you use Postgres with standard SQL, you can move between RDS, Aurora, and bare-metal Postgres with little or no code rewriting.
Principle: Your data layer should be boring. Use the most standard, portable stack you can tolerate.
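As a small illustration of a "boring" data layer, the sketch below keeps the query in plain standard SQL and talks to the database only through Python's generic DB-API interface. It uses `sqlite3` purely as a stand-in so the example runs anywhere; the table, columns, and function name are hypothetical.

```python
import sqlite3

# sqlite3 is only a stand-in here so the sketch runs anywhere; the table and
# column names are hypothetical. The query text is plain standard SQL and the
# query function uses only Python DB-API 2.0 (PEP 249) calls.
# Caveat: placeholder style differs between drivers (sqlite3 uses "?",
# psycopg uses "%s"), so it is kept in one place.
PLACEHOLDER = "?"

COUNT_BY_SOURCE = f"""
    SELECT source, COUNT(*) AS events
    FROM security_events
    WHERE severity >= {PLACEHOLDER}
    GROUP BY source
    ORDER BY source
"""


def events_by_source(conn, min_severity: int) -> list[tuple[str, int]]:
    """Run the query through the generic DB-API cursor interface."""
    cursor = conn.cursor()
    cursor.execute(COUNT_BY_SOURCE, (min_severity,))
    return cursor.fetchall()


# Setup below uses sqlite3 convenience methods; only the function above is
# meant to be driver-agnostic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security_events (source TEXT, severity INTEGER)")
conn.executemany(
    "INSERT INTO security_events VALUES (?, ?)",
    [("bgp", 3), ("ddos", 5), ("ddos", 4), ("feed", 1)],
)
print(events_by_source(conn, 3))
```

Swapping the engine then means changing the connection and the placeholder constant rather than rewriting every query, provided the SQL stays within the standard subset.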
2. Implement Multi-Cloud Replication Early
If your data already lives in two clouds (AWS + GCP, for example), the marginal cost of moving to a third is much lower. You've already solved the hard problems: schema translation, networking, and eventual consistency.
Trap avoidance: Replication is expensive upfront, but it keeps your optionality alive.
3. Separate Hot and Cold Data
Not all data is equal. Your last 30 days of logs? Hot. Queried constantly. Must be fast.
Your 3-year-old compliance archives? Cold. Accessed once a quarter, if ever.
Store cold data in Glacier or equivalent, and accept that it's nearly impossible to move. For hot data, keep it in a portable format (Parquet, Avro, etc.) and replicate aggressively.
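The hot/cold split can be expressed as a simple age rule. This is a minimal sketch using the 30-day window from the example above; the object keys and the `storage_tier` helper are hypothetical, and a real policy would likely also weigh access frequency.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=30)  # threshold taken from the "last 30 days" example


def storage_tier(created: datetime, now: datetime) -> str:
    """Return "hot" if the data is at most HOT_WINDOW old, else "cold"."""
    return "hot" if now - created <= HOT_WINDOW else "cold"


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
objects = {
    "logs/2024-01-30.parquet": datetime(2024, 1, 30, tzinfo=timezone.utc),
    "archive/2021-02.parquet": datetime(2021, 2, 1, tzinfo=timezone.utc),
}
for key, created in objects.items():
    print(key, storage_tier(created, now))
```

Running a rule like this on a schedule keeps the portable, replicated hot set small, which is the part you actually need to be able to move.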
4. Negotiate Egress Waivers
Most enterprise cloud contracts have hidden flexibility. If you're spending $500k/year with AWS, you can negotiate egress waivers for migrations—especially if you're moving within their ecosystem (e.g., S3 to Snowflake on AWS).
Insider tip: Cloud providers would rather waive egress fees than lose a large customer entirely.
5. Use Hybrid / Edge Architectures
The future isn't "all cloud" or "all on-prem"—it's strategic distribution.
At Link11, we now run:
- Real-time traffic analysis: On-prem (latency-sensitive, petabyte-scale)
- Machine learning training: Cloud (bursty GPU needs)
- Long-term archives: Tape and object storage (write-once, read-never)
This isn't "multi-cloud" in the buzzword sense—it's deliberate data placement based on access patterns and cost.
The Mental Model Shift
Most teams think about cloud costs like this:
"Storage is $0.02/GB. Compute is $0.10/hour. NBD."
The correct model is:
"Storage is cheap. Leaving is expensive. Every TB we add is a future liability."
This doesn't mean "don't use the cloud." It means:
- Be intentional about what data you store and where.
- Architect for portability even if you never plan to move.
- Treat vendor lock-in as a risk, not an inevitability.
The Real Cost of Data Gravity
The scariest part? Most companies don't realize they're trapped until it's too late.
You start small. A few terabytes. "We'll move it later if we need to." Then it's 50TB. Then 500TB. Then someone runs the numbers and realizes: migration would cost more than two years of current spend.
At that point, you're not making technical decisions anymore. You're making financial ones—and the gravity well has you.
The Way Forward
Data gravity isn't going away. If anything, it's accelerating. AI workloads require more data, not less. Vector databases, embeddings, training corpuses—every new AI feature adds another 10TB to the pile.
The winners in the next decade won't be the companies with the most data. They'll be the companies that control where their data lives—and can move it when the economics change.
Design for portability. Negotiate aggressively. And never, ever assume your cloud bill will go down.
Because once you're in the gravity well, escape velocity is measured in hundreds of thousands of dollars.
And your CFO will not be happy.
Jens-Philipp Jung is CEO of Link11, where he's spent 20 years defending critical infrastructure from DDoS attacks, nation-state actors, and surprisingly expensive cloud bills. He's moved petabytes of data more times than he'd like to admit.