Whenever people talk about internal developer platforms (IDPs), the conversation usually revolves around efficiency — faster deployments, a smoother developer experience, and better self-service workflows. While all of that is definitely important, one of the toughest lessons I picked up as a platform engineer is that success has a side effect: without the right controls in place, costs tend to creep up quietly in the background as the platform grows.
That’s what makes managing platform costs so slippery. It’s rarely one massive blunder; instead, it’s a build-up of small, reasonable choices made across different teams and environments. Maybe a developer spins something up for a quick test, a snapshot gets created and forgotten, or a non-production environment just keeps running because it slipped everyone’s mind. On their own, these decisions seem minor, but at scale, they really add up.
Looking back, there were three areas that were consistently the biggest headaches to keep under control: governance at scale, data storage sprawl, and environment sprawl. The takeaway was always the same: manual oversight just doesn’t cut it. The only thing that actually worked long-term was using automation with solid guardrails.
Governance and Enforcement at Scale
If I had to name the hardest cost problem to solve at scale, it would be governance.
I say that because without governance, every other optimization effort is fragile. You can shut down unused resources. You can rightsize instances. You can clean up storage. But if developers can still provision whatever they want, however they want, without the right scope, approvals, or tagging, those savings can disappear almost immediately.
In my experience, governance was not about slowing developers down. It was about creating safe, approved pathways that made the right choice easy to reach. We did that through what I think of as golden modules. These were approved patterns that had already been vetted by security and other stakeholders, and they were simple enough for developers to use without needing deep infrastructure expertise. That easy experience mattered. If the compliant path was too complicated, people would find other workarounds.
We also started integrating policy enforcement more directly into pipelines using tools like Open Policy Agent, alongside more traditional controls like service control policies and IAM boundaries. That combination gave us both preventive and runtime enforcement. It meant we could limit which configurations were acceptable, enforce tagging and policy requirements, and reduce the number of ways cost-heavy decisions could slip into the environment.
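To make that concrete, here is a minimal sketch of the kind of tag check a pipeline stage can run against a Terraform plan. It's illustrative only: the required tag keys and the plan-file handling are assumptions, and in our setup the equivalent rule lived in OPA's Rego policies rather than a one-off Python script.

```python
import json
import sys

# Required tag keys are an illustrative stand-in for an internal tagging standard.
REQUIRED_TAGS = {"owner", "cost-center", "environment"}

def find_untagged(plan_path: str) -> list[str]:
    """Return addresses of to-be-created resources missing a required tag."""
    with open(plan_path) as f:
        plan = json.load(f)  # output of `terraform show -json tfplan`

    failures = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        after = change.get("change", {}).get("after") or {}
        tags = after.get("tags") or {}
        # Flag any newly created resource that does not carry every required tag.
        if "create" in actions and not REQUIRED_TAGS <= set(tags):
            failures.append(change["address"])
    return failures

if __name__ == "__main__":
    missing = find_untagged(sys.argv[1])
    if missing:
        print("Blocking apply; resources missing required tags:")
        print("\n".join(f"  {address}" for address in missing))
        sys.exit(1)
```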
Why was that so important? Because oversizing and rogue provisioning happen more easily than people think. I’ve seen cases where someone spins up infrastructure far larger than what the workload actually needs simply because it was available and nobody stopped them. If a developer launches something dramatically oversized for a task that only needs a fraction of that capacity, you’re suddenly paying many times more per hour than necessary. It doesn’t take many of those mistakes to create real waste.
Interestingly, this happened far less in production once teams were funneled into stronger operating guidelines. The real problem was dev and test. In those environments, it came up often, sometimes daily, depending on the team and what they were working on.
That’s where the balance gets hard. You do not want cost optimization to become a tax on innovation. If your controls are so rigid that teams cannot experiment, test ideas, or improve the product, then you are losing in a different way. But that does not mean you accept unlimited drift. It means you refine the system until innovation can happen inside guardrails.
I came to believe governance is not optional for enterprise-scale platform cost control. Without it, you might get isolated wins on individual teams. With it, you get repeatable outcomes across the organization. Good governance turns cost control from a one-off cleanup project into an operating model.
Data Storage Sprawl
The second major category was data storage sprawl, and I’d actually widen that to include networking sprawl too, because the two were often closely connected in the bill.
This was one of the most common patterns I saw when joining a new team or starting a manual audit of an environment. I’d log in and find old EBS snapshots, RDS snapshots, backups, and large file systems full of data that had not been touched in 30, 60, or 90 days. In many cases, that data should have been archived to a lower-cost tier, snapshotted more intelligently, or removed entirely. Instead, it just sat there indefinitely.
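That kind of audit is exactly the sort of thing worth scripting rather than doing by hand. Here's a rough boto3 sketch of the snapshot portion; the 90-day cutoff is an illustrative threshold, not a policy recommendation, and it only reports rather than deletes.

```python
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # illustrative threshold

# Walk every EBS snapshot the account owns and flag the ones older than the cutoff.
paginator = ec2.get_paginator("describe_snapshots")
stale = []
for page in paginator.paginate(OwnerIds=["self"]):
    for snap in page["Snapshots"]:
        if snap["StartTime"] < cutoff:
            stale.append((snap["SnapshotId"], snap["StartTime"].date(), snap["VolumeSize"]))

# Report only; deciding whether to archive or delete stays a human call here.
for snapshot_id, created, size_gib in sorted(stale, key=lambda s: s[1]):
    print(f"{snapshot_id}  created {created}  {size_gib} GiB")
```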
This is one of the least intuitive parts of cloud cost management. In a traditional environment, people think in terms of fixed storage capacity. In the cloud, every one of those decisions is a buying decision. A snapshot that gets taken and forgotten still costs money. A file system full of stale data still costs money. A backup with no retention plan still costs money. If nobody tags it, tracks it, or owns it, it can sit there for years producing no real value while continuing to generate spend.
I saw the same pattern in networking. One team I worked with was paying roughly $12,000 a month in NAT gateway charges when cheaper alternatives were available. In some cases, VPC endpoints would have solved the problem at a fraction of the cost or none at all. The point is not that NAT gateways are bad. The point is that cloud networking choices often become expensive simply because nobody revisits them.
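One lightweight way to force that revisit is to make NAT gateway traffic visible on a schedule. This is a hedged sketch using the standard AWS/NATGateway CloudWatch metrics; the 30-day window is an arbitrary choice, and the byte counts are a proxy for data-processing charges, not a bill.

```python
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=30)  # illustrative lookback window

for gw in ec2.describe_nat_gateways()["NatGateways"]:
    # Daily sums of bytes sent out through the gateway over the window.
    stats = cw.get_metric_statistics(
        Namespace="AWS/NATGateway",
        MetricName="BytesOutToDestination",
        Dimensions=[{"Name": "NatGatewayId", "Value": gw["NatGatewayId"]}],
        StartTime=start,
        EndTime=end,
        Period=86400,
        Statistics=["Sum"],
    )
    total_gb = sum(point["Sum"] for point in stats["Datapoints"]) / 1e9
    print(f'{gw["NatGatewayId"]}: ~{total_gb:.0f} GB processed in the last 30 days')
```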
Another version of this problem showed up in test environments. Teams would import masked production-like data, run their tests, and then leave behind large volumes and supporting storage that were no longer needed. The test was completed. The infrastructure remained. And because nobody had built strong cost hygiene into the workflow, it just sat there.
A big takeaway from our storage cleanup was that cost optimization and security go hand in hand across all AWS resource types. Outdated backups, orphaned volumes, and lingering snapshots are more than unnecessary expenses; they are also potential attack vectors. The only certain way to remove all risk is to have no digital footprint at all, which isn't a viable business strategy, so we focused on acting as stewards of every part of our infrastructure. That let us shrink our attack surface while making sure we weren't spending on components that contributed nothing to our success.
It’s a recurring lesson for me: platform cost work is rarely only about cost. Good hygiene usually improves several things at once.
Environment Sprawl
The third category was environment sprawl, and this one was relentless.
By environment sprawl, I mean the constant expansion of sandbox, dev, test, and QA resources that get created for legitimate reasons but never shut down once their immediate purpose is gone. This was especially common with compute-heavy resources and temporary supporting services.
One case I remember clearly involved specialized development instances that carried expensive licensing requirements. These were not cheap machines. They were costing us around $26 to $27 per hour in the dev environment, and we were running several of them. When we looked at utilization, we found long stretches overnight with effectively zero use. No meaningful activity, no need for those resources to remain on, and yet they kept running.
That pattern extended beyond individual instances. Someone would spin up a Redis cluster to test an idea around latency, or bring up supporting infrastructure to validate a concept, and then move on. The environment kept growing because nothing forced those resources back out once the test was over.
The fix that worked best for us was an off-by-default model combined with automation. In test environments especially, we used a kind of time-to-live strategy. At a set time each evening, non-production resources were turned off by default. Developers who needed exceptions could still get what they needed through the right tags, IAM permissions, and process. But the baseline behavior changed from always on to intentionally on.
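A stripped-down version of that evening job might look like the sketch below. The tag conventions (an environment tag of dev or test, and a keep-on exception tag) are illustrative stand-ins for whatever your organization standardizes on, and a real version would cover far more than EC2 and run on a schedule.

```python
import boto3

ec2 = boto3.client("ec2")

def stop_nonprod_instances() -> list[str]:
    """Stop running dev/test instances that don't carry the exception tag."""
    paginator = ec2.get_paginator("describe_instances")
    filters = [
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:environment", "Values": ["dev", "test"]},  # illustrative tag scheme
    ]
    to_stop = []
    for page in paginator.paginate(Filters=filters):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                # "keep-on" is the opt-out: the default is off, exceptions are explicit.
                if tags.get("keep-on", "").lower() != "true":
                    to_stop.append(instance["InstanceId"])
    if to_stop:
        ec2.stop_instances(InstanceIds=to_stop)
    return to_stop

if __name__ == "__main__":
    stopped = stop_nonprod_instances()
    print(f"Stopped {len(stopped)} instances:", *stopped)
```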
This one shift surfaced a huge amount of waste.
What we found was that many things that got turned off never needed to be turned back on. They had been left running simply because the environment made it easy to forget them. In some cases, I’d estimate that around 70% of the resources we powered down by default would have stayed on indefinitely if nobody had created that automation.
That number sounds wild until you remember how cloud infrastructure behaves. It’s easy to provision. It’s remote. It’s abstracted away from sight. If nobody is actively looking for unused resources, they can persist for months with very little friction.
This is why I keep coming back to automation. It was the consistent answer. Not because people were careless, but because human memory and manual cleanup do not scale. Automation gave us air cover. Guardrails kept teams from drifting too far, and when something slipped past those guardrails, automation helped us catch it, tag it, shut it down, or remove it later.
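As one small example of the "catch it, tag it" part, the Resource Groups Tagging API makes it easy to sweep a region for resources missing an owner tag. The owner key here is our convention rather than anything AWS enforces, and this API only sees resources that are or were tagged, so it complements, rather than replaces, the preventive checks.

```python
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

def resources_missing_owner() -> list[str]:
    """Return ARNs of resources visible to the tagging API without an 'owner' tag."""
    unowned = []
    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate():
        for res in page["ResourceTagMappingList"]:
            tag_keys = {t["Key"] for t in res.get("Tags", [])}
            if "owner" not in tag_keys:
                unowned.append(res["ResourceARN"])
    return unowned

if __name__ == "__main__":
    # Feed this list into whatever follow-up you use: notify, tag, or schedule removal.
    for arn in resources_missing_owner():
        print(arn)
```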
That is what made platform cost control sustainable.
The biggest lesson I took from all of this is that internal developer platform costs become hardest to control when ownership is unclear and cleanup depends on human follow-through. Governance, storage hygiene, and environment lifecycle management all get harder as the platform succeeds and adoption grows. That’s why the answer cannot be more meetings or more reminders. It has to be better systems.
For me, the winning approach was always some combination of clear pathways, strong guardrails, and relentless automation. That’s what let us support developer speed without letting costs spiral in the background.