Cloud Drift Detection: How to Resolve Out-of-State Changes

Cloud configurations change. All the time. It’s futile to imagine web app development without a constant stream of configuration changes in order to adopt new technologies, release new features, and support new business requirements.

For organizations that manage their cloud provisioning with infrastructure as code (IaC), making those changes is more predictable and less error-prone. But no matter how airtight your IaC implementation is, out-of-band changes, known as drift, are inevitable.

Simply put, drift occurs when a configuration changes and no longer matches its previously recorded value. In the case of IaC, drift occurs when the live configuration differs from the state defined at build time.

IaC and drift

One of the benefits of IaC is that it keeps drift to a minimum by codifying the intended configuration and enforcing guardrails throughout the provisioning and deployment process. Barring “break glass” scenarios or “SREmergencies,” there should be no reason to log into a cloud console and make configuration changes directly in the UI or via the CLI.

In reality, however, those scenarios and changes do happen. Maintenance and incident response tasks, for example, require ad hoc changes to existing configurations, for analysis and troubleshooting of ongoing issues. Knowledge and information gaps between teams can also result in changes, or strings of changes, that have drifted from the original plan.

Regardless of where your organization is in its IaC implementation journey, creating a systematic drift detection strategy is key to understanding and improving your cloud posture.

The keys to a successful IaC-to-IaaS drift detection strategy are to:

Understand when drift becomes a risk.
Implement drift detection automation relevant to your stack.
Classify drift and route the response to the right individual or team.

Understanding risks associated with drift

Any code-to-cloud gap can be classified as drift, but that doesn’t mean every out-of-state configuration change leaves your infrastructure exposed. Still, there are situations where ad hoc changes can cause environment instability, deployment issues, unpredictable costs, and gaps in security or compliance.

One of the most important considerations on this topic, however, is the risk of drift becoming a permanent fixture. When it comes to an account’s trust boundaries, such as its locked-down networking settings or identity-based permissions, not knowing why and where a configuration was changed can result in unmanaged blind spots. At scale, that can cause rifts in those boundaries and result in catastrophic losses due to a data breach.

Imagine deliberately opening a port to the internet to troubleshoot DNS issues, or granting admin privileges to a role to complete a complex database migration. If those settings were defined ad hoc using the CLI and never reverted after the project or bug was resolved, there’s a possibility that they’ll become permanent. Those are exactly the forms of drift that could become an opportunity for an adversary attempting to infiltrate an exposed cloud asset.

On the other hand, not all drift presents a real or exploitable risk; some of it may even be intentional. It’s important to understand which drift results from dynamic resource creation or alteration to meet scalability requirements, especially before automating your drift detection.
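One way to account for intentional drift before automating detection is to keep an allowlist of attributes that are expected to change at runtime. The following is a minimal sketch, not a definitive implementation: the attribute paths, the pattern list, and the function names are all hypothetical and would need to reflect your own stack.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist: attribute paths that are expected to change at
# runtime (for example, capacity values adjusted by autoscaling).
EXPECTED_DRIFT_PATTERNS = [
    "aws_autoscaling_group.*.desired_capacity",
    "aws_ecs_service.*.desired_count",
]


def is_expected_drift(attribute_path, patterns=EXPECTED_DRIFT_PATTERNS):
    """Return True if the drifted attribute matches an allowlisted pattern."""
    return any(fnmatchcase(attribute_path, pattern) for pattern in patterns)


def actionable_drift(drifted_paths, patterns=EXPECTED_DRIFT_PATTERNS):
    """Keep only the drifted attributes that are not allowlisted."""
    return [path for path in drifted_paths if not is_expected_drift(path, patterns)]


if __name__ == "__main__":
    drifted = [
        "aws_autoscaling_group.web.desired_capacity",  # changed by autoscaling
        "aws_security_group.web.ingress",              # changed by hand
    ]
    print(actionable_drift(drifted))
```

Keeping the allowlist in version control alongside your IaC means that the decision to ignore a class of drift is itself reviewed and recorded.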

Detecting drift between IaC and the cloud

The best way to mitigate potential risk is to identify altered configurations in real time, understand the original intent, and revert them.

The most basic way to do that is to programmatically and continuously compare configuration within your IaaS with that of your IaC plan files. Because not all IaC frameworks persist states with the same cadence or at the same level of detail, how you do that will vary based on the methods you use to persist configuration states.
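At its core, that comparison is a recursive diff between the configuration you declared and the configuration that is actually running. Here is a minimal sketch, assuming both sides have already been loaded into nested dictionaries; the function name, the MISSING marker, and the sample data are illustrative assumptions, not part of any framework.

```python
# Marker used when a key exists on only one side (illustrative choice).
MISSING = "<missing>"


def find_drift(declared, actual, prefix=""):
    """Recursively compare two nested dicts.

    Returns a list of (path, declared_value, actual_value) tuples, one for
    each attribute whose live value differs from the declared value.
    """
    drift = []
    for key in sorted(set(declared) | set(actual)):
        path = f"{prefix}.{key}" if prefix else key
        want = declared.get(key, MISSING)
        have = actual.get(key, MISSING)
        if isinstance(want, dict) and isinstance(have, dict):
            # Both sides are nested: descend and compare attribute by attribute.
            drift.extend(find_drift(want, have, path))
        elif want != have:
            drift.append((path, want, have))
    return drift


if __name__ == "__main__":
    declared = {"ingress": {"port": 443, "cidr": "10.0.0.0/8"}, "tags": {"env": "prod"}}
    actual = {
        "ingress": {"port": 443, "cidr": "0.0.0.0/0"},
        "tags": {"env": "prod", "debug": "true"},
    }
    for path, want, have in find_drift(declared, actual):
        print(f"{path}: declared={want!r} actual={have!r}")
```

In practice, the hard part is producing the two dictionaries: the declared side comes from your IaC state or plan, and the actual side comes from your cloud provider’s APIs.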

In declarative IaC frameworks like Terraform and AWS CloudFormation, persisting state is key to continuously evaluating changes against the originally planned configuration. Detecting drift in these frameworks is fairly straightforward: most of them expose a command or API for it, such as terraform plan for Terraform or detect-stack-drift for CloudFormation. The main challenge with querying and evaluating those states is communicating across code and cloud in real time, not just when provisioning new infrastructure.
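For Terraform, one way to run such a check on a schedule is to invoke terraform plan with the -detailed-exitcode flag, which exits with 0 when the plan is empty, 1 on error, and 2 when the plan contains changes. The sketch below assumes the terraform CLI is installed and the working directory is already initialized; the function names and status labels are hypothetical. Note that a non-empty plan can also mean the code has changes that were never applied, so a result of 2 signals a code-to-cloud gap rather than drift specifically.

```python
import subprocess


def classify_plan_exit_code(code):
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"                    # empty plan: code and cloud match
    if code == 2:
        return "drift_or_pending_changes"   # non-empty plan: a gap exists
    return "error"                          # 1 (or anything else): plan failed


def check_terraform_drift(workdir):
    """Run a read-only plan in `workdir` and return (status, plan_output).

    Assumes the terraform CLI is on PATH and `terraform init` has been run.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode), result.stdout
```

A scheduled job could call check_terraform_drift for each workspace and raise an alert whenever the status is anything other than in_sync.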

Frameworks that allow an imperative control plane, like Kubernetes (via kubectl), pose a bigger challenge to drift detection. With Kubernetes, the more you use imperative commands in day-to-day workflows, the harder it becomes to detect and respond to drift effectively. While Kubernetes doesn’t include a native drift detection API, it offers some of the best tooling for keeping configuration consistent, using admission controllers and webhook listeners.

Perhaps the most challenging configuration methods to detect drift within are semi-declarative frameworks like Serverless, AWS CDK, and Pulumi. Because they’re built on languages that include imperative operators, the variety of possible configuration permutations makes it challenging to evaluate what a state is and how it should be addressed.

Depending on your infrastructure management approach, there may be a tool or project out there that can help:

driftctl is a free and open-source CLI that tracks, analyzes, prioritizes, and warns of infrastructure drift in Terraform and AWS.
Kubediff is a tool for Kubernetes to show you the differences between your running configuration and your version-controlled configuration.
AWS supports ad hoc CloudFormation drift detection from the Console, CLI, or from your own code.
SaaS tools like Bridgecrew, which address configuration at both build time and runtime, may also provide drift detection across your IaaS provider and your IaC state files.

Responding to drift

Detecting drift is only half the battle. To determine how to effectively respond to drift, start by answering these questions:

Are trust boundary configurations completely locked down using git flows and organization policies?
Is IaC the preferred choice for generating a non-urgent infrastructure change?
Do “break glass” scenarios occur often, and are DevOps engineers responsible for reverting systems to their correct known states?
Are developers using approved systems to apply changes directly at runtime?
Are drift detection capabilities in place to alert on newly found drift in real time?

Those answers will help determine who is notified, where feedback is inserted, the priority level assigned to drifts, and more.

Regardless of your approach, the one guiding principle you should adhere to in your drift detection efforts is to surface and resolve drifts as close to developers as possible. Because developers have the most comprehensive history and knowledge of any given configuration, routing issues directly to code owners can save hours of trying to figure out who made what change and why.
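Routing drift to code owners can be sketched as a lookup from the IaC source file that defines a resource to the team that owns it, combined with a rough priority. Everything in this example is an assumption made for illustration: the team handles, the directory prefixes, the keyword list, and the shape of a finding would all come from your own repository layout and tooling.

```python
from typing import NamedTuple

# Hypothetical ownership map: IaC directory prefix -> owning team.
OWNERS = {"network/": "@netsec-team", "iam/": "@identity-team"}
DEFAULT_OWNER = "@platform-team"

# Hypothetical keywords that mark a finding as touching a trust boundary.
HIGH_RISK_KEYWORDS = ("ingress", "iam", "policy", "public")


class DriftFinding(NamedTuple):
    resource: str     # e.g. "aws_security_group.web"
    attribute: str    # e.g. "ingress"
    source_file: str  # IaC file that declares the resource


def priority(finding):
    """Return "high" if the drift touches a trust boundary, else "low"."""
    text = f"{finding.resource}.{finding.attribute}".lower()
    return "high" if any(word in text for word in HIGH_RISK_KEYWORDS) else "low"


def owner_for(finding):
    """Pick the owner whose directory prefix matches; longest prefix wins."""
    matches = [prefix for prefix in OWNERS if finding.source_file.startswith(prefix)]
    return OWNERS[max(matches, key=len)] if matches else DEFAULT_OWNER


def route(finding):
    """Decide who is notified about a finding and how urgently."""
    return {"owner": owner_for(finding), "priority": priority(finding)}


if __name__ == "__main__":
    finding = DriftFinding("aws_security_group.web", "ingress", "network/sg.tf")
    print(route(finding))
```

If your repository already has a CODEOWNERS file, deriving the ownership map from it avoids maintaining the same information in two places.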

Relying on DevOps to trace the origin of drift by reverse-engineering the logic behind the change may be futile. If you must go that route, start by assessing the drift’s potential risk as you would a bug bounty submission, rather than blindly accepting it and investing time in fixing it. You may find that the change doesn’t actually need to be reverted but can, instead, be reconciled back into your IaC state file.

We often think of IaC as a one-way street, but as your configuration journey evolves, it’s not uncommon to take steps in either direction for optimal usability and a strong cloud posture.

Drift on!