💥 How I Stopped AWS Config Drift Before It Broke Prod
🧨 The Problem
One day, a change went live in our AWS environment that no one remembered making.
Security groups had shifted. Auto-scaling thresholds were off.
What we had in Terraform… didn’t match reality.
I built the drift detector described here while working at Apex Analytix, where even minor infrastructure changes could impact supplier portals and audit workflows. We needed early warnings, not postmortems.
Welcome to the world of configuration drift — where even small changes can cause cascading chaos.
🧪 The Challenge
AWS Config tells you what changed… eventually.
But I needed visibility within minutes, not hours later and not after something broke.
I wanted something that could:
- ✅ Compare live infra against baseline configs
- ✅ Alert as soon as a change is detected
- ✅ Roll back if needed (with a toggle)
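Here's a minimal sketch of the first item on that list: comparing a live snapshot against a baseline. It is not the code from the repo. The function names (`flatten`, `detect_drift`) and the sample resource IDs are made up for illustration, and it assumes both sides are already plain nested dicts (for example, parsed JSON).

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted paths so values can be compared one by one.

    Assumption: keys themselves contain no dots; lists are compared as whole values.
    """
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def detect_drift(baseline: dict, live: dict) -> list:
    """Return one record per path that was changed, added, or removed in live."""
    base_flat, live_flat = flatten(baseline), flatten(live)
    drift = []
    for path in sorted(base_flat.keys() | live_flat.keys()):
        in_base, in_live = path in base_flat, path in live_flat
        if in_base and in_live:
            if base_flat[path] != live_flat[path]:
                drift.append({"path": path, "kind": "changed",
                              "expected": base_flat[path], "actual": live_flat[path]})
        elif in_live:
            # Present in the live environment but not in the baseline.
            drift.append({"path": path, "kind": "added", "actual": live_flat[path]})
        else:
            # Present in the baseline but gone from the live environment.
            drift.append({"path": path, "kind": "removed", "expected": base_flat[path]})
    return drift


if __name__ == "__main__":
    baseline = {"sg-web": {"ingress_ports": [443]}, "asg-api": {"max_size": 6}}
    live = {"sg-web": {"ingress_ports": [443, 22]}, "asg-api": {"max_size": 6}}
    for record in detect_drift(baseline, live):
        print(record)
```

Flattening to dotted paths keeps each drift record small, so an alert can point at the exact field that moved instead of dumping two full configs.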
🛠️ The Fix: Detect + Alert + Revert (Optional)
I built a system using:
- 🧠 AWS Lambda + CloudWatch Events (now Amazon EventBridge) → trigger checks every 10 minutes
- 🗂️ Baseline stored as JSON snapshot in S3
- 📩 Drift alert email with change details
- ♻️ Optional auto-revert if critical resources are touched
It’s like a tripwire for your cloud — silent until something shifts.
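To make the flow concrete, here's a rough sketch of one scheduled check: load the baseline, fetch live config, alert on differences, and revert only when the toggle is on. This is an illustration under assumptions, not the repo's implementation. The names `run_check` and `diff_resources` are hypothetical, and the AWS calls are passed in as plain callables so the logic can be read (and run) without an AWS account.

```python
import json
from typing import Callable, Iterable


def diff_resources(baseline: dict, live: dict) -> dict:
    """Map each resource id whose config differs to its (expected, actual) pair."""
    changed = {}
    for resource_id in sorted(baseline.keys() | live.keys()):
        expected, actual = baseline.get(resource_id), live.get(resource_id)
        if expected != actual:
            changed[resource_id] = (expected, actual)
    return changed


def run_check(
    load_baseline: Callable[[], dict],          # e.g. read the JSON snapshot from S3
    fetch_live: Callable[[], dict],             # e.g. describe the live resources
    send_alert: Callable[[str, str], None],     # e.g. send the drift email (subject, body)
    revert: Callable[[str, dict], None],        # e.g. re-apply the baseline config
    auto_revert: bool = False,                  # the toggle: off unless explicitly enabled
    critical: Iterable[str] = (),               # resource ids that may be auto-reverted
) -> dict:
    baseline, live = load_baseline(), fetch_live()
    changed = diff_resources(baseline, live)
    if not changed:
        return {"drifted": [], "reverted": []}  # silent until something shifts

    lines = [
        f"{rid}: expected {json.dumps(exp, sort_keys=True)}, "
        f"found {json.dumps(act, sort_keys=True)}"
        for rid, (exp, act) in changed.items()
    ]
    send_alert(f"Config drift detected in {len(changed)} resource(s)", "\n".join(lines))

    reverted = []
    if auto_revert:
        critical_ids = set(critical)
        for rid in changed:
            # Only revert critical resources that actually exist in the baseline.
            if rid in critical_ids and rid in baseline:
                revert(rid, baseline[rid])
                reverted.append(rid)
    return {"drifted": list(changed), "reverted": reverted}
```

In a real Lambda, each callable would be a thin wrapper around the matching AWS call, and the 10-minute cadence would come from a schedule rule with the expression `rate(10 minutes)`.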
📬 GitHub Link
Recreated version with safe sample configs:
👉 github.com/chinmaya-chhatre/configuration-drift-detector
📈 What Changed
- 🔒 Fewer surprise changes in prod
- 📉 Mean Time to Detect config issues dropped by 60%
- 💬 Helped SRE + Security stay in sync with dev infra updates
⚖️ Tradeoffs I Made
- Snapshot vs. Live Compare: Used JSON snapshot in S3 instead of live Terraform state — faster, easier to trigger via Lambda
- Optional Auto-Revert: Didn’t want infra reverting behind engineers’ backs — approval-based auto-revert keeps humans in control
- Manual Baseline Updates: Requires conscious version updates — safer, but adds ops overhead
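As one way to picture the manual baseline update, here's a small sketch of promoting a reviewed live snapshot to a new baseline version. The document shape (`version`, `approved_by`, `updated_at`, `resources`) and the function name `promote_baseline` are assumptions for illustration; the post doesn't specify how the real snapshot is structured.

```python
import copy
from datetime import datetime, timezone


def promote_baseline(current: dict, live_resources: dict, approved_by: str, now=None) -> dict:
    """Build the next baseline document from a reviewed live snapshot.

    The caller is expected to write the result back to storage (e.g. S3).
    `current` is left untouched so the previous version can be kept for rollback.
    """
    if not approved_by.strip():
        # Keeps the update a conscious, attributable step rather than a silent overwrite.
        raise ValueError("a baseline update needs a named approver")
    timestamp = now or datetime.now(timezone.utc)
    return {
        "version": current.get("version", 0) + 1,
        "approved_by": approved_by,
        "updated_at": timestamp.isoformat(),
        "resources": copy.deepcopy(live_resources),
    }
```

Recording who approved each version is where the extra ops overhead comes from, but it also gives an audit trail for every intentional change.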
🧠 What I'd Add Next
- 📊 Dashboard to visualize drift over time
- 🔐 Role-based control over what’s allowed to drift
- 🔁 Auto-tagging for anything outside baseline
🧵 Why I’m Sharing This
Because “it worked in staging” isn’t helpful when prod breaks.
Because config drift is invisible — until it isn’t.
And because detecting drift is just as critical as preventing it.