One morning, I got a Slack ping that read: "JPMC's dev instance just crashed again. That is the third time this week."
Their environment sat behind a custom Elastic Load Balancer with an Auto Scaling group. The problem? Sometimes the EC2 instance would not restart cleanly after a failed health check, and the app would just vanish.
Temporary Fix? Sure. But Make It Safe.
While engineering dug into logs, I configured a CloudWatch alarm to monitor EC2 health checks. If the instance failed, the alarm immediately triggered a Lambda function that rebooted it within seconds.
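For readers who want to try something similar, here is a minimal sketch of what such an alarm definition could look like with boto3. It is not the original configuration: it assumes the alarm watches the standard `StatusCheckFailed` metric in the `AWS/EC2` namespace and notifies an SNS topic that the Lambda function subscribes to. The alarm name, period, and threshold are illustrative guesses.

```python
def status_check_alarm_params(instance_id, topic_arn):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"auto-reboot-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",  # assumed metric; 1 means a check failed
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # assumed: evaluate once per minute
        "EvaluationPeriods": 1,  # assumed: fire on the first failing period
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # assumed: SNS topic the Lambda subscribes to
    }


def create_alarm(instance_id, topic_arn):
    """Create or update the alarm. Needs AWS credentials and boto3."""
    import boto3  # imported here so the parameter builder works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**status_check_alarm_params(instance_id, topic_arn))
```

Keeping the parameters in a plain function makes the alarm definition easy to review and test before anything touches the account.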
I also added:
- 🔔 Slack alerts via SNS for visibility
- 📊 CloudWatch metrics to track restart frequency
- ⏲️ A "backoff" timer so we did not enter an infinite reboot loop
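The reboot function and its guardrails can be sketched roughly as follows. This is an illustration under stated assumptions, not the function that ran in production: the event shape (`instance_id` key), the tag used to remember past reboots, the custom metric namespace, and all three timing constants are hypothetical. The backoff logic lives in a pure function so it can be checked without AWS access.

```python
import time

TAG_KEY = "AutoRebootHistory"   # hypothetical tag holding past reboot timestamps
COOLDOWN_SECONDS = 300          # assumed: minimum gap between two reboots
WINDOW_SECONDS = 3600           # assumed: look-back window for the reboot cap
MAX_REBOOTS_PER_WINDOW = 3      # assumed: stop rebooting after this many


def should_reboot(history, now):
    """Return (allowed, recent): whether a reboot is allowed, plus the
    timestamps from history that still fall inside the look-back window."""
    recent = [t for t in history if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REBOOTS_PER_WINDOW:
        return False, recent  # too many reboots lately: leave it to a human
    if recent and now - max(recent) < COOLDOWN_SECONDS:
        return False, recent  # last reboot was too recent: back off
    return True, recent


def handler(event, context):
    """Lambda entry point. Assumes the event carries an 'instance_id' key."""
    import boto3  # imported here so should_reboot can be tested without AWS

    instance_id = event["instance_id"]
    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")
    now = int(time.time())

    # Read the reboot history stored as colon-separated epoch seconds.
    tags = ec2.describe_tags(Filters=[
        {"Name": "resource-id", "Values": [instance_id]},
        {"Name": "key", "Values": [TAG_KEY]},
    ])["Tags"]
    history = [int(t) for t in tags[0]["Value"].split(":") if t] if tags else []

    allowed, recent = should_reboot(history, now)
    if not allowed:
        return {"instance_id": instance_id, "rebooted": False}

    ec2.reboot_instances(InstanceIds=[instance_id])
    ec2.create_tags(Resources=[instance_id], Tags=[
        {"Key": TAG_KEY, "Value": ":".join(str(t) for t in recent + [now])},
    ])
    # Custom metric so restart frequency can be graphed and alarmed on.
    cloudwatch.put_metric_data(Namespace="Custom/AutoReboot", MetricData=[
        {"MetricName": "RebootCount", "Value": 1, "Unit": "Count"},
    ])
    return {"instance_id": instance_id, "rebooted": True}
```

The cap per window is what prevents an infinite reboot loop. Once the function refuses to reboot, the Slack alert is the signal that a person needs to step in.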
"We only found out there was an issue because Apex informed us. It was already fixed by then." — JPM Project Lead
Postmortem Follow-Up
The issue stemmed from a development patch pushed without full regression testing. A recursive SQL query caused the server to crash under load. Because the EC2 instance could not recover on its own, it would have stayed down after each crash. Thanks to our automation, Lambda rebooted it within seconds.
SE Takeaways
- ⏰ Do not wait for a root cause to act. SEs bridge the now and the later.
- 🤝 Buy time, build trust. Stability opens the door to deeper adoption.
- 🏈 Incident response is a team sport. But you can quarterback the play.
Fun fact: they kept using that Lambda as a failsafe even after the core issue was patched.