One morning, I got a Slack ping that read: “JPMC's dev instance just crashed again. That’s the third time this week.”
Their environment was behind a custom Elastic Load Balancer with an autoscaling group. The problem? Sometimes the EC2 instance wouldn’t restart cleanly after a health check fail — and the app would just vanish.
While engineering dug into logs, I configured a CloudWatch alarm to monitor EC2 health checks. If the instance failed, the alarm immediately triggered a Lambda function that rebooted it within seconds.
I also added:
“We only found out there was an issue because Apex informed us — it was already fixed by then.” — JPM Project Lead
The issue stemmed from a development patch pushed without full regression testing. A recursive SQL query caused the server to crash under load — and since the EC2 instance couldn’t recover automatically, it would’ve stayed down. But thanks to our automation, Lambda rebooted it within seconds.
Fun fact: they kept using that Lambda as failsafe — even after the core issue was patched.