🛠️ From Crashes to Confidence: How I Kept JPMorgan’s Servers Online

#IncidentResponse #ClientTrust #ReliabilityEngineering

One morning, I got a Slack ping that read: “JPMC's dev instance just crashed again. That’s the third time this week.”

Their environment ran behind a custom Elastic Load Balancer with an Auto Scaling group. The problem? Sometimes the EC2 instance wouldn’t restart cleanly after a failed health check, and the app would simply go offline.

Temporary Fix? Sure. But Make It Safe.

While engineering dug into logs, I configured a CloudWatch alarm to monitor EC2 health checks. If the instance failed, the alarm immediately triggered a Lambda function that rebooted it within seconds.
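
For readers who want to picture the moving parts, here is a minimal sketch of what a reboot Lambda like this could look like. It is not the exact function from this engagement. It assumes the CloudWatch alarm notifies the Lambda through an SNS topic, and that the alarm is scoped to a single instance via an InstanceId dimension. The names extract_instance_id and handler, and the optional ec2 argument, are illustrative choices.

```python
import json


def extract_instance_id(event):
    """Return the instance ID from an alarm notification, or None.

    Assumption: the alarm arrives via SNS, so the alarm JSON is the string in
    Records[0].Sns.Message and the instance is listed under Trigger.Dimensions.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    # Ignore OK / INSUFFICIENT_DATA notifications so recovery never triggers a reboot.
    if message.get("NewStateValue") != "ALARM":
        return None
    for dimension in message["Trigger"]["Dimensions"]:
        if dimension["name"] == "InstanceId":
            return dimension["value"]
    raise ValueError("alarm message has no InstanceId dimension")


def handler(event, context, ec2=None):
    """Lambda entry point: reboot the instance named in the alarm."""
    instance_id = extract_instance_id(event)
    if instance_id is None:
        return {"rebooted": None}
    if ec2 is None:
        # Imported lazily so the module can be loaded and tested without boto3.
        import boto3
        ec2 = boto3.client("ec2")
    ec2.reboot_instances(InstanceIds=[instance_id])
    return {"rebooted": instance_id}
```

The state check matters in practice: an alarm also notifies when it returns to OK, and rebooting on that message would knock over an instance that had just recovered. Passing the client in as an argument keeps the handler easy to exercise with a stub before it touches a real environment.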

I also added:

“We only found out there was an issue because Apex informed us — it was already fixed by then.” — JPM Project Lead

Postmortem Follow-Up

The issue stemmed from a development patch pushed without full regression testing. A recursive SQL query crashed the server under load, and because the EC2 instance couldn’t recover on its own, it would have stayed down. Instead, our Lambda automation rebooted it within seconds.

SE Takeaways

Fun fact: they kept that Lambda in place as a failsafe, even after the core issue was patched.