🛠️ From Crashes to Confidence: How I Kept JPMorgan’s Servers Online

#IncidentResponse #ClientTrust #ReliabilityEngineering

One morning, I got a Slack ping that read: “JPMC's dev instance just crashed again. That’s the third time this week.”

Their environment sat behind a custom Elastic Load Balancer with an Auto Scaling group. The problem? Sometimes the EC2 instance wouldn’t restart cleanly after a failed health check, and the app would just vanish.

Temporary Fix? Sure. But Make It Safe.

While engineering dug into the logs, I configured a CloudWatch alarm to monitor the instance's EC2 health checks. Whenever a check failed, the alarm triggered a Lambda function that rebooted the instance within seconds.
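Here is a minimal sketch of what a reboot Lambda like that could look like. It is not the exact function from this engagement. It assumes the alarm notifies an SNS topic that invokes the Lambda, and that the alarm carries an InstanceId dimension. The names instance_ids_from_alarm and handler are illustrative, and the ec2 parameter exists only so the function can be exercised without AWS credentials.

```python
import json


def instance_ids_from_alarm(event):
    """Collect InstanceId dimension values from SNS-delivered CloudWatch alarms.

    Assumption: the alarm publishes to SNS, so each record carries the alarm
    payload as a JSON string under Records[].Sns.Message.
    """
    ids = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        # Only act when the alarm is firing, not when it returns to OK.
        if message.get("NewStateValue") != "ALARM":
            continue
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids


def handler(event, context, ec2=None):
    """Lambda entry point: reboot every instance named in the alarm."""
    if ec2 is None:
        # Imported lazily so the parsing logic can be tested without boto3.
        import boto3
        ec2 = boto3.client("ec2")
    ids = instance_ids_from_alarm(event)
    if ids:
        ec2.reboot_instances(InstanceIds=ids)
    return {"rebooted": ids}
```

The function's execution role would need permission for ec2:RebootInstances, ideally scoped to the specific instance rather than the whole account.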

I also added:

Slack alerts via SNS for visibility
CloudWatch metrics to track restart frequency
A “backoff” timer so we didn’t enter an infinite reboot loop
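The backoff guard is the piece most worth sketching, since it is what keeps an auto-reboot from turning into a reboot loop. The version below is one possible approach, not the exact logic we ran. It assumes the Lambda can look up timestamps of its recent reboots (for example from a small key-value store), and the function name, delays, and window are all hypothetical values chosen for illustration.

```python
def should_reboot(now, recent_reboots, base_delay=60.0, max_delay=900.0, window=3600.0):
    """Decide whether another automatic reboot is allowed right now.

    now            -- current time in seconds (e.g. time.time())
    recent_reboots -- timestamps (seconds) of earlier automatic reboots
    base_delay     -- wait required after the first reboot in the window
    max_delay      -- cap on the required wait
    window         -- reboots older than this no longer count

    The required wait doubles with each reboot inside the window, so a
    crash-looping instance gets rebooted less and less often.
    """
    in_window = [t for t in recent_reboots if now - t <= window]
    if not in_window:
        return True
    delay = min(base_delay * 2 ** (len(in_window) - 1), max_delay)
    return now - max(in_window) >= delay
```

With these defaults, one earlier reboot means waiting at least 60 seconds before the next, two means 120 seconds, and so on up to 15 minutes. When the guard says no, that is a good moment to page a human instead of rebooting again.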

“We only found out there was an issue because Apex informed us — it was already fixed by then.” — JPM Project Lead

Postmortem Follow-Up

The issue stemmed from a development patch pushed without full regression testing. A recursive SQL query crashed the server under load, and because the EC2 instance couldn’t recover on its own, it would have stayed down. Thanks to our automation, Lambda rebooted it within seconds instead.

SE Takeaways

Don’t wait for a root cause to act. SEs bridge the now and the later.
Buy time, build trust. Stability opens the door to deeper adoption.
Incident response is a team sport. But you can quarterback the play.

Fun fact: they kept using that Lambda as a failsafe, even after the core issue was patched.