Post-Sales · Incident Response · Reliability

From Crashes to Confidence: How I Kept JPMorgan's Servers Online

By Chinmaya Chhatre · Solutions Engineer

One morning, I got a Slack ping that read: "JPMC's dev instance just crashed again. That is the third time this week."

Their environment sat behind a custom Elastic Load Balancer with an Auto Scaling group. The problem? Sometimes the EC2 instance would not restart cleanly after a failed health check, and the app would just vanish.

Temporary Fix? Sure. But Make It Safe.

While engineering dug into logs, I configured a CloudWatch alarm to monitor EC2 health checks. If the instance failed, the alarm immediately triggered a Lambda function that rebooted it within seconds.
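For readers who want to picture the reboot step, here is a minimal sketch of what such a Lambda function could look like in Python. It is not the function from this engagement. It assumes the CloudWatch alarm publishes to an SNS topic that the Lambda subscribes to, and that the alarm is defined on a metric with an InstanceId dimension. The names instance_ids_from_event and handler, and the optional ec2 parameter (there only so the logic can be exercised without AWS credentials), are illustrative.

```python
import json


def instance_ids_from_event(event):
    """Collect EC2 instance IDs from SNS-wrapped CloudWatch alarm notifications.

    Assumption: each SNS message is the alarm's JSON payload, and the alarm's
    dimensions appear under Trigger.Dimensions as {"name": ..., "value": ...}.
    """
    ids = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids


def handler(event, context, ec2=None):
    """Reboot every instance named in the alarm that invoked this function."""
    if ec2 is None:
        # Imported lazily so the parsing logic can be tested offline.
        import boto3
        ec2 = boto3.client("ec2")
    ids = instance_ids_from_event(event)
    if ids:
        # Skip the API call entirely when no instance ID was found.
        ec2.reboot_instances(InstanceIds=ids)
    return {"rebooted": ids}
```

Returning the list of rebooted IDs keeps the invocation easy to trace in the function's logs, which helps when the goal is to tell the customer what happened before they notice.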

I also added:

"We only found out there was an issue because Apex informed us. It was already fixed by then." — JPM Project Lead

Postmortem Follow-Up

The issue stemmed from a development patch pushed without full regression testing. A recursive SQL query caused the server to crash under load. Because the EC2 instance could not recover on its own, it would have stayed down. With our automation in place, the Lambda function rebooted it within seconds.

Buy time, build trust. Stability opens the door to deeper adoption.

SE Takeaways

Fun fact: they kept using that Lambda as a failsafe even after the core issue was patched.
