At Vectorflow, during our implementation for LendingTree, we ran into a classic reliability dilemma. Their badge system had just gone live, and traffic surged far beyond the expected load. Badge latency spiked, and our engineering team recommended pausing new feature rollouts until latency stabilized.
But Sales had just promised a new "visitor pre-check" capability, and the client was expecting it that month.
The Fix: Data, Not Debates
I stepped in and introduced an error budget framework, not as a constraint but as a shared language for making smarter decisions.
I worked across teams to define:
- 📋 SLOs for badge latency and uptime by hospital wing
- 🔥 Acceptable monthly burn rates tied to performance targets
- 📊 A dashboard showing which clients were at risk of breaching those limits
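For readers who have not worked with error budgets, here is a minimal sketch of the arithmetic behind a dashboard like that. It assumes a 99.9% latency SLO and made-up request counts. The names (`SloWindow`, `burn_rate`, `at_risk`) and client labels are illustrative only, not our actual tooling. Burn rate here follows the common definition: observed error rate divided by the error rate the SLO allows, so 1.0 means the budget is being spent exactly on pace.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """One client's SLO window. All fields and values are illustrative."""
    client: str
    slo_target: float    # e.g. 0.999 means 99.9% of requests must be "good"
    total_requests: int  # requests observed so far in the window
    bad_requests: int    # requests that were too slow or failed


def error_budget(w: SloWindow) -> float:
    """Number of bad requests the SLO tolerates for the traffic seen so far."""
    return (1.0 - w.slo_target) * w.total_requests


def burn_rate(w: SloWindow) -> float:
    """Observed error rate divided by the allowed error rate."""
    if w.total_requests == 0:
        return 0.0
    observed = w.bad_requests / w.total_requests
    allowed = 1.0 - w.slo_target
    return observed / allowed


def at_risk(windows: list[SloWindow], threshold: float = 1.0) -> list[str]:
    """Clients burning budget faster than the threshold (assumed default: 1.0)."""
    return [w.client for w in windows if burn_rate(w) > threshold]


# Hypothetical numbers, for illustration only.
windows = [
    SloWindow("client-a", slo_target=0.999, total_requests=1_000_000, bad_requests=500),
    SloWindow("client-b", slo_target=0.999, total_requests=1_000_000, bad_requests=2_000),
]

for w in windows:
    print(f"{w.client}: budget={error_budget(w):.0f} bad requests, "
          f"burn rate={burn_rate(w):.2f}")

print("At risk:", at_risk(windows))
```

In this sketch, `client-b` has used about twice its allowance, which is the kind of signal that justifies pausing a rollout. A real setup would likely compute burn rate over several time windows instead of one running total.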
"This gave us something we could all agree on, finally." — Director of Engineering
It Changed the Conversation
Instead of arguing over opinions, we aligned around metrics. When LendingTree asked about new features, we explained the current burn rate and gave a revised timeline. Surprisingly, they appreciated the honesty and agreed to delay the rollout by a sprint.
Every team has tension. SEs do not eliminate it; we give it structure.
SE Takeaways
- 💼 Reliability is not just an engineering concern. It is a business one too.
- 💪 Error budgets give SEs leverage and credibility.
- 📊 When in doubt, align with data.