Post‑Mortem: Automations Retry Bug
Incident Window: Wednesday, June 11, 2025 – Friday, June 13, 2025
Services Affected: Automations impacting outbound calls
1 | Executive Summary
Between June 11 and June 13, 2025, the Automations service repeatedly retried calls that had not actually failed, causing spam-like repeat call attempts and inconsistent customer workflows. The immediate root cause was flawed retry logic introduced on June 11 that misclassified successful calls as failures. Two patch releases fully resolved the issue.
2 | Detailed Timeline (EDT)
Refer to Incident History.
3 | Root Cause
- Logic flaw: The retry handler treated some call status responses as failures and incremented the retry count up to the limit, placing a new call on each retry. Because daily call volume is large and the servers autoscaled effectively, the extra retries did not trigger any automated alerting.
- Testing misses: The issue occurs only with loop nodes. Acceptance testing by the product and customer success teams in the Staging environment did not catch this edge case.
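
The misclassification described above can be illustrated with a minimal sketch. The status names, `MAX_RETRIES`, and `should_retry` are hypothetical and are not taken from the Automations codebase. The sketch only shows one defensive pattern: retry decisions driven by an explicit allow-list of failure statuses, so that a successful or unrecognized status never consumes a retry.

```python
from enum import Enum


class CallStatus(Enum):
    # Hypothetical status values, for illustration only.
    COMPLETED = "completed"
    IN_PROGRESS = "in_progress"
    BUSY = "busy"
    FAILED = "failed"


# Only statuses listed here are ever retried. Which statuses belong in
# this set is a product decision; this choice is an assumption.
RETRYABLE_STATUSES = frozenset({CallStatus.BUSY, CallStatus.FAILED})

MAX_RETRIES = 3  # assumed limit


def should_retry(status: CallStatus, attempts_so_far: int) -> bool:
    """Return True only for a known failure status that is under the limit."""
    if status not in RETRYABLE_STATUSES:
        # Successful, in-progress, or unlisted statuses are never retried.
        return False
    return attempts_so_far < MAX_RETRIES
```

With an allow-list, adding a new status later defaults to "do not retry", which fails toward fewer calls rather than more.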
4 | Contributing Factors
- Inadequate testing: The Engineering, Customer Success, and Product teams did not test enough to cover edge cases.
- No formal risk assessment: During product planning, the company did not perform a risk assessment for product features that impact calls.
- Missing alerting/monitoring: PagerDuty was not configured to alert on anomalous retry spikes or similar edge cases.
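
Because high call volume and autoscaling masked the extra retries, an alert keyed to the share of retries rather than to absolute load may have been more likely to fire. The following is a minimal sketch of such a check under stated assumptions: `RetryRateMonitor`, its thresholds, and the sliding-window approach are hypothetical, and the sketch does not call PagerDuty or any other alerting service.

```python
from collections import deque


class RetryRateMonitor:
    """Tracks the fraction of call attempts that are retries in a time window."""

    def __init__(self, window_seconds: float, max_retry_ratio: float,
                 min_attempts: int = 50):
        self.window_seconds = window_seconds
        self.max_retry_ratio = max_retry_ratio  # assumed threshold
        self.min_attempts = min_attempts        # avoids alerts on tiny samples
        self._events = deque()                  # (timestamp, is_retry) pairs

    def record(self, timestamp: float, is_retry: bool) -> None:
        """Record one call attempt; timestamps must be non-decreasing."""
        self._events.append((timestamp, is_retry))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop attempts that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def should_alert(self, now: float) -> bool:
        """True when retries exceed the allowed share of recent attempts."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_attempts:
            return False
        retries = sum(1 for _, is_retry in self._events if is_retry)
        return retries / total > self.max_retry_ratio
```

A ratio stays meaningful as total volume grows, whereas a fixed count threshold would need retuning whenever traffic changes.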
5 | Preventive & Long‑Term Measures
- Identification of critical features: The product team will develop a list of critical features where failures could lead to broader unintended consequences, not just loss of functionality.
- Risk assessment: The product team will assess the risk of each feature and plan alerting accordingly.
- Testing: The product and customer success teams will test changes to critical features more comprehensively and holistically.
- Monitoring: The engineering team will add real‑time retry alerts.
- Kill-switch functionality: Customers will be able to immediately and completely disable all calls from their dashboard, giving them control in case of unintended calls.
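
As a rough illustration of the kill-switch measure, the sketch below gates every outbound call on a per-customer flag. All names (`CallKillSwitch`, `place_call`, the `dial` callback) are hypothetical, and the in-memory set stands in for whatever shared store a real deployment would need so that every server sees the same flag.

```python
import threading
from typing import Callable


class CallKillSwitch:
    """In-memory per-customer switch; a real system would need a shared store."""

    def __init__(self) -> None:
        self._disabled = set()
        self._lock = threading.Lock()

    def disable(self, customer_id: str) -> None:
        with self._lock:
            self._disabled.add(customer_id)

    def enable(self, customer_id: str) -> None:
        with self._lock:
            self._disabled.discard(customer_id)

    def calls_allowed(self, customer_id: str) -> bool:
        with self._lock:
            return customer_id not in self._disabled


def place_call(switch: CallKillSwitch, customer_id: str,
               dial: Callable[[], None]) -> bool:
    """Check the switch immediately before dialing; return True if dialed."""
    if not switch.calls_allowed(customer_id):
        return False
    dial()
    return True
```

Checking the flag at dial time, rather than when a workflow is scheduled, means calls already queued (including retries) are also stopped once the customer flips the switch.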