Automations Loop Node Retry Bug [Resolved]

Incident Report for Thoughtly


Resolved

Note: This issue was not surfaced on the status page in real time; we’re publishing a retroactive summary for transparency.

Post‑Mortem: Automations Loop Node Retry Bug

Incident Window: Wednesday, June 11, 2025 – Friday, June 13, 2025 
Services Affected: Automations impacting outbound calls

________________________________________
1. Executive Summary

Between June 11 and June 13, 2025, the Automations service repeatedly retried calls that had not actually failed, resulting in unwanted repeat call attempts and inconsistent customer workflows. The immediate root cause was flawed retry logic introduced on June 11 that misclassified successful calls as failures. Two patch releases fully resolved the issue.

________________________________________
2. Detailed Timeline (EDT)

Date: Wed 11 Jun

Time: 14:43
Event: The customer success team reported that a single customer account was experiencing a retry issue. It was classified as a P2 issue local to that account and attributed to the max attempt field value causing retries. This assessment was incomplete; it was later found that a broader group of accounts was affected.

Time: 20:08
Event: The engineering team identified a fix to stop the unexpected retries when the max attempt field is set.

Date: Thu 12 Jun

Time: 09:39
Event: The fix was deployed after the customer success and product teams signed off on testing in the Staging environment.

Time: 15:37
Event: The customer success team reported a new retry issue occurring when the max attempt field is not set. At this time the incident was still believed to be isolated.

Time: 21:03
Event: The customer success team received a report from another account and discovered that the issue was not isolated, escalating the incident to P1. The affected accounts were immediately disabled while the team worked on a solution.

Date: Fri 13 Jun

Time: 10:45
Event: The engineering team developed a fix to remove the erroneous retries.

Time: 11:45
Event: The fix was tested in the Staging environment and signed off by the customer success and product teams.

Time: 12:05
Event: The fix was deployed to production.

Time: 12:17
Event: The fix was verified in production by the customer success and product teams.

Time: 13:00
Event: Incident closed, post‑mortem scheduled.

________________________________________
3. Root Cause

Logic flaw: The retry handler treated certain call status responses as failures even though the calls had completed successfully, and retried each affected call up to the configured limit. Because daily call volume is high and the servers autoscaled to absorb the extra load, the additional calls did not trigger any automated alerting.
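
For illustration only, the sketch below shows the general shape of this kind of flaw; the type, function names, and status values are hypothetical and are not taken from the Automations codebase. The point is that a handler which defaults unrecognized statuses to "failure" will retry calls that actually succeeded, and an unset retry limit can make the behavior worse.

```typescript
// Hypothetical sketch; names and status values are illustrative, not Thoughtly's code.

interface CallResult {
  status: string;         // status string returned for the outbound call
  attempts: number;       // retry attempts already made
  maxAttempts?: number;   // configured retry limit; may be unset on some workflows
}

// Buggy pattern: anything not on a short success list is treated as a failure,
// so a successful status the handler does not recognize (e.g. one produced only
// inside a loop node) is retried. If maxAttempts is unset, an implicit default
// limit still allows several unwanted calls.
function shouldRetryBuggy(r: CallResult): boolean {
  const knownSuccess = ["completed"];
  const treatedAsFailed = !knownSuccess.includes(r.status); // unknown => "failed"
  const limit = r.maxAttempts ?? 3;                         // implicit default limit
  return treatedAsFailed && r.attempts < limit;
}

// Safer pattern: retry only statuses explicitly known to be retryable failures,
// and never retry when no limit has been configured.
function shouldRetryFixed(r: CallResult): boolean {
  const retryableFailures = ["failed", "no-answer", "busy"];
  if (r.maxAttempts === undefined) return false;
  return retryableFailures.includes(r.status) && r.attempts < r.maxAttempts;
}
```

Allow-listing retryable failure statuses, rather than deny-listing known successes, avoids misclassifying statuses the handler has never seen.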

Testing gaps: The issue occurs only when loop nodes are used. Acceptance testing by the product and customer success teams did not cover this edge case in the Staging environment.

________________________________________
4. Contributing Factors

Inadequate testing: The engineering, customer success, and product teams did not test thoroughly enough to cover edge cases.

No formal risk assessment: During product planning, no risk assessment was performed for features that affect calls.

Missing alerting/monitoring: PagerDuty was not configured to alert on anomalous retry spikes or similar edge cases.

________________________________________
5. Preventive & Long‑Term Measures

Identification of critical features: The product team will develop a list of critical features where failures could lead to broader unintended consequences, not just loss of functionality.

Risk assessment: The product team will assess the risk of each feature and plan alerting accordingly.

Testing: The product and customer success teams will perform more comprehensive testing of changes to critical features.

Monitoring: The engineering team will add real-time retry alerts (a rough sketch of one such check appears at the end of this section).

Kill-switch functionality: Give customers the ability to immediately and completely disable all outbound calls from their dashboard, so they retain control if unintended calls occur.
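
As referenced under Monitoring above, the sketch below shows one plausible shape for a real-time retry alert: compare the retry ratio in a short window against a threshold and page the on-call engineer when it spikes. The window size, thresholds, and function names are assumptions for illustration, not the team's actual monitoring configuration.

```typescript
// Hypothetical retry-spike check; thresholds and names are illustrative only.

interface RetryWindowStats {
  retries: number;     // retry attempts observed in the window
  totalCalls: number;  // all outbound call attempts in the window
}

// Assumed policy: alert when more than 5% of calls in a 5-minute window are retries,
// with a minimum call count so low-volume windows do not page spuriously.
const RETRY_RATIO_THRESHOLD = 0.05;
const MIN_CALLS_FOR_ALERT = 100;

function isRetrySpike(stats: RetryWindowStats): boolean {
  if (stats.totalCalls < MIN_CALLS_FOR_ALERT) return false;
  return stats.retries / stats.totalCalls > RETRY_RATIO_THRESHOLD;
}

// Placeholder for the paging integration (e.g. a PagerDuty event); the real wiring
// would live in the monitoring pipeline.
function pageOnCall(summary: string): void {
  console.error(`ALERT: ${summary}`);
}

// Evaluated once per monitoring window.
const window5m: RetryWindowStats = { retries: 480, totalCalls: 5200 };
if (isRetrySpike(window5m)) {
  const pct = ((window5m.retries / window5m.totalCalls) * 100).toFixed(1);
  pageOnCall(`Retry ratio ${pct}% exceeds ${RETRY_RATIO_THRESHOLD * 100}% threshold`);
}
```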
Posted Jun 16, 2025 - 18:28 EDT
This incident affected: Platform (Automations).