All Systems Operational

About This Site

Welcome to the Thoughtly status page. Here, you can find system metrics and real-time updates on the operational status of our platform, APIs, voice infrastructure, and key services.

Thoughtly's state-of-the-art AI Voice Agent infrastructure is designed for enterprise-grade reliability, ensuring seamless, real-time telephony conversations with minimal latency. Our globally distributed systems are built with redundancy, failover mechanisms, and stringent uptime commitments to support mission-critical business operations.

Platform: Operational (99.97 % uptime over the past 90 days)
API (US East): Operational (99.96 % uptime over the past 90 days)
API (US West): Operational (99.96 % uptime over the past 90 days)
API (Europe): Operational (99.96 % uptime over the past 90 days)
API (Asia Pacific): Operational (99.96 % uptime over the past 90 days)
Dashboard: Operational (99.96 % uptime over the past 90 days)
CDN: Operational (100.0 % uptime over the past 90 days)
Automations: Operational (99.96 % uptime over the past 90 days)
Voice Infrastructure: Operational (99.99 % uptime over the past 90 days)
Agent: Operational (99.96 % uptime over the past 90 days)
Carrier Gateway: Operational (100.0 % uptime over the past 90 days)
STT: Operational (100.0 % uptime over the past 90 days)
TTS: Operational (100.0 % uptime over the past 90 days)
LLM: Operational (100.0 % uptime over the past 90 days)
SIP: Operational (100.0 % uptime over the past 90 days)
Jun 30, 2025

No incidents reported today.

Jun 29, 2025

No incidents reported.

Jun 28, 2025

No incidents reported.

Jun 27, 2025

No incidents reported.

Jun 26, 2025

No incidents reported.

Jun 25, 2025

No incidents reported.

Jun 24, 2025

No incidents reported.

Jun 23, 2025

No incidents reported.

Jun 22, 2025

No incidents reported.

Jun 21, 2025

No incidents reported.

Jun 20, 2025

No incidents reported.

Jun 19, 2025

No incidents reported.

Jun 18, 2025

No incidents reported.

Jun 17, 2025

No incidents reported.

Jun 16, 2025
Postmortem
Jun 17, 14:31 EDT
Resolved - Note: This issue was not surfaced on the status page in real time; we’re publishing a retroactive summary for transparency.

Post‑Mortem: Automations Loop Node Retry Bug

Incident Window: Wednesday, June 11, 2025 – Friday, June 13, 2025 
Services Affected: Automations (impacting outbound calls)

________________________________________
1. Executive Summary

Between June 11 and June 13, 2025, the Automations service repeatedly retried calls that had not actually failed, causing unwanted repeat call attempts and inconsistent customer workflows. The immediate root cause was flawed retry logic introduced on June 11 that misclassified successful calls as failures. Two patch releases fully resolved the issue.

________________________________________
2. Detailed Timeline (EDT)

Date: Wed, Jun 11

Time: 14:43
Event: The customer success team reported that a single customer account was experiencing a retry issue. It was triaged as a P2 issue local to that account, attributed to the max attempt field value causing retries. This assessment was incomplete: a broader group of accounts was later found to be affected.

Time: 20:08
Event: The engineering team identified a fix to stop unexpected retries when the max attempt field is set.

Date: Thu, Jun 12

Time: 09:39
Event: Fix was deployed after the customer success and product teams signed off on testing in the Staging environment.

Time: 15:37
Event: The customer success team reported a new retry issue occurring when the max attempt field is not set. At this time the incident was still believed to be isolated.

Time: 21:03
Event: The CS team received a report from another account and discovered that the issue was not isolated. The affected accounts were immediately disabled while the team worked on a solution, and the incident was escalated to P1.

Date: Fri, Jun 13

Time: 10:45
Event: The engineering team developed a fix to remove the faulty retries.

Time: 11:45
Event: The fix was tested in Staging and received sign-off from the customer success and product teams.

Time: 12:05
Event: The fix was deployed to production.

Time: 12:17
Event: Fix verified in production by customer success and product teams.

Time: 13:00
Event: Incident closed, post‑mortem scheduled.

________________________________________
3. Root Cause

Logic flaw: The retry handler treated certain call status responses as failures, incrementing the retry count up to the configured limit. Because daily call volume is large and the servers autoscale effectively, the extra call attempts did not trigger any automated alerting.
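The class of bug described here can be illustrated with a minimal sketch. All function names and status strings below are hypothetical; the actual retry handler is internal and not public.

```python
# Illustrative only: a retry predicate that misclassifies non-terminal call
# statuses as failures, and a corrected version. Status values are assumptions.

FAILURE_STATUSES = {"failed", "busy", "no-answer"}  # statuses that warrant a retry
SUCCESS_STATUSES = {"completed"}                    # terminal success

def should_retry_buggy(status: str, attempts: int, max_attempts: int) -> bool:
    # Bug: anything that is not an explicit success counts as a failure,
    # so transient statuses like "in-progress" also trigger retries.
    return status not in SUCCESS_STATUSES and attempts < max_attempts

def should_retry_fixed(status: str, attempts: int, max_attempts: int) -> bool:
    # Fix: retry only on statuses known to represent a failed call.
    return status in FAILURE_STATUSES and attempts < max_attempts
```

With the buggy predicate, a call still reporting an in-flight status is retried even though it never failed, which matches the spurious call attempts described above.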

Testing gap: The issue occurs only with loop nodes. Acceptance testing by the product and customer success teams did not catch this edge case in the Staging environment.

________________________________________
4. Contributing Factors

Inadequate testing — The engineering, customer success, and product teams did not test broadly enough to cover edge cases.

No formal risk assessment — As a company, there was no risk assessment done for product features that impact calls during product planning.

Missing alerting/monitoring — PagerDuty was not configured for anomalous retry spikes/edge cases.

________________________________________
5. Preventive & Long‑Term Measures

Identification of critical features: The product team will develop a list of critical features where failures could lead to broader unintended consequences, not just loss of functionality.

Risk assessment: The product team will assess the risk of each feature and plan alerting accordingly.

Testing: The product and customer success teams will perform more comprehensive, holistic testing of changes to critical features.

Monitoring: The engineering team will add real-time retry alerting.

Kill-switch functionality: Give customers the ability to immediately and completely disable all calls from their dashboard, so they retain control in the event of unintended calls.
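The real-time retry alerting mentioned under Monitoring could take a shape like the following sliding-window sketch. The window size, threshold, and alerting hook are all assumptions for illustration; this is not Thoughtly's actual monitoring code.

```python
# Illustrative sketch: count retry attempts in a sliding time window and
# signal when the rate crosses a threshold (at which point a pager/alert
# integration would fire). Parameters are assumptions.
from collections import deque
import time

class RetrySpikeAlert:
    def __init__(self, window_seconds=60, threshold=100, now=time.monotonic):
        self.window = window_seconds
        self.threshold = threshold
        self.now = now          # injectable clock, useful for testing
        self.events = deque()   # timestamps of recent retry attempts

    def record_retry(self) -> bool:
        """Record one retry; return True if the rate exceeds the threshold."""
        t = self.now()
        self.events.append(t)
        # Drop events that have aged out of the sliding window.
        while self.events and self.events[0] < t - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold
```

A rate-based check like this would have surfaced the incident's retry spike even though overall call volume and autoscaling masked it from load-based alarms.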

Jun 16, 18:28 EDT