We experienced a brief service disruption that affected our platform and prevented the dashboard from loading properly. The root cause was unusually high load on our cache instance, compounded by a bug in how we invoke that cache.
Incident Summary
Surge in Volume
A surge in scheduled automations significantly increased the load on our Redis cache instance. This led to rapid Redis key creation and high memory/CPU utilization.
Redis Overload and API Back Pressure
Once Redis reached capacity, back pressure built up on the API layer. This degraded the API servers and left them unable to serve the dashboard for a brief period.
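As general context on this failure mode (a minimal sketch in Python with redis-py, not our actual implementation; the hostname and the load_from_database helper are hypothetical): when a cache read has no strict timeout, a saturated Redis instance can hold API workers while they wait, which is how back pressure reaches the API layer. Bounding the read and falling back to the primary data store keeps workers free.

    import redis

    def load_from_database(user_id: str) -> bytes:
        # Stand-in for the real data source; hypothetical.
        return f"dashboard for {user_id}".encode()

    # Short socket timeouts keep a saturated cache from holding API workers.
    cache = redis.Redis(host="cache.internal",       # hostname is illustrative
                        socket_timeout=0.1,
                        socket_connect_timeout=0.1)

    def get_dashboard_payload(user_id: str) -> bytes:
        key = f"dashboard:{user_id}"
        try:
            cached = cache.get(key)
            if cached is not None:
                return cached
        except redis.RedisError:
            pass  # cache slow or unavailable: fall through instead of blocking
        payload = load_from_database(user_id)
        try:
            cache.set(key, payload, ex=300)  # 5-minute TTL
        except redis.RedisError:
            pass  # serving the response matters more than repopulating the cache
        return payload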
Cache Logic Bug
Our infrastructure is designed to handle surges; however, a newly introduced bug in how we create cache keys made this incident more severe than expected.
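To illustrate the class of bug (a simplified sketch in Python with redis-py, not our actual code; all names are illustrative): a cache key that embeds a per-run value, such as a timestamp, is never reused, so every automation run writes a brand-new key and the keyspace grows without bound.

    import time
    import redis

    r = redis.Redis(host="cache.internal")  # hostname is illustrative

    # Buggy pattern: the key embeds a per-run value, so no key is ever reused
    # and the number of keys grows with every automation run.
    def buggy_cache_key(automation_id: str) -> str:
        return f"automation:{automation_id}:result:{time.time_ns()}"

    # Fixed pattern: the key is deterministic for a given automation, so the
    # same key is overwritten on each run and the key count stays bounded.
    def fixed_cache_key(automation_id: str) -> str:
        return f"automation:{automation_id}:result"

    def cache_result(automation_id: str, result: bytes) -> None:
        r.set(fixed_cache_key(automation_id), result, ex=3600)  # 1-hour TTL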
Corrective Actions Summary
Detection: Automated alerts notified our engineering team of elevated error rates and increased latency in the API.
Investigation: Initial checks pointed to Redis as a bottleneck. Deeper analysis revealed excessive key creation from the bug.
Mitigation: We immediately increased resources for the Redis instance (scaling CPU and memory) to alleviate pressure.
Resolution:
Fixed the bug that caused unnecessary key creation.
Validated changes in a staging environment and monitored production post-deployment.
Added additional monitors to watch our cache layer more closely.
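As an example of the kind of check involved (a simplified sketch with illustrative thresholds, not our production monitoring): polling Redis memory usage and total key count, and alerting when either exceeds a limit, surfaces this type of issue earlier.

    import redis

    CACHE = redis.Redis(host="cache.internal")    # hostname is illustrative
    MAX_MEMORY_BYTES = 8 * 1024**3                # example threshold: 8 GiB
    MAX_KEYS = 5_000_000                          # example threshold

    def check_cache_health() -> list[str]:
        alerts = []
        memory = CACHE.info("memory")             # dict of Redis memory stats
        if memory["used_memory"] > MAX_MEMORY_BYTES:
            alerts.append(f"Redis memory high: {memory['used_memory']} bytes")
        key_count = CACHE.dbsize()                # total keys in the current database
        if key_count > MAX_KEYS:
            alerts.append(f"Redis key count high: {key_count}")
        return alerts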
Posted Mar 28, 2025 - 17:37 EDT
Resolved
API Outage
Posted Mar 26, 2025 - 15:07 EDT
Update
We are continuing to investigate this issue.
Posted Mar 26, 2025 - 15:04 EDT
Update
API Load Balancer Health Check Degradation
Posted Mar 26, 2025 - 15:02 EDT
Investigating
We are currently investigating elevated queue times in automation runs, which cause a delay between launching an automation and running its first step.
Posted Mar 26, 2025 - 14:58 EDT
This incident affected: Platform (API (US East), API (US West), API (Europe), API (Asia Pacific)).