We experienced a brief service disruption that affected our platform and prevented the dashboard from loading properly. The root cause was unusually high load on our cache instance, compounded by a bug in how we invoke that cache.
Incident Summary
Surge in Volume
A surge in scheduled automations significantly increased the load on our Redis cache instance. This led to rapid Redis key creation and high memory/CPU utilization.
Redis Overload and API Back Pressure
Once Redis reached capacity, back pressure built up on the API layer. This degraded the API servers and left them unable to serve the dashboard for a brief period.
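As general context on this failure mode (a minimal sketch in Python with redis-py, not our actual implementation; the hostname and the load_from_database helper are hypothetical): when a cache read has no strict timeout, a saturated Redis instance can hold API workers while they wait, which is how back pressure reaches the API layer. Bounding the read and falling back to the primary data store keeps workers free.

    import redis

    def load_from_database(user_id: str) -> bytes:
        # Stand-in for the real data source; hypothetical.
        return f"dashboard for {user_id}".encode()

    # Short socket timeouts keep a saturated cache from holding API workers.
    cache = redis.Redis(host="cache.internal",       # hostname is illustrative
                        socket_timeout=0.1,
                        socket_connect_timeout=0.1)

    def get_dashboard_payload(user_id: str) -> bytes:
        key = f"dashboard:{user_id}"
        try:
            cached = cache.get(key)
            if cached is not None:
                return cached
        except redis.RedisError:
            pass  # cache slow or unavailable: fall through instead of blocking
        payload = load_from_database(user_id)
        try:
            cache.set(key, payload, ex=300)  # 5-minute TTL
        except redis.RedisError:
            pass  # serving the response matters more than repopulating the cache
        return payload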
Cache Logic Bug
Our infrastructure is designed to handle surges; however, a newly introduced bug in how we create cache keys made this incident more severe than expected.
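To illustrate the class of bug (a simplified sketch in Python with redis-py, not our actual code; all names are illustrative): a cache key that embeds a per-run value, such as a timestamp, is never reused, so every automation run writes a brand-new key and the keyspace grows without bound.

    import time
    import redis

    r = redis.Redis(host="cache.internal")  # hostname is illustrative

    # Buggy pattern: the key embeds a per-run value, so no key is ever reused
    # and the number of keys grows with every automation run.
    def buggy_cache_key(automation_id: str) -> str:
        return f"automation:{automation_id}:result:{time.time_ns()}"

    # Fixed pattern: the key is deterministic for a given automation, so the
    # same key is overwritten on each run and the key count stays bounded.
    def fixed_cache_key(automation_id: str) -> str:
        return f"automation:{automation_id}:result"

    def cache_result(automation_id: str, result: bytes) -> None:
        r.set(fixed_cache_key(automation_id), result, ex=3600)  # 1-hour TTL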
Corrective Actions Summary
Detection: Automated alerts notified our engineering team of elevated error rates and increased latency in the API.
Investigation: Initial checks pointed to Redis as a bottleneck. Deeper analysis revealed excessive key creation from the bug.
Mitigation: We immediately increased resources for the Redis instance (scaling CPU and memory) to alleviate pressure.
Resolution:
Fixed the bug that caused unnecessary key creation.
Validated changes in a staging environment and monitored production post-deployment.
Added additional monitors to watch our cache layer more closely.
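As an example of the kind of check involved (a simplified sketch with illustrative thresholds, not our production monitoring): polling Redis memory usage and total key count, and alerting when either exceeds a limit, surfaces this type of issue earlier.

    import redis

    CACHE = redis.Redis(host="cache.internal")    # hostname is illustrative
    MAX_MEMORY_BYTES = 8 * 1024**3                # example threshold: 8 GiB
    MAX_KEYS = 5_000_000                          # example threshold

    def check_cache_health() -> list[str]:
        alerts = []
        memory = CACHE.info("memory")             # dict of Redis memory stats
        if memory["used_memory"] > MAX_MEMORY_BYTES:
            alerts.append(f"Redis memory high: {memory['used_memory']} bytes")
        key_count = CACHE.dbsize()                # total keys in the current database
        if key_count > MAX_KEYS:
            alerts.append(f"Redis key count high: {key_count}")
        return alerts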
Posted Mar 28, 2025 - 17:37 EDT
Resolved
API Outage
Posted Mar 26, 2025 - 15:07 EDT
Update
We are continuing to investigate this issue.
Posted Mar 26, 2025 - 15:04 EDT
Update
API Load Balancer Health Check Degradation
Posted Mar 26, 2025 - 15:02 EDT
Investigating
We are currently investigating elevated queue times in automation runs, which cause a delay between launching an automation and running its first step.
Posted Mar 26, 2025 - 14:58 EDT
This incident affected: Platform (API (US East), API (US West), API (Europe), API (Asia Pacific)).