Affected: Major outage from 1:41 AM to 9:51 AM
- Update
Earlier today, at around 02:41 UTC+1, services hosted on one of our clusters, including RTv2 and a portion of batch processing, went offline. The outage was caused by controllers that were undersized for the additional load generated by new nodes and pods added to the cluster. The resulting overload made the controllers, and subsequently the entire cluster, unreachable.
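For context, the sketch below (Python, purely illustrative) models controller load as a simple linear function of node and pod counts. The per-node and per-pod costs and the capacity figure are hypothetical placeholders, not measurements from our cluster; the point is how adding nodes and pods without resizing the controllers can push estimated load past provisioned capacity.

```python
# Illustrative only: a rough capacity check of the kind that could catch
# undersized controllers before new nodes and pods are added.
# The per-node/per-pod costs and capacity figures are hypothetical placeholders.

def controller_headroom(nodes: int, pods: int,
                        provisioned_cpu_millicores: int,
                        cpu_per_node: int = 10,
                        cpu_per_pod: int = 2) -> int:
    """Return remaining controller CPU (millicores) after the estimated load."""
    estimated_load = nodes * cpu_per_node + pods * cpu_per_pod
    return provisioned_cpu_millicores - estimated_load

# Adding nodes/pods without resizing the controllers drives headroom negative.
before = controller_headroom(nodes=50, pods=2_000, provisioned_cpu_millicores=8_000)
after = controller_headroom(nodes=120, pods=6_000, provisioned_cpu_millicores=8_000)
print(f"headroom before resize: {before}m, after: {after}m")
if after < 0:
    print("controllers undersized for the new node/pod count")
```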
An issue with our alerting service, Instatus, prevented the on-call team from being notified.
The investigation started at 08:00 UTC+1. A workaround was identified at 09:32 UTC+1, but it required additional time to stabilize. By 10:00 UTC+1, controllers and workers were operating stably, and by 11:00 UTC+1, RTv2 services were restored.
The root cause was traced to a human error during cluster resizing. The addition of new nodes and pods led to a spike in resource demand, which overwhelmed the controllers. Attempts to scale down nodes or to increase CPU and RAM on the affected machines were unsuccessful. Stability was achieved by restarting the controllers one by one and reducing their load, then recovering some GPU nodes.
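The sketch below illustrates the one-by-one restart sequencing described above. The hostnames, the service name, and the SSH-based health check are assumptions for illustration, not our actual tooling; the point is that only one controller is restarted at a time, and each must report healthy before the next is touched.

```python
# Minimal sketch of a one-by-one controller restart.
# Hostnames, the service name, and the health check are hypothetical.
import subprocess
import time

CONTROLLERS = ["controller-1.internal", "controller-2.internal", "controller-3.internal"]
SERVICE = "control-plane.service"  # placeholder name

def is_healthy(host: str) -> bool:
    """Consider the controller healthy when its service reports 'active'."""
    result = subprocess.run(
        ["ssh", host, "systemctl", "is-active", SERVICE],
        capture_output=True, text=True,
    )
    return result.stdout.strip() == "active"

for host in CONTROLLERS:
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", SERVICE], check=True)
    # Wait for this controller to come back before touching the next one,
    # so the cluster never loses more than one controller at a time.
    while not is_healthy(host):
        time.sleep(10)
```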
We will continue adding resources in the coming days to restore full functionality. We have also escalated the alerting issue to Instatus, and they are actively working on a fix. In addition, we will implement additional layers of alerting for greater reliability.
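As one possible additional alerting layer, an independent watchdog could probe the service directly and page through a separate channel when the probe fails. The sketch below is only an illustration of that idea, not our actual implementation; both URLs are hypothetical placeholders.

```python
# Sketch of a secondary alerting layer: probe the service directly and page
# through a separate channel if the probe fails. URLs are placeholders.
import json
import urllib.error
import urllib.request

HEALTH_URL = "https://api.example.com/health"        # hypothetical endpoint
PAGER_WEBHOOK = "https://pager.example.com/webhook"  # hypothetical second channel

def probe(url: str, timeout: int = 10) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def page(message: str) -> None:
    """Send an alert to the secondary paging channel."""
    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        PAGER_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)

if not probe(HEALTH_URL):
    page("Real-time STT API health check failed; primary alerting may be down.")
```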
- Resolved
This incident has been resolved.
- Monitoring
We implemented a fix and are currently monitoring the result.
- Identified
The Real-time STT API is currently unavailable. The issue has been identified and we are working on restoring the service.