Affected: Major outage from 1:41 AM to 9:51 AM
- Update
Earlier today, at around 02:41 UTC+1, services hosted on one of our clusters, including RTv2 and a portion of batch processing, went offline. The outage was caused by controllers that were undersized for the additional load generated by new nodes and pods added to the cluster. The resulting overload made the controllers, and subsequently the entire cluster, unreachable.
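For context, the sketch below (Python, purely illustrative) models controller load as a simple linear function of node and pod counts. The per-node and per-pod costs and the capacity figure are hypothetical placeholders, not measurements from our cluster; the point is how adding nodes and pods without resizing the controllers can push estimated load past provisioned capacity.

```python
# Illustrative only: a rough capacity check of the kind that could catch
# undersized controllers before new nodes and pods are added.
# The per-node/per-pod costs and capacity figures are hypothetical placeholders.

def controller_headroom(nodes: int, pods: int,
                        provisioned_cpu_millicores: int,
                        cpu_per_node: int = 10,
                        cpu_per_pod: int = 2) -> int:
    """Return remaining controller CPU (millicores) after the estimated load."""
    estimated_load = nodes * cpu_per_node + pods * cpu_per_pod
    return provisioned_cpu_millicores - estimated_load

# Adding nodes/pods without resizing the controllers drives headroom negative.
before = controller_headroom(nodes=50, pods=2_000, provisioned_cpu_millicores=8_000)
after = controller_headroom(nodes=120, pods=6_000, provisioned_cpu_millicores=8_000)
print(f"headroom before resize: {before}m, after: {after}m")
if after < 0:
    print("controllers undersized for the new node/pod count")
```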
An issue with our alerting service, Instatus, prevented the on-call team from being notified.
The investigation started at 08:00 UTC+1. A workaround was identified at 09:32 UTC+1, but it required additional time to stabilize. By 10:00 UTC+1, controllers and workers were operating stably, and by 11:00 UTC+1, RTv2 services were restored.
The root cause was traced to a human error during cluster resizing. The addition of new nodes and pods led to a spike in resource demand, which overwhelmed the controllers. Attempts to scale down nodes or to increase CPU and RAM on the affected machines were unsuccessful. Stability was achieved by restarting the controllers one by one and reducing their load, then recovering some GPU nodes.
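The sketch below illustrates the one-by-one restart sequencing described above. The hostnames, the service name, and the SSH-based health check are assumptions for illustration, not our actual tooling; the point is that only one controller is restarted at a time, and each must report healthy before the next is touched.

```python
# Minimal sketch of a one-by-one controller restart.
# Hostnames, the service name, and the health check are hypothetical.
import subprocess
import time

CONTROLLERS = ["controller-1.internal", "controller-2.internal", "controller-3.internal"]
SERVICE = "control-plane.service"  # placeholder name

def is_healthy(host: str) -> bool:
    """Consider the controller healthy when its service reports 'active'."""
    result = subprocess.run(
        ["ssh", host, "systemctl", "is-active", SERVICE],
        capture_output=True, text=True,
    )
    return result.stdout.strip() == "active"

for host in CONTROLLERS:
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", SERVICE], check=True)
    # Wait for this controller to come back before touching the next one,
    # so the cluster never loses more than one controller at a time.
    while not is_healthy(host):
        time.sleep(10)
```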
We will continue adding resources in the coming days to restore full functionality. We have also escalated the alerting issue to Instatus, and they are actively working on a fix. In addition, we will implement additional layers of alerting for greater reliability.
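As one possible additional alerting layer, an independent watchdog could probe the service directly and page through a separate channel when the probe fails. The sketch below is only an illustration of that idea, not our actual implementation; both URLs are hypothetical placeholders.

```python
# Sketch of a secondary alerting layer: probe the service directly and page
# through a separate channel if the probe fails. URLs are placeholders.
import json
import urllib.error
import urllib.request

HEALTH_URL = "https://api.example.com/health"        # hypothetical endpoint
PAGER_WEBHOOK = "https://pager.example.com/webhook"  # hypothetical second channel

def probe(url: str, timeout: int = 10) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def page(message: str) -> None:
    """Send an alert to the secondary paging channel."""
    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        PAGER_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)

if not probe(HEALTH_URL):
    page("Real-time STT API health check failed; primary alerting may be down.")
```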
- Resolved
This incident has been resolved.
- Monitoring
We implemented a fix and are currently monitoring the result.
- Identified
The Real-time STT API is currently unavailable. The issue has been identified and we are working on restoring the service.