Gladia - Notice history

All systems operational

API - Operational

100% - uptime
Nov 2024 · 99.99%, Dec 2024 · 100.0%, Jan 2025 · 100.0%

Pre-Recorded v2 - Operational

100% - uptime
Nov 2024 · 98.39%, Dec 2024 · 100.0%, Jan 2025 · 99.91%

Real-Time v1 - Operational

100% - uptime
Nov 2024 · 99.60%, Dec 2024 · 100.0%, Jan 2025 · 100.0%

Real-Time v2 - Operational

99% - uptime
Nov 2024 · 99.34%, Dec 2024 · 100.0%, Jan 2025 · 98.79%

App - Operational

100% - uptime
Nov 2024 · 100.0%, Dec 2024 · 100.0%, Jan 2025 · 100.0%

Notice history

Jan 2025

API is unreachable via Recall.ai
  • Resolved

    Starting at around 15:30 UTC+1, all requests coming from Recall.ai began failing. The issue was traced to a specific edge case in file uploads: when the client does not provide a filename, our system has to generate one. Most HTTP clients and browsers include a filename by default, which makes the issue rare, but Recall.ai's integration did not, which triggered the problem (see the client-side sketch at the end of this notice).

    This behavior was caused by a recent change deployed at 15:00 UTC+1 to support JSON-formatted logs. As part of that change, the request ID, which is used to generate a filename when none is provided, was no longer accessible, so affected uploads failed with 400 Bad Request errors (a sketch of this fallback appears at the end of this notice).

    The issue was identified at 15:30 UTC+1. A fix was reviewed by 16:30 UTC+1 and deployed by 17:00 UTC+1, resolving the problem.

    We are updating our test coverage to include this specific edge case to prevent similar incidents in the future.

  • Monitoring

    A fix has been deployed and uploads are working again. We are monitoring the situation.

  • Investigating

    The Gladia API is currently unavailable via Recall.ai. We have reached out to their team to find a solution.
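
For API callers, the simplest guard against this edge case is to always set an explicit filename on the multipart upload. The minimal Python sketch below illustrates this; the endpoint URL, header name, and form field are assumptions for illustration and should be checked against the API reference.

    # Hedged example: send the audio part with an explicit filename so the
    # server never needs to fall back to generating one. Endpoint, header,
    # and field names are assumptions, not taken from this notice.
    import requests

    with open("meeting.wav", "rb") as audio:
        response = requests.post(
            "https://api.gladia.io/v2/upload",           # assumed endpoint
            headers={"x-gladia-key": "YOUR_API_KEY"},    # assumed auth header
            # The (filename, fileobj, content_type) tuple makes the filename
            # explicit in the multipart part.
            files={"audio": ("meeting.wav", audio, "audio/wav")},
        )
    response.raise_for_status()
    print(response.json())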
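
The fallback itself can be pictured as a small helper that prefers the client-supplied filename and only derives one when it is missing. This is an illustrative sketch, not our actual code; the names and the final UUID guard are assumptions.

    # Illustrative sketch of the filename fallback described above.
    import uuid
    from typing import Optional

    def resolve_filename(client_filename: Optional[str], request_id: Optional[str]) -> str:
        """Prefer the client-supplied filename; otherwise derive one."""
        if client_filename:
            return client_filename
        if request_id:
            # Normal fallback path: derive a name from the request ID.
            return f"upload-{request_id}.audio"
        # Last-resort guard so a missing request ID (as in this incident)
        # degrades to a generated name instead of a 400 Bad Request.
        return f"upload-{uuid.uuid4().hex}.audio"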

Real Time V2 outage
  • Update

    Earlier today, at around 02:41 UTC+1, services hosted on one of our clusters, including RTv2 and a portion of batch processing, went offline. The cluster's controllers were not sized to handle the additional load generated by newly added nodes and pods, and the resulting overload made the controllers, and subsequently the entire cluster, unreachable.

    An issue with our alerting service, Instatus, prevented the on-call team from being notified.

    Investigations started at 08:00 UTC+1. A workaround was identified at 09:32 UTC+1, but it required additional time to stabilize. By 10:00 UTC+1, controllers and workers were operating stably, and by 11:00 UTC+1, RTv2 services were restored.

    The root cause was traced to a human error during cluster resizing: the addition of new nodes and pods led to a spike in resource demand that overwhelmed the controllers. Attempts to scale down nodes or to increase CPU and RAM on the affected machines were unsuccessful. Stability was achieved by restarting all controllers one by one and reducing their load, then recovering some GPU nodes (a sketch of this rolling-restart pattern appears at the end of this notice).

    We will continue adding resources in the coming days to restore full functionality. Additionally, we have escalated the issue to Instatus, and they are actively working on a fix. We will implement additional layers of alerting for greater reliability.

  • Resolved

    This incident has been resolved.

  • Monitoring

    We implemented a fix and are currently monitoring the result.

  • Identified

    The real-time STT API is currently unavailable. The issue has been identified and we are working on restoring service.
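
The recovery step described in the update above, restarting the controllers one by one, follows a rolling-restart pattern: restart a single controller, wait for it to report healthy, then move on, so the control plane is never fully offline. The sketch below only illustrates that pattern; the hosts, restart command, and health endpoint are placeholders, not our actual tooling.

    # Hedged sketch of a rolling controller restart (placeholder hosts and commands).
    import subprocess
    import time

    import requests

    CONTROLLERS = ["ctrl-1.internal", "ctrl-2.internal", "ctrl-3.internal"]

    def wait_until_healthy(host: str, timeout: int = 300) -> None:
        """Poll a placeholder health endpoint until the controller answers."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                if requests.get(f"https://{host}:6443/healthz", timeout=5, verify=False).ok:
                    return
            except requests.RequestException:
                pass
            time.sleep(10)
        raise TimeoutError(f"{host} did not become healthy in time")

    for host in CONTROLLERS:
        # Placeholder restart command; the real mechanism depends on how the
        # controllers are managed.
        subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "control-plane"], check=True)
        wait_until_healthy(host)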

Dec 2024

No notices reported this month

Nov 2024

No notices reported this month
