Errors within one region
Resolved
Mar 06 at 04:06am UTC
Workloads across all regions are now running normally. We apologize for the disruption, and we are working to improve our ability to shift load between providers in situations like this one.
Affected services
API
Prediction serving
Updated
Mar 06 at 03:05am UTC
Service remains degraded, but work is starting to flow again. We will continue monitoring and will post an update when the service has fully recovered.
Affected services
API
Prediction serving
Updated
Mar 06 at 02:33am UTC
We're continuing to work with our provider, as one of our regions is currently unable to handle traffic. Workloads running on A40 and A100 (80GB) hardware are particularly affected.
Affected services
API
Prediction serving
Updated
Mar 06 at 01:58am UTC
The incident involves network services within one of our providers. We'll provide further updates as the situation evolves.
We apologize for the inconvenience and thank you for your patience during this time.
Affected services
Prediction serving
Created
Mar 06 at 01:45am UTC
One of our regions is seeing elevated error rates for inference and training. We are working with our provider to determine the root cause and remediate the issue.
This impacts A40, A100-80G, and a subset of A100-40G hardware types, and can impact some language models (token-based pricing).
Affected services
Prediction serving