Model startup Errors / Runtime Download Errors
Resolved
Feb 16 at 11:52pm UTC
Well, lots of things went wrong today. We've identified what we think are the last few things that were broken and fixed them. A newly rolled out internal queueing service didn't allow traffic from model pods, and that caused our prediction throughput to be far lower than normal, which was impeding recovery.
For an incident of this magnitude we fully understand that many of our customers will want to know what happened. We're starting to piece it together and we'll have a proper write-up for you next week. If you have immediate concerns please feel free to reach out to our customer team and we'll make it right.
Affected services
API
Prediction serving
Updated
Feb 16 at 09:29pm UTC
Demand and backlog remain high for GPUs in one of our regions. We have rebalanced traffic and are working with our providers to further increase available GPU capacity so the backlog can be worked through.
Affected services
API
Prediction serving
Updated
Feb 16 at 07:45pm UTC
We continue to see high demand and slow scheduling within one of our providers. Additionally, we have significantly increased our GPU count to address the continued backlog.
Engineers are working to rebalance traffic between providers to accelerate recovery.
Affected services
API
Prediction serving
Updated
Feb 16 at 06:10pm UTC
Scheduling of the A100 model backlog continues to be slow in one of our regions. We are working through the queued backlog and scaling up replicas.
Engineers are closely monitoring and will provide an update once the backlog has been cleared.
Affected services
API
Prediction serving
Updated
Feb 16 at 05:59pm UTC
We are seeing recovery begin, and many models are now starting. In addition, we are checking associated services (content-delivery-acceleration, etc.) to ensure all services are returning to normal working order.
Most models are now fully started and the backlog is minimal.
An additional update will be provided as soon as all services have been verified.
Affected services
API
Prediction serving
Created
Feb 16 at 05:48pm UTC
We have identified an issue within one of our regions that is preventing model startups and downloads at runtime (inference).
We are working with our providers and within the region to correct the problem. Updates will be provided as they become available.
This impacts workloads on A40, some A100-40g, and A100-80g hardware types.
Affected services
API
Prediction serving