Model Startup Errors / Runtime Download Errors

Feb 16 at 05:48pm UTC
Affected services
Prediction serving

Feb 16 at 11:52pm UTC

Well, lots of things went wrong today. We've identified what we think are the last few things that were broken and fixed them. A newly rolled-out internal queueing service wasn't allowing traffic from model pods, which kept our prediction throughput far lower than normal and impeded recovery.

For an incident of this magnitude, we fully understand that many of our customers will want to know what happened. We're starting to piece it together and we'll have a proper write-up for you next week. If you have immediate concerns, please feel free to reach out to our customer team and we'll make it right.

Feb 16 at 09:29pm UTC

Demand and backlog remain high for GPUs in one of our regions. We have rebalanced traffic and are working with our providers to further increase available GPUs so the backlog can be worked through.

Feb 16 at 07:45pm UTC

We continue to see high demand and slow scheduling within one of our providers. Additionally, we have drastically increased our GPU count to address the continued backlog.

Engineers are working to rebalance traffic between providers to accelerate recovery.

Feb 16 at 06:10pm UTC

Scheduling of the backlog of A100 models continues to be slow in one of our regions. We are working through the queue backlog and the scaling of replicas.

Engineers are closely monitoring and will provide an update once the backlog has been cleared.

Feb 16 at 05:59pm UTC

We are seeing recovery begin, and many models are starting. In addition, we are looking into associated services (content-delivery-acceleration, etc.) to ensure all services are returning to normal working order.

Most models are now fully started and the backlog is minimal.

An additional update will be provided as soon as all services have been verified.

Feb 16 at 05:48pm UTC

An issue has been identified within one of our regions that is preventing model startups and downloads at runtime (inference).

We are working with our providers and within the region to correct the problem. Updates will be provided as they become available.

This impacts workloads on A40, some A100-40g hardware types, and A100-80g hardware types.