Workloads in one of our clusters are backed up
Resolved
May 29 at 04:06pm UTC
All queues have been cleared, and predictions and trainings are running normally again.
Affected services
Prediction serving
Updated
May 29 at 03:25pm UTC
Predictions and trainings are running again. Substantial queues remain, so it will take a while for the autoscaler to work through the backlog.
We'll keep monitoring until the cluster has fully recovered.
Affected services
Prediction serving
Updated
May 29 at 03:22pm UTC
The majority of predictions and trainings are failing to start in one of our clusters. All A40 workloads and most A100 workloads are affected.
The upstream provider is investigating the issue.
Affected services
Prediction serving
Created
May 29 at 03:17pm UTC
We're investigating an issue with predictions and trainings in one of our clusters, caused by an incident at one of our providers. Workloads running on A40s and A100s are affected.
Affected services
Prediction serving