Predictions degraded for L40S, H100, and CPU hardware types
Resolved
Feb 07 at 06:55pm UTC
We are now caught up and running below capacity. Thanks for your patience!
Affected services
Prediction serving
Updated
Feb 07 at 06:30pm UTC
We are currently running at capacity. Most queues have caught up, but delays are still possible, so we will keep this incident open in a "degraded" state.
Affected services
Prediction serving
Updated
Feb 07 at 06:07pm UTC
We have cleaned up all models that were crashing or locked up, and we are now scaled out to maximum capacity while we work through the queue backlogs.
Affected services
Prediction serving
Created
Feb 07 at 05:40pm UTC
Most of the delays we are seeing right now are caused by models failing to set up, likely due to a combination of configuration changes that are not working as intended. We have reverted those changes and are now cleaning up models that are crash looping or locked up. Capacity is starting to recover.
Affected services
Prediction serving