Setup failures on L40S and H100 hardware
Resolved
Feb 06 at 10:45am UTC
This incident is now resolved.
Affected services
Prediction serving
Updated
Feb 06 at 10:27am UTC
Most systems are now operating normally again. We are continuing to monitor the situation.
Affected services
Prediction serving
Updated
Feb 06 at 10:06am UTC
As some of you may have noticed, things got worse before they got better. When the upstream storage provider restored service, the backlog of models pending setup caused a large bandwidth surge. We're currently managing the effects of that surge, which has slowed predictions and prediction webhook delivery.
Affected services
Prediction serving
Updated
Feb 06 at 09:09am UTC
We've identified the underlying problem, a storage outage at an upstream provider, and are investigating ways to mitigate its impact.
Affected services
Prediction serving
Created
Feb 06 at 08:58am UTC
We're investigating an issue that's preventing some models running on L40S hardware from successfully completing setup. We'll update when we have more information.
Affected services
Prediction serving