H100 model serving down
Resolved
Oct 29 at 05:46pm UTC
We have moved traffic for the impacted models (those targeting H100 hardware) back to the H100-class GPUs. Predictions and trainings targeting H100-class GPUs have returned to normal.
Affected services
API
Prediction serving
Updated
Oct 29 at 04:10pm UTC
Predictions on flux-dev are now also running in a different cluster.
Affected services
API
Prediction serving
Updated
Oct 29 at 03:45pm UTC
Predictions on flux-schnell and flux fine tunes are successfully running in another cluster. Predictions on flux-dev are still not working.
Affected services
API
Prediction serving
Updated
Oct 29 at 03:37pm UTC
We've moved flux models and fine tunes to run in a different cluster until we can get this cluster back online.
Affected services
API
Prediction serving
Created
Oct 29 at 03:28pm UTC
One of our clusters is currently down. We know the immediate cause and are working on a fix. This is the cluster that runs our H100s, so all H100 models are currently down, including flux and flux fine tunes.
Affected services
API
Prediction serving