Degraded

Disruption of Prediction Serving

Mar 08 at 06:25am UTC
Affected services
API
Prediction serving

Resolved
Mar 08 at 08:26am UTC

The remaining backlog is being worked through, and all services have now been restored to full functionality.

Again, thank you for your patience.

Updated
Mar 08 at 07:50am UTC

At this time, most services, including prediction serving, have returned to normal. We are seeing elevated and incorrect contention for the L40S and H100 GPU types, which is making scaling of new instances slower than expected.

Updated
Mar 08 at 07:21am UTC

L40S, H100, and CPU hardware types continue to see degraded prediction performance.

We are seeing subsequent failures as we work through our backlogs. We will provide updates as information becomes available.

Updated
Mar 08 at 06:54am UTC

As of this time, all services have been restored and the faulty hardware has been removed from production.

Thank you for your patience.

Updated
Mar 08 at 06:37am UTC

Critical services have been successfully migrated. As our final services come back online, we are monitoring them and expect continued degraded service.

Created
Mar 08 at 06:25am UTC

We are experiencing a disruption in prediction serving for the H200, L40S, and CPU hardware types due to a hardware failure.

We are in the process of bringing critical services back online.