Previous incidents

February 2025
Feb 07, 2025
1 incident

Predictions degraded for L40S, H100, and CPU hardware types

Degraded

Resolved Feb 07 at 06:55pm UTC

We are now caught up and running below capacity. Thanks for your patience!

Feb 06, 2025
2 incidents

API instability for L40S, H100, and CPU workloads

Degraded

Resolved Feb 06 at 03:20pm UTC

This issue appears to have been caused by another bandwidth spike, which was partly a result of our incident earlier today. The issue is now resolved, and we are working to prevent incidents of this kind from recurring.

Setup failures on L40S and H100 hardware

Degraded

Resolved Feb 06 at 10:45am UTC

This incident is now resolved.

Feb 03, 2025
1 incident

Prediction creation unavailable for L40S and H100 hardware

Resolved Feb 03 at 06:30am UTC

The cache used by the API for predictions was misconfigured for about 20 minutes, from 20:34 UTC until a rollback completed at 20:56 UTC. Models using the L40S and H100 hardware types were affected. While the cache was misconfigured, prediction creation was severely limited, and many API requests returned status 503.

January 2025
Jan 30, 2025
1 incident

Instability and delays for H100 and L40S

Degraded

Resolved Jan 30 at 09:56am UTC

The networking issue with our provider was resolved at 09:40 UTC, and all requests have been running normally since then.

Jan 15, 2025
1 incident

Billing and metric delays

Degraded

Resolved Jan 15 at 04:29pm UTC

The background jobs are running again, and we caught up to the present at about 16:09 UTC (20 minutes ago).

Jan 09, 2025
1 incident

Dashboard inaccessible due to redirect

Degraded

Resolved Jan 09 at 04:37pm UTC

The redirect has been reverted and the dashboard should be accessible again.

December 2024
Dec 14, 2024
1 incident

L40S temporary stock-out

Resolved Dec 14 at 12:41am UTC

At 22:15 UTC on Dec 13, an issue forced us to shift some GPU workloads. This caused stock-outs, which increased wait times to spin up new model instances on L40S hardware.

The work is complete, and GPUs have been available as normal since 00:15 UTC on Dec 14.

Dec 12, 2024
1 incident

Data deletion delayed

Degraded

Resolved Dec 14 at 01:28pm UTC

We've caught up with prediction deletion, and our system is once again deleting predictions on time.

Dec 11, 2024
1 incident

T4 predictions unavailable

Resolved Dec 11 at 08:27pm UTC

T4 predictions were unavailable between approximately 18:00 and 20:27 UTC. We found an issue with the NVIDIA driver installation on our T4 hardware targets.

This only affected predictions running against the T4 hardware.

We have deployed a fix and are backfilling the outstanding predictions.