Previous incidents

May 2024
May 30, 2024
1 incident

Problems booting models in one region

Degraded

Resolved May 30 at 07:03pm UTC

All outstanding issues have been resolved. Model boots and setups should be functioning normally again.

2 previous updates

May 29, 2024
1 incident

Workloads in one of our clusters are backed up

Downtime

Resolved May 29 at 04:06pm UTC

All queues have been dealt with, and predictions and trainings are running smoothly once again.

3 previous updates

May 22, 2024
1 incident

Degraded autoscaling performance

Degraded

Resolved May 22 at 02:05pm UTC

Backlogs have been cleared and all models are now running smoothly.

3 previous updates

May 17, 2024
1 incident

5XX and slow responses

Degraded

Resolved May 17 at 12:09am UTC

The problem appears to have been caused by our API being unable to connect to one of its underlying data stores, most likely due to a network interruption. Connectivity recovered at 00:02 UTC and traffic is being served normally once again. We will continue to monitor.

1 previous update

May 09, 2024
1 incident

Webhooks not sending for Dreambooth trainings

Degraded

Resolved May 09 at 07:11pm UTC

Webhooks for Dreambooth trainings are working again.

1 previous update

April 2024
Apr 12, 2024
1 incident

Degraded Service

Degraded

Resolved Apr 12 at 05:20am UTC

At this time service has been restored. All inference (prediction serving) and model instance starts have returned to normal.

2 previous updates

Apr 11, 2024
1 incident

Degraded service

Degraded

Resolved Apr 11 at 02:05pm UTC

Our systems indicate that the problem has been resolved. We will continue monitoring the situation.

2 previous updates

March 2024
Mar 14, 2024
1 incident

A40 models scaling slowly

Degraded

Resolved Mar 14 at 09:19am UTC

All but a very small slice of our A40 hardware is back online, and Replicate workloads are processing normally. We again thank you for your patience.

5 previous updates

Mar 06, 2024
1 incident

Errors within one region

Downtime

Resolved Mar 06 at 04:06am UTC

Workloads across all regions are now running normally. We apologise for the disruption and will work to improve our ability to shift load between providers in situations like this one.

4 previous updates

Mar 05, 2024
1 incident

API, Inference, Web Page

Downtime

Resolved Mar 05 at 06:54am UTC

We identified excessive load on our database. Shortly after the root cause was isolated, our engineering team disabled the problematic code paths.

The degraded database responsiveness caused the general service outage, which began at approximately 06:23 UTC.

At this time API, Inference, and Web are fully functional. Any predictions that resulted in errors or were marked as failed during this window can be safely re-run.
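
To find the predictions worth re-running, one option is to filter your own prediction records for failures that fall inside the outage window. The sketch below is illustrative only: the record layout (`id`, `status`, `created_at`) and the helper `failed_in_window` are hypothetical and not taken from any client library. The window runs from the approximate 06:23 UTC start to the 06:54 UTC resolution, and since the start time is approximate you may want to widen it slightly.

```python
from datetime import datetime, timezone

# Outage window from the incident report: approximately 06:23 UTC
# until the 06:54 UTC resolution on Mar 05, 2024.
WINDOW_START = datetime(2024, 3, 5, 6, 23, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 3, 5, 6, 54, tzinfo=timezone.utc)


def failed_in_window(records, start=WINDOW_START, end=WINDOW_END):
    """Return the records marked 'failed' that were created within [start, end].

    Each record is assumed (hypothetically) to be a dict with 'status' and an
    ISO 8601 'created_at' string that carries an explicit UTC offset.
    """
    selected = []
    for record in records:
        created = datetime.fromisoformat(record["created_at"])
        if record["status"] == "failed" and start <= created <= end:
            selected.append(record)
    return selected


# Made-up sample records, for illustration only.
records = [
    {"id": "a1", "status": "failed", "created_at": "2024-03-05T06:30:00+00:00"},
    {"id": "b2", "status": "succeeded", "created_at": "2024-03-05T06:40:00+00:00"},
    {"id": "c3", "status": "failed", "created_at": "2024-03-05T07:30:00+00:00"},
    {"id": "d4", "status": "failed", "created_at": "2024-03-05T06:23:00+00:00"},
]

to_rerun = [record["id"] for record in failed_in_window(records)]
```

With these sample records, `to_rerun` holds the IDs of the two failures inside the window. Each one can then be resubmitted with its original input.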

1 previous update