Previous incidents

December 2023
Dec 26, 2023
1 incident

Intermittent Failures due to networking

Degraded

Resolved Dec 26 at 06:58pm UTC

The error rate has subsided, and models have returned to their previous startup and runtime behavior. We are working with our providers to mitigate the impact of future incidents like this.

3 previous updates

Dec 22, 2023
2 incidents

Models not starting

Degraded

Resolved Dec 22 at 08:49pm UTC

The fix has been deployed and all model starts should be back to normal.

3 previous updates

Model Setup Failures

Degraded

Resolved Dec 22 at 02:15am UTC

All services are working as expected, and all workarounds have been reverted to normal behavior. Additionally, we have made improvements so that we can respond more quickly with mitigations to any future incidents of this kind.

2 previous updates

Dec 21, 2023
1 incident

Model setup failing

Degraded

Resolved Dec 21 at 12:39am UTC

Code has been rolled back and models are no longer failing setup due to this issue.

1 previous update

Dec 19, 2023
1 incident

Models not booting

Degraded

Resolved Dec 19 at 01:46pm UTC

All queues have been processed and service should be back to normal. Sorry for the interruption, folks.

3 previous updates

Dec 06, 2023
1 incident

Slow Model Startup

Degraded

Resolved Dec 06 at 10:14pm UTC

We have cleared up the backlog of models seeing slow starts.

1 previous update

Dec 02, 2023
1 incident

NVIDIA Driver Issues

Resolved Dec 02 at 03:15pm UTC

We have identified a few nodes within one of our regions that exhibit issues with NVIDIA drivers not being installed. We have isolated these nodes from further workload scheduling (both inference and training) and will recycle the problematic nodes.

Dec 01, 2023
1 incident

Container Images pull delays

Degraded

Resolved Dec 01 at 10:34pm UTC

Thank you for your patience. We have cleared up the remaining backlog of pending workloads. Inference and trainings are now running as expected for all hardware types.

2 previous updates

November 2023
Nov 14, 2023
1 incident

A100 GPU maintenance

Maintenance

Resolved Nov 14 at 06:59pm UTC

The maintenance event has passed. We believe impact to Replicate customers was minimal.

1 previous update

Nov 10, 2023
1 incident

Problems running some A40 models

Degraded

Resolved Nov 10 at 10:09pm UTC

We have confirmed and corrected any model versions erroneously disabled during this issue.

Use of A40s for predictions and trainings is now working as expected.

4 previous updates

Nov 08, 2023
1 incident

Replicate website unavailable

Downtime

Resolved Nov 08 at 03:53pm UTC

It looks to us like one of our providers had a brief outage and things are now coming back. We're continuing to monitor the situation.

(Technical details: it looks like an upstream provider had a brief DNSSEC zone signing outage.)

1 previous update

Nov 06, 2023
1 incident

Slow model startup in some cases

Degraded

Resolved Nov 06 at 12:53am UTC

The slow model startup issue has been resolved. We will continue to work internally and with our provider to remediate the root cause.

1 previous update

Nov 05, 2023
1 incident

Slower predictions and webhook delivery

Degraded

Resolved Nov 05 at 08:03am UTC

The prediction and webhook delivery issues are now resolved. There might still be a delay in webhook delivery for older predictions.

2 previous updates

Nov 02, 2023
1 incident

Investigating predictions creation issues

Degraded

Resolved Nov 02 at 06:53pm UTC

The issue has been resolved and predictions are now functioning normally.

2 previous updates

Nov 01, 2023
2 incidents

Replicate Web Internal Service Error

Downtime

Resolved Nov 01 at 09:43pm UTC

The rollback of the problematic change has completed, and the Replicate website is now functioning normally again.

2 previous updates

SDXL Finetune errors

Degraded

Resolved Nov 01 at 06:37pm UTC

We have rolled out a fix and confirmed finetunes are working as expected.

2 previous updates

October 2023
Oct 19, 2023
1 incident

Predictions and trainings degraded

Degraded

Resolved Oct 19 at 02:30pm UTC

Predictions and trainings are back to normal.

1 previous update

Oct 10, 2023
1 incident

replicate.com database maintenance

Maintenance

Resolved Oct 10 at 12:24pm UTC

All done! Thanks for your patience.

2 previous updates

Oct 08, 2023
1 incident

Webhook delivery interrupted

Degraded

Resolved Oct 08 at 08:13pm UTC

We identified a problem affecting a small portion of customers -- slow responses to webhooks caused a backlog in processing outbound webhooks -- and have deployed a change to increase available webhook processing capacity. Webhook delivery is back to normal as of a few minutes ago.

1 previous update

Oct 06, 2023
1 incident

Pushing of new versions is broken

Degraded

Resolved Oct 06 at 11:20am UTC

We've fixed the issue and you should be able to push new versions again.

1 previous update

Oct 05, 2023
1 incident

Slow responses from replicate.com

Resolved Oct 05 at 07:29am UTC

Database load is back to normal, and performance should be back to usual levels.