Previous incidents

November 2023
Nov 10, 2023
1 incident

Problems running some A40 models

Degraded

Resolved Nov 10 at 10:09pm UTC

We have confirmed and corrected any model versions erroneously disabled during this issue.

Use of A40s for predictions and trainings is now working as expected.

Nov 08, 2023
1 incident

Replicate website unavailable

Downtime

Resolved Nov 08 at 03:53pm UTC

It appears that one of our providers had a brief outage, and services are now recovering. We're continuing to monitor the situation.

(Technical details: it looks like an upstream provider had a brief DNSSEC zone signing outage.)

Nov 06, 2023
1 incident

Slow model startup in some cases

Degraded

Resolved Nov 06 at 12:53am UTC

The slow model startup has resolved. We will continue to work internally and with our provider to remediate the root cause.

Nov 05, 2023
1 incident

Slower predictions and webhook delivery

Degraded

Resolved Nov 05 at 08:03am UTC

The prediction and webhook delivery issues are now resolved. There may still be a delay in webhook delivery for older predictions.

Nov 02, 2023
1 incident

Investigating predictions creation issues

Degraded

Resolved Nov 02 at 06:53pm UTC

The issue has been resolved and predictions are now functioning normally.

Nov 01, 2023
2 incidents

Replicate Web Internal Service Error

Downtime

Resolved Nov 01 at 09:43pm UTC

The rollback of the problematic change has completed, and the Replicate website is functioning normally again.

SDXL Finetune errors

Degraded

Resolved Nov 01 at 06:37pm UTC

We have rolled out a fix and confirmed finetunes are working as expected.

October 2023
Oct 19, 2023
1 incident

Predictions and trainings degraded

Degraded

Resolved Oct 19 at 02:30pm UTC

Predictions and trainings are back to normal.

Oct 08, 2023
1 incident

Webhook delivery interrupted

Degraded

Resolved Oct 08 at 08:13pm UTC

We identified a problem affecting a small portion of customers: slow responses to webhooks caused a backlog in processing outbound webhooks. We have deployed a change that increases available webhook processing capacity, and webhook delivery has been back to normal for the past few minutes.

Oct 06, 2023
1 incident

Pushing new versions is broken

Degraded

Resolved Oct 06 at 11:20am UTC

We've fixed the issue and you should be able to push new versions again.

Oct 05, 2023
1 incident

Slow responses from replicate.com

Resolved Oct 05 at 07:29am UTC

Database load is back to normal, and performance should be back at usual levels.

September 2023
Sep 29, 2023
1 incident

Web and predictions degraded

Downtime

Resolved Sep 29 at 04:12pm UTC

We have fully resolved the issues, and the API and the replicate.com website are fully operational.

Sep 28, 2023
1 incident

Predictions and training degraded for one cloud provider

Degraded

Resolved Sep 28 at 02:52pm UTC

The API and website are now working as expected for predictions and trainings.

Sep 27, 2023
2 incidents

Degraded API / API Errors

Degraded

Resolved Sep 27 at 07:31pm UTC

We identified a problematic ingress pod and forced it to reschedule. The API and website are now working as expected for predictions and trainings.

Website downtime / API degraded

Downtime

Resolved Sep 27 at 05:56pm UTC

We have rolled back the problematic change. Website functionality has been restored and API error rate has returned to normal.

Sep 21, 2023
2 incidents

Slow start on some predictions and trainings (A40 and some A100)

Degraded

Resolved Sep 21 at 09:06pm UTC

We have worked through the pending predictions and trainings and now see normal start times.

System unavailable

Downtime

Resolved Sep 21 at 08:37pm UTC

We have recovered our caching service, and predictions and trainings are succeeding.

Sep 20, 2023
1 incident

Temporary capacity issues with 8xA40 hardware type

Degraded

Resolved Sep 21 at 12:39am UTC

We resolved the capacity issues.

Sep 19, 2023
1 incident

Primary database outage

Downtime

Resolved Sep 19 at 09:17pm UTC

Both the API and the website are now back to normal. Predictions and trainings are functioning as expected.

We are continuing to monitor things.

Sep 18, 2023
1 incident

API degraded

Degraded

Resolved Sep 18 at 04:50pm UTC

The API is now behaving normally.

Sep 08, 2023
1 incident

Web and predictions degraded

Downtime

Resolved Sep 08 at 12:37pm UTC

Everything is resolved and back to normal.

During the downtime, predictions completed normally through the API, but their results were not persisted.

Sep 07, 2023
1 incident

Degraded Prediction and Training Start Times

Degraded

Resolved Sep 07 at 08:06pm UTC

The issue with the upstream provider has been resolved. Predictions and trainings should now start within normal timeframes.

Sep 06, 2023
1 incident

Degraded Prediction Handling

Degraded

Resolved Sep 06 at 04:15pm UTC

Prediction processing and predictions are now working as expected.

Sep 01, 2023
1 incident

Issues starting predictions

Degraded

Resolved Sep 01 at 08:45pm UTC

Everything should be working normally at this time.
