Previous incidents

September 2023
Sep 29, 2023
1 incident

Web and predictions degraded

Downtime

Resolved Sep 29 at 04:12pm UTC

We have now fully resolved the issues, and the API and the replicate.com website are fully operational.

3 previous updates

Sep 28, 2023
1 incident

Predictions and training degraded for one cloud provider

Degraded

Resolved Sep 28 at 02:52pm UTC

The API and website are now working as expected for predictions and trainings.

1 previous update

Sep 27, 2023
2 incidents

Degraded API / API Errors

Degraded

Resolved Sep 27 at 07:31pm UTC

We have identified a problematic ingress pod and forced it to reschedule. The API and website are now working as expected for predictions and trainings.

1 previous update

Website downtime / API degraded

Downtime

Resolved Sep 27 at 05:56pm UTC

We have rolled back the problematic change. Website functionality has been restored and the API error rate has returned to normal.

1 previous update

Sep 21, 2023
2 incidents

Slow start on some predictions and trainings (A40 and some A100)

Degraded

Resolved Sep 21 at 09:06pm UTC

We have worked through the pending predictions and trainings and now see normal start times.

4 previous updates

System unavailable

Downtime

Resolved Sep 21 at 08:37pm UTC

We have recovered our caching service and see predictions and trainings succeeding.

1 previous update

Sep 20, 2023
1 incident

Temporary capacity issues with 8xA40 hardware type

Degraded

Resolved Sep 21 at 12:39am UTC

We resolved the capacity issues.

2 previous updates

Sep 19, 2023
1 incident

Primary database outage

Downtime

Resolved Sep 19 at 09:17pm UTC

Both the API and the website are now back to normal. Predictions and trainings are functioning as expected.

We are continuing to monitor things.

2 previous updates

Sep 18, 2023
1 incident

API degraded

Degraded

Resolved Sep 18 at 04:50pm UTC

The API is now behaving normally.

1 previous update

Sep 08, 2023
1 incident

Web and predictions degraded

Downtime

Resolved Sep 08 at 12:37pm UTC

Everything is resolved and back to normal.

During the downtime, predictions were completing normally in the API but were not persisted.

1 previous update

Sep 07, 2023
1 incident

Degraded Prediction and Training Start Times

Degraded

Resolved Sep 07 at 08:06pm UTC

The issue with the upstream provider has been resolved. Predictions and trainings should now start within normal timeframes.

1 previous update

Sep 06, 2023
1 incident

Degraded Prediction Handling

Degraded

Resolved Sep 06 at 04:15pm UTC

Prediction processing and predictions are now working as expected.

1 previous update

Sep 01, 2023
1 incident

Issues starting predictions

Degraded

Resolved Sep 01 at 08:45pm UTC

Everything should be working normally at this time.

4 previous updates

August 2023
Aug 19, 2023
1 incident

Issues scheduling to certain hardware

Degraded

Resolved Aug 19 at 11:41pm UTC

Thank you for your patience. At this time, all hung workloads targeting T4 hardware should no longer be stuck in the starting phase.

2 previous updates

Aug 18, 2023
1 incident

Replicate Web Down

Downtime

Resolved Aug 18 at 03:22pm UTC

Engineers have rolled back a change to the website, and the website should now be responding as expected.

4 previous updates

Aug 15, 2023
1 incident

Website and API Outage

Downtime

Resolved Aug 15 at 07:28pm UTC

Reverting the identified change and purging known bad cache values has resolved the errors within the API service. The API and website should be responding as expected at this time.

2 previous updates

Aug 11, 2023
1 incident

Delays starting some models

Degraded

Resolved Aug 11 at 03:13pm UTC

We believe that as of a few minutes ago the last customer impact from this issue has been resolved and all queues have cleared. To help you correlate this incident with any issues you may have seen: as far as we can tell the earliest customer impact from this incident started at about 11:00 UTC today.

3 previous updates

Aug 09, 2023
1 incident

Models not booting

Degraded

Resolved Aug 09 at 10:14pm UTC

The fix has been rolled out for all models.

2 previous updates

Aug 03, 2023
1 incident

520 error responses from API

Degraded

Resolved Aug 03 at 01:48pm UTC

We've identified the source of the errors -- a global load balancing service appears to have been misbehaving -- and made changes to how we serve api.replicate.com to mitigate the problem. As of a few minutes ago, we are no longer serving 520 error responses to customers.

1 previous update

Aug 02, 2023
1 incident

Replicate website unavailable

Downtime

Resolved Aug 02 at 04:48pm UTC

We're back! We pushed a bad change and have rolled it back. Sorry for the inconvenience.

1 previous update

July 2023
Jul 31, 2023
1 incident

Prediction requests failing

Downtime

Resolved Jul 31 at 12:15pm UTC

All prediction requests are now responding normally. We're still investigating the underlying cause.

1 previous update

Jul 28, 2023
2 incidents

API errors/timeouts

Degraded

Resolved Jul 28 at 06:10pm UTC

The API is fully recovered. Unfortunately we are still at least partially in the dark about what triggered these problems. We're continuing to investigate.

2 previous updates

API errors/timeouts

Degraded

Resolved Jul 28 at 05:32am UTC

Services have recovered. We'll be following up with our provider to understand how the scope of the planned maintenance expanded to affect customer workloads.

4 previous updates

Jul 26, 2023
1 incident

API errors

Degraded

Resolved Jul 26 at 10:51am UTC

We've identified a service that was starved of compute resources and addressed that problem. Service has been restored.

1 previous update

Jul 19, 2023
1 incident

Prediction creation errors

Degraded

Resolved Jul 19 at 06:35pm UTC

We've restored service to the queueing system and predictions are flowing again.

2 previous updates