Previous incidents

March 2024
Mar 14, 2024
1 incident

A40 models scaling slowly

Degraded

Resolved Mar 14 at 09:19am UTC

All but a very small slice of our A40 hardware is back online, and Replicate workloads are processing normally. We again thank you for your patience.

5 previous updates

Mar 06, 2024
1 incident

Errors within one region

Downtime

Resolved Mar 06 at 04:06am UTC

Workloads across all regions are now running normally. We apologise for the disruption, and will be working to improve our ability to shift load between providers in situations like this one.

4 previous updates

Mar 05, 2024
1 incident

API, Inference, Web Page

Downtime

Resolved Mar 05 at 06:54am UTC

We identified excessive load on our database. Shortly after the root cause was isolated, our engineering team disabled the problematic code paths.

Degradation of database responsiveness resulted in the general service outage beginning at approximately 06:23 UTC.

At this time API, Inference, and Web are fully functional. Any predictions that resulted in errors or were marked as failed during this window can be safely re-run.
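
As a rough illustration of the re-run advice above, the sketch below picks out failed predictions created during the outage window. The window bounds come from this incident's timestamps (start at roughly 06:23 UTC, resolution at 06:54 UTC). The `PredictionRecord` shape, the `"failed"` status string, and the function names are assumptions made for the example, not Replicate's API. It assumes you keep your own log of prediction IDs, statuses, and creation times.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Window taken from the incident text: outage began at about 06:23 UTC
# and was marked resolved at 06:54 UTC on Mar 05, 2024.
WINDOW_START = datetime(2024, 3, 5, 6, 23, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 3, 5, 6, 54, tzinfo=timezone.utc)


@dataclass
class PredictionRecord:
    """Hypothetical record from your own prediction log."""
    id: str
    status: str            # assumed status string, e.g. "failed" or "succeeded"
    created_at: datetime   # timezone-aware creation time


def needs_rerun(record, start=WINDOW_START, end=WINDOW_END):
    """True if the record failed and was created inside the window."""
    return record.status == "failed" and start <= record.created_at <= end


def select_for_rerun(records):
    """Return the IDs of records that should be submitted again."""
    return [r.id for r in records if needs_rerun(r)]
```

The selected IDs can then be resubmitted with whatever client you already use. Failures outside the window are left alone, since they may have an unrelated cause.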

1 previous update

February 2024
Feb 28, 2024
1 incident

Models Affected by Hugging Face Hub Outage

Degraded

Resolved Feb 29 at 01:57am UTC

We are seeing HF Hub return to full functionality. The backlog of models that were blocked on interacting with Hugging Face (downloading or otherwise) has recovered.

1 previous update

Feb 26, 2024
1 incident

Create prediction/training API unavailable

Resolved Feb 26 at 01:13pm UTC

From approximately 13:05 to 13:11 UTC, our prediction and training creation endpoints were unavailable. Existing predictions and trainings were unaffected, but no new predictions or trainings could be created.

The problem has since been resolved.

Feb 22, 2024
2 incidents

Dropped predictions

Degraded

Resolved Feb 22 at 05:46pm UTC

This issue has been resolved and predictions are now flowing again.

1 previous update

API errors

Degraded

Resolved Feb 22 at 09:53am UTC

The problems resolved automatically at 09:41 UTC. We are monitoring the situation.

1 previous update

Feb 19, 2024
1 incident

Models stuck booting

Degraded

Resolved Feb 19 at 04:27pm UTC

The models that were stuck booting have been fixed, and inference and trainings are both running normally again.

2 previous updates

Feb 16, 2024
1 incident

Model startup Errors / Runtime Download Errors

Degraded

Resolved Feb 16 at 11:52pm UTC

Well, lots of things went wrong today. We've identified what we think are the last few things that were broken and fixed them. A newly rolled out internal queueing service didn't allow traffic from model pods, and that caused our prediction throughput to be far lower than normal, which was impeding recovery.

For an incident of this magnitude we fully understand that many of our customers will want to know what happened. We're starting to piece it together and we'll have a proper write-up f...

5 previous updates

Feb 10, 2024
1 incident

Errors downloading weights on model startup

Degraded

Resolved Feb 10 at 11:16pm UTC

We've not seen any failures after 22:50 UTC, so we're calling this incident resolved.

Our investigation revealed that internal DNS lookup failures put a storage cache subsystem into a broken state. Next week we'll be looking into how to make our systems more robust in situations like this one.

Thank you for your patience.

3 previous updates

Feb 08, 2024
1 incident

Errors downloading weights on model startup

Degraded

Resolved Feb 08 at 10:00pm UTC

Mitigations are in place and now models are again downloading weights as expected.

All model startups within the affected region have returned to normal.

1 previous update

Feb 06, 2024
1 incident

Trained versions failing setup

Degraded

Resolved Feb 06 at 06:30pm UTC

We've resolved this issue and have re-enabled all affected versions.

1 previous update

January 2024
Jan 29, 2024
1 incident

Delayed Model Start

Degraded

Resolved Jan 29 at 06:31pm UTC

The backlog of models has been cleared. Model start times are back to expected levels.

3 previous updates

Jan 22, 2024
1 incident

Errors completing predictions

Downtime

Resolved Jan 22 at 07:34pm UTC

A fix has been rolled out to the majority of models and the error rate has returned to normal levels. We will continue to monitor and address any further occurrences of the errors.

Predictions affected by this incident (many on T4 GPUs, CPU, and a subset of A100s) will appear to be stuck in the starting phase for an extended period of time. These predictions can safely be cancelled and reattempted.
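
One possible way to find such stuck predictions in your own records is sketched below. The 30-minute threshold, the dictionary keys, and the `"starting"` status string are assumptions chosen for illustration. The incident text does not define how long "an extended period" is, so adjust the threshold to your models' usual start times.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: the incident text gives no exact duration.
STUCK_AFTER = timedelta(minutes=30)


def stuck_prediction_ids(predictions, now, stuck_after=STUCK_AFTER):
    """Return IDs of predictions that have sat in "starting" for too long.

    `predictions` is an iterable of dicts with the assumed keys
    "id", "status", and "created_at" (a timezone-aware datetime).
    """
    return [
        p["id"]
        for p in predictions
        if p["status"] == "starting" and now - p["created_at"] > stuck_after
    ]
```

Each returned ID is a candidate for cancelling and resubmitting. Passing `now` explicitly keeps the check deterministic and easy to test.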

3 previous updates

Jan 18, 2024
1 incident

Erroneous Alerts of Website Down

Resolved Jan 18 at 12:37am UTC

There have been a number of automated reports that the Replicate website has gone down and returned to service. We are investigating the automated systems, but we do not see any current outages outside of the automated tooling.

Jan 09, 2024
1 incident

Model Start times

Degraded

Resolved Jan 09 at 05:23pm UTC

The incident is in the process of clearing and model start times have returned to normal.

1 previous update

Jan 05, 2024
1 incident

Slow Model Startup Time

Degraded

Resolved Jan 05 at 07:13pm UTC

At this time all models in the backlog have finished startup. We will continue to monitor the situation closely.

1 previous update

Jan 02, 2024
1 incident

Boot time issues for Models

Resolved Jan 02 at 04:46pm UTC

We are aware of an event in one of our regions that resulted in extended boot times for many models. At this time the incident has resolved. We are actively researching the root cause and will work to build remediations that limit the impact of similar events in the future.