Previous incidents
Models Affected by Hugging Face Hub Outage
Resolved Feb 29 at 01:57am UTC
We are seeing HF Hub return to full functionality. The backlog of models blocked on interacting with Hugging Face (downloading or otherwise) has cleared.
1 previous update
Create prediction/training API unavailable
Resolved Feb 26 at 01:13pm UTC
From approximately 13:05 to 13:11 UTC, our prediction and training creation endpoints were unavailable. Existing predictions and trainings were unaffected, but no new predictions or trainings could be created.
The problem has since been resolved.
Dropped predictions
Resolved Feb 22 at 05:46pm UTC
This issue has been resolved and predictions are now flowing again.
1 previous update
API errors
Resolved Feb 22 at 09:53am UTC
The problems resolved automatically at 09:41 UTC. We are monitoring the situation.
1 previous update
Models stuck booting
Resolved Feb 19 at 04:27pm UTC
The models stuck in booting have been fixed, and inference and trainings are both running normally again.
2 previous updates
Model Startup Errors / Runtime Download Errors
Resolved Feb 16 at 11:52pm UTC
Well, lots of things went wrong today. We've identified what we think are the last few things that were broken and fixed them. A newly rolled out internal queueing service didn't allow traffic from model pods, and that caused our prediction throughput to be far lower than normal, which was impeding recovery.
For an incident of this magnitude we fully understand that many of our customers will want to know what happened. We're starting to piece it together and we'll have a proper write-up f...
5 previous updates
Errors downloading weights on model startup
Resolved Feb 10 at 11:16pm UTC
We've not seen any failures after 22:50 UTC, so we're calling this incident resolved.
Our investigation revealed that internal DNS lookup failures put a storage cache subsystem into a broken state. Next week we'll be looking into how to make our systems more robust in situations like this one.
Thank you for your patience.
3 previous updates
Errors downloading weights on model startup
Resolved Feb 08 at 10:00pm UTC
Mitigations are in place, and models are once again downloading weights as expected.
All model startups within the affected region have returned to normal.
1 previous update
Trained versions failing setup
Resolved Feb 06 at 06:30pm UTC
We've resolved this issue and have re-enabled all affected versions.
1 previous update
Delayed Model Start
Resolved Jan 29 at 06:31pm UTC
The backlog of models has been cleared. Model start times are back to expected levels.
3 previous updates
Errors completing predictions
Resolved Jan 22 at 07:34pm UTC
A fix has been rolled out to the majority of models, and error rates have returned to normal levels. We will continue to monitor and address any further occurrences of the errors.
Predictions affected by this incident (many on T4 GPUs, CPU, and a subset of A100s) will appear to be stuck in the starting phase for an extended period of time. These predictions can safely be cancelled and reattempted.
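For anyone scripting the cleanup, a minimal sketch along these lines (assuming the replicate Python client, with placeholder prediction ID, version, and input; not part of this status update) cancels a stuck prediction and resubmits it:

```python
# Hedged sketch, not an official remediation script: uses the replicate
# Python client to cancel a prediction stuck in "starting" and resubmit it.
# Assumes REPLICATE_API_TOKEN is set; the ID, version, and input below are
# placeholders to replace with your own values.
import replicate

prediction_id = "your-prediction-id"        # placeholder
model_version = "your-model-version-id"     # placeholder version ID
model_input = {"prompt": "example input"}   # placeholder: the original input

prediction = replicate.predictions.get(prediction_id)

# Only cancel if it really is still stuck in the starting phase.
if prediction.status == "starting":
    prediction.cancel()
    # Reattempt the same work as a fresh prediction.
    retry = replicate.predictions.create(version=model_version, input=model_input)
    print(retry.id, retry.status)
```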
3 previous updates
Erroneous Alerts of Website Down
Resolved Jan 18 at 12:37am UTC
There have been a number of automated reports that the Replicate website has gone down and then returned to service. We are investigating the automated alerting but do not see any current outages; the issue appears to be limited to the automated tooling itself.
Model Start times
Resolved Jan 09 at 05:23pm UTC
The incident is in the process of clearing, and model start times have returned to normal.
1 previous update
Slow Model Startup Time
Resolved Jan 05 at 07:13pm UTC
At this time all models in the backlog have finished starting up. We will continue to monitor the situation closely.
1 previous update
Boot time issues for Models
Resolved Jan 02 at 04:46pm UTC
We are aware of an event in one of our regions that resulted in extended boot times for many models. The incident has since been resolved. We are actively researching the root cause and will work to build remediations that limit the impact of future events of this kind.
Intermittent Failures due to networking
Resolved Dec 26 at 06:58pm UTC
The error rate has subsided, and models have returned to their previous startup and runtime behavior. We are working with our providers to mitigate the impact of future incidents like this.
3 previous updates
Models not starting
Resolved Dec 22 at 08:49pm UTC
The fix has been deployed and all model starts should be back to normal.
3 previous updates
Model Setup Failures
Resolved Dec 22 at 02:15am UTC
All services are working as expected, and all temporary workarounds have been rolled back, restoring normal behavior. Additionally, we have made improvements that will allow us to respond more quickly and add mitigations during any future incidents of this nature.
2 previous updates
Model setup failing
Resolved Dec 21 at 12:39am UTC
Code has been rolled back and models are no longer failing setup due to this issue.
1 previous update
Models not booting
Resolved Dec 19 at 01:46pm UTC
All queues have been processed and service should be back to normal. Sorry for the interruption, folks.
3 previous updates
Slow Model Startup
Resolved Dec 06 at 10:14pm UTC
We have cleared up the backlog of models that were seeing slow starts.
1 previous update
NVIDIA Driver Issues
Resolved Dec 02 at 03:15pm UTC
We have identified a few nodes within one of our regions where NVIDIA drivers were not installed. We have isolated these nodes from further workload scheduling (both inference and training) and will recycle them.
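For illustration, assuming a Kubernetes-style scheduler (not stated in this update), isolating a node from further scheduling amounts to marking it unschedulable, roughly:

```python
# Illustrative sketch only: assumes a Kubernetes-style cluster, which this
# update does not confirm. Marking a node unschedulable ("cordoning") keeps
# new inference/training workloads off it until it can be recycled.
from kubernetes import client, config

config.load_kube_config()        # use load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

node_name = "example-gpu-node"   # placeholder node name

# Equivalent to `kubectl cordon <node>`: set spec.unschedulable = True.
v1.patch_node(node_name, {"spec": {"unschedulable": True}})
```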
Container Images pull delays
Resolved Dec 01 at 10:34pm UTC
Thank you for your patience. We have cleared up the remaining backlog of pending workloads. Inference and trainings are now running as expected for all hardware types.
2 previous updates