Previous incidents
Degraded Service
Resolved Apr 12 at 05:20am UTC
At this time, service has been restored. All inference (prediction serving) and model instance starts have returned to normal.
Degraded service
Resolved Apr 11 at 02:05pm UTC
Our systems indicate that the problem has been resolved. We will continue monitoring the situation.
A40 models scaling slowly
Resolved Mar 14 at 09:19am UTC
All but a very small slice of our A40 hardware is back online, and Replicate workloads are processing normally. We again thank you for your patience.
Errors within one region
Resolved Mar 06 at 04:06am UTC
Workloads across all regions are now running normally. We apologise for the disruption and will work to improve our ability to shift load between providers in situations like this one.
API, Inference, Web Page
Resolved Mar 05 at 06:54am UTC
We identified excessive load on our database. Shortly after the root cause was isolated, our engineering team disabled the problematic code paths.
Degradation of database responsiveness resulted in a general service outage beginning at approximately 06:23 UTC.
At this time, the API, Inference, and Web are fully functional. Any predictions that resulted in errors or were marked as failed during this window can be safely re-run.
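For customers who want to retry work from that window, the sketch below shows one way to find and re-run failed predictions with the public HTTP API. It is a minimal example only: it assumes the documented /v1/predictions list, get, and create endpoints, a REPLICATE_API_TOKEN environment variable, and placeholder window timestamps that you should replace with the actual outage window.

```python
# Minimal sketch: re-run predictions that failed during an outage window.
# Assumes the documented Replicate HTTP API (/v1/predictions) and a
# REPLICATE_API_TOKEN environment variable. The window timestamps below are
# placeholders; substitute the real incident window (UTC) before running.
import os
from datetime import datetime, timezone

import requests

API = "https://api.replicate.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"}

# Placeholder outage window (UTC); adjust the date and times to the incident.
WINDOW_START = datetime(2024, 3, 5, 6, 23, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 3, 5, 6, 54, tzinfo=timezone.utc)


def failed_in_window():
    """Yield full prediction objects that failed inside the outage window."""
    url = f"{API}/predictions"
    while url:
        page = requests.get(url, headers=HEADERS).json()
        for summary in page["results"]:
            created = datetime.fromisoformat(
                summary["created_at"].replace("Z", "+00:00")
            )
            if summary["status"] == "failed" and WINDOW_START <= created <= WINDOW_END:
                # The list endpoint returns summaries, so fetch the full record
                # to get the original input.
                yield requests.get(
                    f"{API}/predictions/{summary['id']}", headers=HEADERS
                ).json()
        url = page.get("next")  # follow pagination until exhausted


for prediction in failed_in_window():
    # Re-create the prediction with the same model version and input.
    retry = requests.post(
        f"{API}/predictions",
        headers=HEADERS,
        json={"version": prediction["version"], "input": prediction["input"]},
    )
    print(prediction["id"], "->", retry.json().get("id"))
```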
Models Affected by Hugging Face Hub Outage
Resolved Feb 29 at 01:57am UTC
We are seeing the Hugging Face Hub return to full functionality. The backlog of models blocked on interacting with Hugging Face (downloading or otherwise) has cleared, and those models have recovered.
Create prediction/training API unavailable
Resolved Feb 26 at 01:13pm UTC
From approximately 13:05 to 13:11 UTC, our prediction and training creation endpoints were unavailable. Existing predictions and trainings were unaffected, but no new predictions or trainings could be created.
The problem has since been resolved.
Dropped predictions
Resolved Feb 22 at 05:46pm UTC
This issue has been resolved and predictions are now flowing again.
API errors
Resolved Feb 22 at 09:53am UTC
The problems resolved automatically at 09:41 UTC. We are monitoring the situation.
Models stuck booting
Resolved Feb 19 at 04:27pm UTC
The models that were stuck booting have been fixed, and inference and trainings are both running normally again.
Model Startup Errors / Runtime Download Errors
Resolved Feb 16 at 11:52pm UTC
Well, lots of things went wrong today. We've identified what we think were the last few things that were broken and fixed them. A newly rolled-out internal queueing service didn't allow traffic from model pods, which caused our prediction throughput to be far lower than normal and impeded recovery.
For an incident of this magnitude, we fully understand that many of our customers will want to know what happened. We're starting to piece it together, and we'll have a proper write-up f...
Errors downloading weights on model startup
Resolved Feb 10 at 11:16pm UTC
We've not seen any failures after 22:50 UTC, so we're calling this incident resolved.
Our investigation revealed that internal DNS lookup failures put a storage cache subsystem into a broken state. Next week we'll be looking into how to make our systems more robust in situations like this one.
Thank you for your patience.
Errors downloading weights on model startup
Resolved Feb 08 at 10:00pm UTC
Mitigations are in place, and models are now downloading weights as expected.
All model startups within the affected region have returned to normal.
Trained versions failing setup
Resolved Feb 06 at 06:30pm UTC
We've resolved this issue and have re-enabled all affected versions.