Previous incidents
Web and predictions degraded
Resolved Sep 29 at 04:12pm UTC
We have now fully resolved the issues, and the API and the replicate.com website are fully operational.
3 previous updates
Predictions and training degraded for one cloud provider
Resolved Sep 28 at 02:52pm UTC
The API and website are now working as expected for predictions and trainings.
1 previous update
Degraded API / API Errors
Resolved Sep 27 at 07:31pm UTC
We have identified a problematic ingress pod and forced it to reschedule. The API and website are now working as expected for predictions and trainings.
1 previous update
Website downtime / API degraded
Resolved Sep 27 at 05:56pm UTC
We have rolled back the problematic change. Website functionality has been restored and API error rate has returned to normal.
1 previous update
Slow start on some predictions and trainings (A40 and some A100)
Resolved Sep 21 at 09:06pm UTC
We have worked through the pending predictions and trainings and now see normal start times.
4 previous updates
System unavailable
Resolved Sep 21 at 08:37pm UTC
We have recovered our caching service and see predictions and training succeeding.
1 previous update
Temporary capacity issues with 8xA40 hardware type
Resolved Sep 21 at 12:39am UTC
We resolved the capacity issues.
2 previous updates
Primary database outage
Resolved Sep 19 at 09:17pm UTC
Both the API and web are now back to normal. Predictions and trainings are functioning as expected.
We are continuing to monitor things.
2 previous updates
API degraded
Resolved Sep 18 at 04:50pm UTC
API is now behaving normally.
1 previous update
Web and predictions degraded
Resolved Sep 08 at 12:37pm UTC
Everything is resolved and back to normal.
During the downtime, predictions completed normally in the API but were not persisted.
1 previous update
Degraded Prediction and Training Start Times
Resolved Sep 07 at 08:06pm UTC
The issue with the upstream provider has been resolved. Predictions and Trainings are expected to be starting within normal timeframes.
1 previous update
Degraded Prediction Handling
Resolved Sep 06 at 04:15pm UTC
Prediction processing and predictions are working as expected now.
1 previous update
Issues starting predictions
Resolved Sep 01 at 08:45pm UTC
Everything should be working normally at this time.
4 previous updates
Issues scheduling to certain hardware
Resolved Aug 19 at 11:41pm UTC
Thank you for your patience. At this time, all hung workloads targeted for the T4 hardware should no longer be stuck in the starting phase.
2 previous updates
Replicate Web Down
Resolved Aug 18 at 03:22pm UTC
Engineers have rolled back a change to the website, and the website should now be responding as expected.
4 previous updates
Website and API Outage
Resolved Aug 15 at 07:28pm UTC
Reverting the identified change and purging known bad cache values has resolved the elevated error rate in the API service. The API and website should be responding as expected at this time.
2 previous updates
Delays starting some models
Resolved Aug 11 at 03:13pm UTC
We believe that as of a few minutes ago the last customer impact from this issue has been resolved and all queues have cleared. To help you correlate this incident with any issues you may have seen: as far as we can tell the earliest customer impact from this incident started at about 11:00 UTC today.
3 previous updates
Models not booting
Resolved Aug 09 at 10:14pm UTC
The fix has been rolled out for all models.
2 previous updates
520 error responses from API
Resolved Aug 03 at 01:48pm UTC
We've identified the source of the errors -- a global load balancing service appears to have been misbehaving -- and made changes to how we serve api.replicate.com to mitigate the problem. As of a few minutes ago, we are no longer serving 520 error responses to customers.
1 previous update
Replicate website unavailable
Resolved Aug 02 at 04:48pm UTC
We're back! We pushed a bad change and have rolled it back. Sorry for the inconvenience.
1 previous update
Prediction requests failing
Resolved Jul 31 at 12:15pm UTC
All prediction requests are now responding normally. We're still investigating the underlying cause.
1 previous update
API errors/timeouts
Resolved Jul 28 at 06:10pm UTC
The API is fully recovered. Unfortunately we are still at least partially in the dark about what triggered these problems. We're continuing to investigate.
2 previous updates
API errors/timeouts
Resolved Jul 28 at 05:32am UTC
Services have recovered. We'll be following up with our provider to understand how the scope of the planned maintenance expanded to affect customer workloads.
4 previous updates
API errors
Resolved Jul 26 at 10:51am UTC
We've identified a service that was starved of compute resources and addressed that problem. Service has been restored.
1 previous update
Prediction creation errors
Resolved Jul 19 at 06:35pm UTC
We've restored service to the queueing system and predictions are flowing again.
2 previous updates