Previous incidents
Elevated Error Rates on API
Resolved Mar 22 at 04:31am UTC
We noticed elevated error rates (500-class responses) on our API. Investigation revealed that one of the API endpoints behind the primary load balancer was having issues making requests to one of our serving regions.
Our engineers have temporarily removed this API endpoint from production traffic while we investigate.
The elevated error rate has returned to normal.
Delays for L40S hardware
Resolved Mar 13 at 05:12pm UTC
We are back under capacity for L40S hardware. Thanks for waiting!
Delays for models on L40S hardware type
Resolved Mar 11 at 11:31pm UTC
We are back below capacity limits for the L40S hardware type. Thanks for your patience!
Delays for predictions on L40S hardware
Resolved Mar 10 at 10:29pm UTC
Most models running on L40S hardware should not be experiencing delays. We are still seeing a handful of models unable to set up due to download rate limiting from a few external providers, and we'll continue working on that as a separate problem. Thanks for waiting!
Disruption of Prediction Serving
Resolved Mar 08 at 08:26am UTC
The backlog has been worked through, and all services have been restored to full functionality.
Again, thank you for your patience.
Prediction Serving Disruption
Resolved Feb 24 at 03:31pm UTC
Replicate was alerted to a brief issue with prediction creation, update, and completion, affecting a window of about 5 minutes starting at 2025-02-24 15:22:30 UTC.
A database update caused a brief disruption that delayed persisting data. The Replicate platform has since resumed normal operations.
Webhook delivery impacted on CPU, L40S and H100 hardware
Resolved Feb 24 at 11:43am UTC
Things have been stable for 15 minutes now. We believe this to be resolved.
Webhook delivery degraded for A100 hardware
Resolved Feb 21 at 05:23pm UTC
Webhooks are now being delivered in a timely fashion. Thanks for your patience!
High capacity utilization
Resolved Feb 19 at 08:07am UTC
We are back at full capacity. Thanks for your patience!
Some models failing to set up on A100 hardware
Resolved Feb 19 at 01:49am UTC
The rollback of the suspected misconfiguration is complete and all queues have recovered. Thanks for your patience!
Predictions degraded for L40S, H100, and CPU hardware types
Resolved Feb 07 at 06:55pm UTC
We are now caught up and running below capacity. Thanks for your patience!
API instability for L40S, H100, and CPU workloads
Resolved Feb 06 at 03:20pm UTC
This issue appears to have been caused by another bandwidth spike, partly as a result of our incident earlier today. The issue has now been resolved. We are working to prevent incidents of this kind from recurring.
Setup failures on L40S and H100 hardware
Resolved Feb 06 at 10:45am UTC
This incident is now resolved.
Prediction creation unavailable for L40S and H100 hardware
Resolved Feb 03 at 06:30am UTC
The cache used by the API for predictions was misconfigured for a period of ~20 minutes beginning at 20:34 UTC until a rollback completed at 20:56 UTC. Models using the L40S and H100 hardware types were affected. During the period of misconfiguration, prediction creation was severely limited, resulting in many API responses with status 503.
Instability and delays for H100 and L40S
Resolved Jan 30 at 09:56am UTC
The networking issue with our provider was resolved at 09:40 UTC, and all requests have been running normally since then.
Billing and metric delays
Resolved Jan 15 at 04:29pm UTC
The background jobs are running again, and we've caught up to the present as of about 16:09 UTC (20 minutes ago).
Dashboard inaccessible due to redirect
Resolved Jan 09 at 04:37pm UTC
The redirect has been reverted and the dashboard should be accessible again.