Previous incidents

March 2025
Mar 22, 2025
1 incident

Elevated Error Rates on API

Resolved Mar 22 at 04:31am UTC

We noticed elevated error rates (500-class responses) on our API. Investigation found that one of the API endpoints behind the primary load balancer was having trouble making requests to one of our serving regions.

Our engineers have temporarily removed this API endpoint from production traffic while we investigate.

The error rate has returned to normal.

Mar 13, 2025
1 incident

Delays for L40S hardware

Degraded

Resolved Mar 13 at 05:12pm UTC

We are back under capacity for L40S hardware. Thanks for waiting!

1 previous update

Mar 11, 2025
1 incident

Delays for models on L40S hardware type

Degraded

Resolved Mar 11 at 11:31pm UTC

We are back below capacity limits for the L40S hardware type. Thanks for your patience!

1 previous update

Mar 10, 2025
1 incident

Delays for predictions on L40S hardware

Degraded

Resolved Mar 10 at 10:29pm UTC

Most models running on L40S hardware should no longer be experiencing delays. We are still seeing a handful of models unable to set up due to download rate limiting from a few external providers, but we will continue working on that as a separate problem. Thanks for waiting!

2 previous updates

Mar 08, 2025
1 incident

Disruption of Prediction Serving

Degraded

Resolved Mar 08 at 08:26am UTC

The backlog is being worked through, and all services have now been restored to full functionality.

Again, thank you for your patience.

5 previous updates

February 2025
Feb 24, 2025
2 incidents

Prediction Serving Disruption

Resolved Feb 24 at 03:31pm UTC

Replicate was alerted to a brief issue with prediction creation, update, and completion. The issue lasted about 5 minutes, starting at 2025-02-24 15:22:30 UTC.

A database update briefly delayed the persisting of data. The Replicate platform has now resumed normal operations.

Webhook delivery impacted on CPU, L40S and H100 hardware

Degraded

Resolved Feb 24 at 11:43am UTC

Things have been stable for 15 minutes now. We believe this to be resolved.

3 previous updates

Feb 21, 2025
1 incident

Webhook delivery degraded for A100 hardware

Degraded

Resolved Feb 21 at 05:23pm UTC

Webhooks are now being delivered in a timely fashion. Thanks for your patience!

1 previous update

Feb 19, 2025
2 incidents

High capacity utilization

Degraded

Resolved Feb 19 at 08:07am UTC

We are back at full capacity. Thanks for your patience!

3 previous updates

Some models failing to set up on A100 hardware

Degraded

Resolved Feb 19 at 01:49am UTC

The rollback of the suspected misconfiguration is complete and all queues have recovered. Thanks for your patience!

1 previous update

Feb 07, 2025
1 incident

Predictions degraded for L40S, H100, and CPU hardware types

Degraded

Resolved Feb 07 at 06:55pm UTC

We are now caught up and running below capacity. Thanks for your patience!

3 previous updates

Feb 06, 2025
2 incidents

API instability for L40S, H100 and CPU workloads

Degraded

Resolved Feb 06 at 03:20pm UTC

This issue appears to have been caused by another bandwidth spike, which resulted partly from our incident earlier today. The issue has now been resolved. We are working to prevent incidents of this kind from recurring.

1 previous update

Setup failures on L40S and H100 hardware

Degraded

Resolved Feb 06 at 10:45am UTC

This incident is now resolved.

4 previous updates

Feb 03, 2025
1 incident

Prediction creation unavailable for L40S and H100 hardware

Resolved Feb 03 at 06:30am UTC

The cache used by the API for predictions was misconfigured for about 20 minutes, from 20:34 UTC until a rollback completed at 20:56 UTC. Models using the L40S and H100 hardware types were affected. While the cache was misconfigured, prediction creation was severely limited, and many API requests received status 503 responses.

January 2025
Jan 30, 2025
1 incident

Instability and delays for H100 and L40S

Degraded

Resolved Jan 30 at 09:56am UTC

The networking issue with our provider was resolved at 09:40 UTC, and all requests have been running normally since then.

2 previous updates

Jan 15, 2025
1 incident

Billing and metric delays

Degraded

Resolved Jan 15 at 04:29pm UTC

The background jobs are running again, and we caught up to the present at about 16:09 UTC (20 minutes ago).

1 previous update

Jan 09, 2025
1 incident

Dashboard inaccessible due to redirect

Degraded

Resolved Jan 09 at 04:37pm UTC

The redirect has been reverted and the dashboard should be accessible again.

1 previous update