Previous incidents
H100 Hardware Queueing
Resolved Nov 06 at 08:34pm UTC
After evaluation, the impacted queues (predictions submitted prior to migration to the alternate region) are being truncated.
You will not be billed for predictions dropped in this manner; however, they may appear as "in process" or "queued" for a period of time until the platform automation identifies them as dropped. It is safe to cancel and/or resubmit predictions impacted in this manner (see the sketch below).
This truncation impacts a few thousand total predictions across all models target...
5 previous updates
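For anyone cleaning up after this incident, the following is a minimal sketch of cancelling stuck predictions through the Replicate HTTP API. It assumes a REPLICATE_API_TOKEN environment variable; the endpoint paths and Bearer auth follow the public API, while the blanket "cancel anything still in flight" filter is only illustrative and should be narrowed to the predictions you know were affected.

```python
# Sketch: list recent predictions and cancel any still reported as in-flight.
# Assumes REPLICATE_API_TOKEN is set; adjust the status filter to target only
# the predictions you know were impacted by this incident.
import os
import requests

API = "https://api.replicate.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"}

def cancel_stuck_predictions():
    resp = requests.get(f"{API}/predictions", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    for prediction in resp.json()["results"]:
        # "starting" and "processing" are the in-flight statuses the API reports.
        if prediction["status"] in ("starting", "processing"):
            cancel = requests.post(
                f"{API}/predictions/{prediction['id']}/cancel",
                headers=HEADERS,
                timeout=30,
            )
            cancel.raise_for_status()
            print(f"cancelled {prediction['id']}")

if __name__ == "__main__":
    cancel_stuck_predictions()
```

Cancelled predictions can then be resubmitted with the same model version and input once the queues are healthy.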
File streaming not working
Resolved Nov 06 at 11:27am UTC
Our fix has rolled out and file output streaming is now working again.
1 previous update
Autoscaling Impacted for A40 and A100 class GPUs
Resolved Oct 29 at 10:05pm UTC
A40 scaling has returned to normal time frames.
3 previous updates
H100 model serving down
Resolved Oct 29 at 05:46pm UTC
We have moved traffic for the models impacted (H100 target hardware) back to the H100 class GPUs. Predictions and trainings targeting H100 class GPUs have returned to normal.
4 previous updates
Instability and delays for H100s
Resolved Oct 27 at 06:52am UTC
We're processing work without problems once again (as of about 06:30 UTC) but are continuing to investigate the source of instability.
1 previous update
Some models failing to run predictions
Resolved Oct 24 at 07:00am UTC
[This is a retrospective status update published at 08:00 UTC]
We identified the problem -- we had rolled out a version of cog to our serving cluster that reintroduced a bug we'd previously fixed -- and have now rolled back that change.
1 previous update
Autoscaling impacted for A100 class hardware
Resolved Oct 22 at 04:18pm UTC
The underlying platform disruption has been resolved. Scaling for A100 hardware has returned to normal.
1 previous update
503s on Replicate Files API
Resolved Oct 18 at 08:13pm UTC
We have deployed a permanent fix for the 503s from the Files API.
2 previous updates
Prediction failures for A100 class hardware
Resolved Oct 17 at 01:51am UTC
We have rolled out a change that resolved the failure cases.
1 previous update
Delay for predictions targeting A100 GPUs
Resolved Oct 16 at 03:27am UTC
Our infrastructure provider is continuing to work on a network problem, but mitigations appear to be allowing workloads to flow normally.
5 previous updates
Predictions degraded between two data centers
Resolved Oct 15 at 12:38am UTC
We have removed an upstream from our API load balancer that appears to account for all of the timeout errors, and we are seeing an immediate improvement.
1 previous update
Flux-dev fine-tuning outage
Resolved Oct 13 at 12:25pm UTC
We truncated the queues, so in-flight fine-tunes will not complete and you will not be charged for them.
Fine-tunes are now working normally.
2 previous updates
Issues booting models on A40 hardware
Resolved Oct 09 at 04:34pm UTC
At this time, all but a handful of instances have recovered and prediction serving should be normal for the A40 hardware type.
We expect the remaining (low single digit) number of instances to be running within the next few minutes.
4 previous updates
Flux-schnell degraded
Resolved Oct 09 at 03:28pm UTC
The queue is now drained and performance is back to normal levels.
3 previous updates
A40 workloads disrupted
Resolved Oct 09 at 05:45am UTC
We've deployed a fix. Things look to be stabilizing but we are continuing to monitor.
1 previous update
Prediction serving degraded
Resolved Sep 27 at 12:25am UTC
Upon further investigation, we were unable to work through the backlog of predictions. Backlogged predictions have been cancelled. It will take some time before these prediction IDs report as failed in the Replicate web console.
Users may resubmit any of these cancelled predictions (see the sketch below).
All predictions submitted since the last update (Sep 26, 2024 at 11:52pm UTC) are unaffected by this cancellation.
2 previous updates
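For the resubmission mentioned above, the sketch below shows one way to re-create a cancelled prediction with its original version and input via the Replicate HTTP API. It assumes a REPLICATE_API_TOKEN environment variable, and the prediction ID is a placeholder.

```python
# Sketch: look up a cancelled prediction and resubmit it with the same
# version and input. Assumes REPLICATE_API_TOKEN is set; the prediction ID
# passed to resubmit() is a placeholder.
import os
import requests

API = "https://api.replicate.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"}

def resubmit(prediction_id: str) -> dict:
    old = requests.get(f"{API}/predictions/{prediction_id}", headers=HEADERS, timeout=30)
    old.raise_for_status()
    old = old.json()
    # The API reports cancelled predictions with the status string "canceled".
    if old["status"] != "canceled":
        raise RuntimeError(f"{prediction_id} is {old['status']}; nothing to resubmit")
    new = requests.post(
        f"{API}/predictions",
        headers=HEADERS,
        json={"version": old["version"], "input": old["input"]},
        timeout=30,
    )
    new.raise_for_status()
    return new.json()

if __name__ == "__main__":
    print(resubmit("your-prediction-id")["id"])
```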
Website availability problems
Resolved Sep 26 at 05:15pm UTC
Queues for predictions remain fairly high for black-forest-labs/flux-schnell and meta/meta-llama-3.1-405b-instruct. All other models should be behaving normally.
2 previous updates
Website unavailable
Resolved Sep 13 at 03:18pm UTC
Things have been running normally for at least the last 10 minutes. This incident was -- ironically -- triggered by work we're doing to improve the overall performance and reliability of our primary database. We apologise for the disruption.
3 previous updates
Prediction Service Normal
Resolved Sep 01 at 07:41pm UTC
We were alerted to a potential issue with prediction serving. Upon investigation, we found that one of the providers we use for monitoring is experiencing an outage affecting some of our automated monitoring. We've taken steps to isolate the problematic monitors while the provider works to resolve the issue.