Previous incidents

December 2024
No incidents reported
November 2024
Nov 26, 2024
1 incident

API errors and request delays

Degraded

Resolved Nov 26 at 07:16pm UTC

We're seeing healthy behavior since our upstream provider applied further fixes in the last hour. We will share further details about what happened once they are available.

5 previous updates

Nov 19, 2024
1 incident

Flux Dev Inference Delays

Degraded

Resolved Nov 19 at 01:11am UTC

The backlog of predictions has been worked through and we are seeing normal prediction times return. Thank you for your patience.

2 previous updates

Nov 17, 2024
1 incident

Flux Dev Prediction Delays

Degraded

Resolved Nov 17 at 02:35am UTC

As of 02:23 UTC on Nov 17, the backlog has been processed and Flux Dev is handling requests as expected.

1 previous update

Nov 14, 2024
1 incident

Predictions failing for H100 hardware

Degraded

Resolved Nov 14 at 03:19am UTC

We have identified a hardware failure and have isolated the affected node(s). We are seeing a return to normal service for H100-targeted predictions and trainings.

1 previous update

Nov 11, 2024
1 incident

Flux Pro, Recraft, and Ideogram failed predictions

Resolved Nov 11 at 05:10pm UTC

We identified an internal component that caused errors with Flux Pro, Recraft, and Ideogram models. The errors occurred between approximately 1540 UTC and 1709 UTC on November 11, 2024.

As of the time this status update is published, the internal component has been rolled back and we are seeing normal prediction handling for impacted models.

Nov 08, 2024
1 incident

Prediction delays for black-forest-labs/flux-1.1-pro and meta/meta-llama-3-70...

Resolved Nov 08 at 07:05pm UTC

Between 16:30 and 19:00 UTC, predictions sent to flux-1.1-pro and meta-llama-3-70b-instruct were delayed by up to 1 hour. This was the result of a rollout of an internal component that broke a small number of models, which we then rolled back. Due to the high volume of predictions handled by these two models, the backlog grew fairly quickly, but we have now caught up with ...

Nov 06, 2024
2 incidents

H100 Hardware Queueing

Degraded

Resolved Nov 06 at 08:34pm UTC

After evaluation, the impacted queues (predictions submitted prior to the migration to the alternate region) are being truncated.

You will not be billed for predictions that are dropped in this manner; however, those predictions may appear as "in process" or "queued" for a period of time until the platform automation identifies them as dropped. It is safe to cancel and/or resubmit predictions impacted in this manner (see the sketch below).

This truncation impacts a few thousand total predictions across all models target...
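For reference, a minimal sketch of checking and cancelling an affected prediction via the HTTP predictions endpoints. It assumes a REPLICATE_API_TOKEN environment variable and a prediction ID you already hold; adapt it to your own client or SDK.

```python
# Minimal sketch: check a possibly-dropped prediction and cancel it if still queued.
# Assumes REPLICATE_API_TOKEN is set and uses the public predictions endpoints.
import os
import requests

API = "https://api.replicate.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"}

def status(prediction_id: str) -> str:
    """Return the current status of a prediction (e.g. 'starting', 'processing', 'succeeded')."""
    r = requests.get(f"{API}/predictions/{prediction_id}", headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["status"]

def cancel(prediction_id: str) -> None:
    """Cancel a prediction that is still queued or in process."""
    r = requests.post(f"{API}/predictions/{prediction_id}/cancel", headers=HEADERS, timeout=30)
    r.raise_for_status()

# Example: if a prediction submitted before the migration is still shown as queued
# or in process, it is safe to cancel it and resubmit the same input.
# pred_id = "..."  # hypothetical ID of an affected prediction
# if status(pred_id) in ("starting", "processing"):
#     cancel(pred_id)
```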

5 previous updates

File streaming not working

Degraded

Resolved Nov 06 at 11:27am UTC

Our fix has rolled out and file output streaming is now working again.

1 previous update

October 2024
Oct 29, 2024
2 incidents

Autoscaling Impacted for A40 and A100 class GPUs

Degraded

Resolved Oct 29 at 10:05pm UTC

A40 scaling has returned to normal time frames.

3 previous updates

H100 model serving down

Downtime

Resolved Oct 29 at 05:46pm UTC

We have moved traffic for the models impacted (H100 target hardware) back to the H100 class GPUs. Predictions and trainings targeting H100 class GPUs have returned to normal.

4 previous updates

Oct 27, 2024
1 incident

Instability and delays for H100s

Degraded

Resolved Oct 27 at 06:52am UTC

We're processing work without problems once again (as of about 06:30 UTC) but are continuing to investigate the source of the instability.

1 previous update

Oct 24, 2024
1 incident

Some models failing to run predictions

Degraded

Resolved Oct 24 at 07:00am UTC

[This is a retrospective status update published at 08:00 UTC]

We identified the problem: we had rolled out a version of cog to our serving cluster that reintroduced a bug we'd previously fixed. We have now rolled back that change.

1 previous update

Oct 22, 2024
1 incident

Autoscaling impacted for A100 class hardware

Degraded

Resolved Oct 22 at 04:18pm UTC

The underlying platform disruption has been resolved. Scaling for A100 hardware has returned to normal.

1 previous update

Oct 18, 2024
1 incident

503s on Replicate Files API

Degraded

Resolved Oct 18 at 08:13pm UTC

We have deployed a permanent fix for the 503 errors from the files-api.

2 previous updates

Oct 17, 2024
1 incident

Prediction failures for A100 class hardware

Degraded

Resolved Oct 17 at 01:51am UTC

We have rolled out a change that resolved the failure cases.

1 previous update

Oct 15, 2024
2 incidents

Delay for predictions targeting A100 GPUs

Degraded

Resolved Oct 16 at 03:27am UTC

Our infrastructure provider is continuing to work on a network problem, but mitigations appear to be allowing workloads to flow normally.

5 previous updates

Predictions degraded between two data centers

Degraded

Resolved Oct 15 at 12:38am UTC

We have removed an upstream from our API load balancer that appears to account for all of the timeout errors, and we are seeing an immediate improvement.

1 previous update

Oct 13, 2024
1 incident

Flux-dev fine-tuning outage

Degraded

Resolved Oct 13 at 12:25pm UTC

We truncated the queues, so in-flight fine-tunes will not complete and you will not be charged for them.

Fine-tunes are now working normally.

2 previous updates

Oct 09, 2024
3 incidents

Issues booting models on A40 hardware

Degraded

Resolved Oct 09 at 04:34pm UTC

At this time all but a handful of instances have recovered, and prediction serving should be normal for the A40 hardware type.

We expect the remaining instances (low single digits) to be running within the next few minutes.

4 previous updates

Flux-schnell degraded

Degraded

Resolved Oct 09 at 03:28pm UTC

The queue is now drained and performance is back to normal levels.

3 previous updates

A40 workloads disrupted

Downtime

Resolved Oct 09 at 05:45am UTC

We've deployed a fix. Things look to be stabilizing but we are continuing to monitor.

1 previous update