Previous incidents

July 2024
Jul 25, 2024
1 incident

Llama3-70b-chat Delays

Degraded

Resolved Jul 25 at 11:44pm UTC

This has been resolved, and predictions should now be processed normally.


Jul 17, 2024
1 incident

Predictions on trained versions not starting

Degraded

Resolved Jul 17 at 04:36pm UTC

We've fixed the issue and predictions on trained versions are running again.


Jul 16, 2024
1 incident

Intermittent issues affecting some hardware types

Degraded

Resolved Jul 16 at 08:16pm UTC

Things are running normally as of about 15 minutes ago.


Jul 09, 2024
1 incident

API degradation

Degraded

Resolved Jul 09 at 12:15pm UTC

Service has been restored. Thanks for your patience!


Jul 03, 2024
1 incident

Llama 3 70b instruct model not processing predictions

Degraded

Resolved Jul 03 at 11:11am UTC

The model is processing predictions properly again, and the queue is empty.


June 2024
Jun 21, 2024
1 incident

Some models unavailable

Degraded

Resolved Jun 21 at 03:40pm UTC

Service has been restored as of a few minutes ago.


Jun 20, 2024
1 incident

Errors publishing model versions

Degraded

Resolved Jun 20 at 10:41pm UTC

Model version publishing is now working as expected.


Jun 04, 2024
1 incident

Errors with inference

Degraded

Resolved Jun 04 at 12:36am UTC

The inference issues were limited to select LLMs. The problematic code has been rolled back, and all inference should now be operating normally.


May 2024
May 30, 2024
1 incident

Problems booting models in one region

Degraded

Resolved May 30 at 07:03pm UTC

All outstanding issues have been resolved. Model boots and setups should be functioning normally again.


May 29, 2024
1 incident

Workloads in one of our clusters are backed up

Downtime

Resolved May 29 at 04:06pm UTC

All queues have been dealt with, and predictions and trainings are running smoothly once again.


May 22, 2024
1 incident

Degraded autoscaling performance

Degraded

Resolved May 22 at 02:05pm UTC

Backlogs have been cleared and all models are now running smoothly.


May 17, 2024
1 incident

5XX and slow responses

Degraded

Resolved May 17 at 12:09am UTC

The problem appears to have been caused by our API being unable to connect to one of its underlying data stores, most likely due to a networking interruption. The connection has recovered as of 00:02 UTC, and traffic is being served normally once again. We will continue to monitor.


May 09, 2024
1 incident

Webhooks not sending for Dreambooth trainings

Degraded

Resolved May 09 at 07:11pm UTC

Webhooks for Dreambooth trainings are working again.
