Degraded

Delay for predictions targeting A100 GPUs

Oct 15 at 11:47pm UTC
Affected services
Prediction serving

Resolved
Oct 16 at 03:27am UTC

Our infrastructure provider is continuing to work on a network problem, but mitigations appear to be allowing workloads to flow normally.

Updated
Oct 16 at 03:09am UTC

We are seeing improved network performance, which is having a positive impact on scaling speed and queue wait times.

Updated
Oct 16 at 02:00am UTC

Our provider has made changes to help mitigate the slow boot times on A100 hardware instances. Some of these instances have successfully booted and handled predictions.

We continue to work towards identifying the underlying issue and correcting it for all A100-class instances.

Updated
Oct 16 at 01:18am UTC

We are continuing to work with our infrastructure provider, as resources are still being freed much more slowly than expected. We are also investigating a drop in network throughput.

Updated
Oct 16 at 12:35am UTC

We are continuing to work with our infrastructure provider to free up resources, and we are beginning to see workloads scale out correctly.

Created
Oct 15 at 11:47pm UTC

Predictions targeting A100 GPUs are delayed. We are running close to capacity because resources for completed work are being freed slowly, and we are investigating.