Delay for predictions targeting A100 GPUs
Resolved
Oct 16 at 03:27am UTC
Our infrastructure provider is continuing to work on a network problem, but mitigations appear to be allowing workloads to flow normally.
Affected services
Prediction serving
Updated
Oct 16 at 03:09am UTC
We are seeing improved network performance, which is speeding up scaling and reducing queue wait times.
Affected services
Prediction serving
Updated
Oct 16 at 02:00am UTC
Our provider has made changes to help mitigate the slow boot times of A100 hardware instances. Some of these instances have successfully booted and handled predictions.
We are continuing to work to identify the underlying issue and correct it for all A100-class instances.
Affected services
Prediction serving
Updated
Oct 16 at 01:18am UTC
We are continuing to work with our infrastructure provider, as we are still seeing longer-than-expected delays in freeing resources. We are also investigating a drop in network throughput.
Affected services
Prediction serving
Updated
Oct 16 at 12:35am UTC
We are continuing to work with our infrastructure provider to free up resources, and we are beginning to see workloads scale out correctly.
Affected services
Prediction serving
Created
Oct 15 at 11:47pm UTC
Predictions targeting A100 GPUs are delayed. We are investigating capacity constraints caused by delays in freeing resources for completed work, which are keeping us close to capacity.
Affected services
Prediction serving