Incidents | Replicate Incidents reported on status page for Replicate https://replicatestatus.com/ Prediction serving recovered https://replicatestatus.com/ Thu, 27 Mar 2025 15:36:44 +0000 https://replicatestatus.com/#be92083cca20d7ce513faf7ccce2c9d4c2033263ac28238edeaa884f53ab28bf Prediction serving recovered Prediction serving went down https://replicatestatus.com/ Thu, 27 Mar 2025 15:28:51 +0000 https://replicatestatus.com/#be92083cca20d7ce513faf7ccce2c9d4c2033263ac28238edeaa884f53ab28bf Prediction serving went down Elevated Error Rates on API https://replicatestatus.com/incident/532569 Sat, 22 Mar 2025 04:31:00 -0000 https://replicatestatus.com/incident/532569#1240a08808e4e5c14974738ebaa01aa13de6c356d29e9b04f7db9a8f03bd542b We noticed elevated error rates (500-class responses) on our API. Investigating the errors, we discovered that one of the APIs in the primary load balancer was having issues making requests to one of our serving regions. Our engineers have temporarily removed this API endpoint from production traffic while we investigate. The error rate has returned to normal.
Prediction serving recovered https://replicatestatus.com/ Wed, 19 Mar 2025 11:37:45 +0000 https://replicatestatus.com/#46680541743364575bfc709a6ea095ec662bf6243146222e9c91738b8f54abfd Prediction serving recovered Prediction serving went down https://replicatestatus.com/ Wed, 19 Mar 2025 11:36:08 +0000 https://replicatestatus.com/#46680541743364575bfc709a6ea095ec662bf6243146222e9c91738b8f54abfd Prediction serving went down Homepage recovered https://replicatestatus.com/ Mon, 17 Mar 2025 09:22:20 +0000 https://replicatestatus.com/#8dac8c9a435bc71b38e583c3aae7a04d4086b2e260ec84fade34c4c5056526aa Homepage recovered Homepage went down https://replicatestatus.com/ Mon, 17 Mar 2025 08:58:19 +0000 https://replicatestatus.com/#8dac8c9a435bc71b38e583c3aae7a04d4086b2e260ec84fade34c4c5056526aa Homepage went down Delays for L40S hardware https://replicatestatus.com/incident/527734 Thu, 13 Mar 2025 17:12:00 -0000 https://replicatestatus.com/incident/527734#f9a9da833cc93bcbceeff8ff58de15b4676aee47a328af32092d717db6776cde We are back under capacity for L40S hardware. Thanks for waiting! Delays for L40S hardware https://replicatestatus.com/incident/527734 Thu, 13 Mar 2025 16:31:00 -0000 https://replicatestatus.com/incident/527734#73a86161d2819381e2f7579dc84abe9580868ba9113a19680afb81edf77453a1 We are currently scheduled to capacity of our L40S hardware type. Any models running with this hardware type may experience delays in scale out and predictions. Delays for models on L40S hardware type https://replicatestatus.com/incident/526644 Tue, 11 Mar 2025 23:31:00 -0000 https://replicatestatus.com/incident/526644#cd15cb060f813e591b0dacd9b1266ae99366f95cf5b1560c923fdac1562c4262 We are back below capacity limits for the L40S hardware type. Thanks for your patience! 
Delays for models on L40S hardware type https://replicatestatus.com/incident/526644 Tue, 11 Mar 2025 21:49:00 -0000 https://replicatestatus.com/incident/526644#11764145b17107a4a38d49d223d7482b705f691b6feaac0294c4a72fad37a383 We are hitting capacity limits for the L40S hardware type, and we are also seeing that many of the running workloads are failing to set up within 10 minutes, seemingly due to download throttling. This is effectively the same problem we saw yesterday, but the mitigations that succeeded yesterday don't appear to be working today. Delays for predictions on L40S hardware https://replicatestatus.com/incident/525911 Mon, 10 Mar 2025 22:29:00 -0000 https://replicatestatus.com/incident/525911#b3768f02e04cf3248f786573599d613a2e9613f98939937b14d2a57930a3c3d9 Most models running on L40S hardware should not be experiencing delays. We are still seeing a handful of models unable to set up due to download rate limiting from a few external providers, but we're going to continue working on that as a separate problem. Thanks for waiting! Delays for predictions on L40S hardware https://replicatestatus.com/incident/525911 Mon, 10 Mar 2025 20:48:00 -0000 https://replicatestatus.com/incident/525911#a21e494e1ddc8cac31550c44c348c61f99f7549727f6441d6a95ba11d6849424 We are still in the process of recovering, and most models are working correctly at this point. The remaining models that have backed-up queues all appear to be hitting download rate limiting from external providers. We are working with the relevant providers to see if any interventions are available. Delays for predictions on L40S hardware https://replicatestatus.com/incident/525911 Mon, 10 Mar 2025 18:15:00 -0000 https://replicatestatus.com/incident/525911#9678695535769ab77457d7aa1ec0cbe173e8bc469d877b9d0558d288fa551836 We are handling a surge of demand for models running with the L40S hardware type, which is resulting in delayed predictions.
Disruption of Prediction Serving https://replicatestatus.com/incident/524687 Sat, 08 Mar 2025 08:26:00 -0000 https://replicatestatus.com/incident/524687#a8605ba6fce59fc8ca2754086a7fb52949b82d629dcfaecbd5102809fd552e52 All backlog is being worked through and at this point all services have been restored to full functionality. Again, thank you for your patience. Disruption of Prediction Serving https://replicatestatus.com/incident/524687 Sat, 08 Mar 2025 07:50:00 -0000 https://replicatestatus.com/incident/524687#b2dcab25b53cad48b0309a74c9e3b3d9f178f81678d6d80b190b7c865b46a842 At this time most services and prediction serving have returned to normal. We are seeing elevated and incorrect contention for the GPU types (L40S and H100), which is making scaling of new instances slower than expected. Disruption of Prediction Serving https://replicatestatus.com/incident/524687 Sat, 08 Mar 2025 07:21:00 -0000 https://replicatestatus.com/incident/524687#50c12da4b1f523c71c7f0b219cbbaacfb08479c94754755679abef997c426fde L40S, H100, and CPU hardware types continue to see degraded prediction performance. Subsequent failures are being seen as we work through our backlogs. We will provide updates as information becomes available.
Disruption of Prediction Serving https://replicatestatus.com/incident/524687 Sat, 08 Mar 2025 06:54:00 -0000 https://replicatestatus.com/incident/524687#42c9eae639a82b382622f24b4ef5f592998ed5c9d847b221bbeee26d9a9d0d30 As of this time, all services have been restored and the faulty hardware has been removed from production. Thank you for your patience. Disruption of Prediction Serving https://replicatestatus.com/incident/524687 Sat, 08 Mar 2025 06:37:00 -0000 https://replicatestatus.com/incident/524687#69edd278d356a520d98c019ba1b201317ff654c635a202d3bed2b3348017149d Critical services have been successfully migrated. As our final services come back online we are monitoring and expecting continued degraded service.
Disruption of Prediction Serving https://replicatestatus.com/incident/524687 Sat, 08 Mar 2025 06:25:00 -0000 https://replicatestatus.com/incident/524687#28fc9734d0453826629be74d42d5e45947a00bf0a1059cb84789b0143c3413f6 We are experiencing a disruption in prediction serving for the H200, L40S, and CPU hardware types due to a hardware failure. We are in the process of bringing critical services back online. Homepage recovered https://replicatestatus.com/ Fri, 07 Mar 2025 15:52:01 +0000 https://replicatestatus.com/#455949986525b1187641f4b38a3973a4fadc3f757dcbe9669d83636d6d3255c7 Homepage recovered Homepage went down https://replicatestatus.com/ Fri, 07 Mar 2025 15:51:25 +0000 https://replicatestatus.com/#455949986525b1187641f4b38a3973a4fadc3f757dcbe9669d83636d6d3255c7 Homepage went down Prediction Serving Disruption https://replicatestatus.com/incident/518198 Mon, 24 Feb 2025 15:31:00 -0000 https://replicatestatus.com/incident/518198#562aa6c7c0c0cef8f489a4f8888bd6c647bd45436881a4d1b144524834d1d765 Replicate was alerted to a brief issue with prediction creation, update, and completion. The issue lasted about 5 minutes, starting at 2025-02-24 15:22:30 UTC. A database update caused a brief disruption that delayed persisting data. At this time the Replicate platform has resumed normal operations.
Webhook delivery impacted on CPU, L40S and H100 hardware https://replicatestatus.com/incident/518050 Mon, 24 Feb 2025 11:43:00 -0000 https://replicatestatus.com/incident/518050#1a4f4a692e0a8ae5fa191f5d7dd5098a5fef6816f9a64bcb887b68dae8471225 Things have been stable for 15 minutes now. We believe this to be resolved. Webhook delivery impacted on CPU, L40S and H100 hardware https://replicatestatus.com/incident/518050 Mon, 24 Feb 2025 11:31:00 -0000 https://replicatestatus.com/incident/518050#27daed24df83322b58ab75b98a2376a2ece440d8f7c8bd11c6aa9da5145da64b We're still seeing occasional bursts of webhook errors, but still looking much healthier than before. Webhook delivery impacted on CPU, L40S and H100 hardware https://replicatestatus.com/incident/518050 Mon, 24 Feb 2025 11:18:00 -0000 https://replicatestatus.com/incident/518050#56a2fd7e4fb13149b32fbf21a1ce76517f8db95052851236a1349dc19345f844 It looks like things have now recovered, but we are continuing to monitor the situation. Webhooks were degraded from 10:51 to 11:17 UTC. Webhook delivery impacted on CPU, L40S and H100 hardware https://replicatestatus.com/incident/518050 Mon, 24 Feb 2025 11:13:00 -0000 https://replicatestatus.com/incident/518050#86de1da527e3e42287cd9eb695582dc17cd5c2674ac3c8d750b03ef1e59c09f8 Webhook delivery has been severely degraded for CPU, L40S and H100 hardware types. We are investigating and seeing things improve. Webhook delivery degraded for A100 hardware https://replicatestatus.com/incident/516841 Fri, 21 Feb 2025 17:23:00 -0000 https://replicatestatus.com/incident/516841#bad948c1f8e6d2998488b94cb8f436709862de10c618c2f82f51ce01a774a224 Webhooks are now being delivered in a timely fashion. Thanks for your patience! 
Webhook delivery degraded for A100 hardware https://replicatestatus.com/incident/516841 Fri, 21 Feb 2025 17:08:00 -0000 https://replicatestatus.com/incident/516841#8778a6a32a683a95e0bc39665c5971f9078e3bc88b86d4b3f90df38e810d1bb9 We are seeing a drop in webhook delivery success for all predictions on the A100 hardware type. We suspect this is due to an unscheduled restart of our storage cache layer, which resulted in a thundering herd of upstream requests to warm the cache, effectively using all of our available network capacity. High capacity utilization https://replicatestatus.com/incident/515287 Wed, 19 Feb 2025 08:07:00 -0000 https://replicatestatus.com/incident/515287#9eff42876dae884f2c5aaea77037ac20b722682b7b2ba3736946d8b60b34df57 We are back at full capacity. Thanks for your patience! High capacity utilization https://replicatestatus.com/incident/515287 Wed, 19 Feb 2025 06:04:00 -0000 https://replicatestatus.com/incident/515287#c12d5046890f731baae34426cc048bc4bc46f0207a2e376e58508e57e71f9535 We identified an unexpected scaling configuration applied to a large number of deployments that seems to account for all of the resource exhaustion. We rolled out a change to our autoscaling to ignore this particular scaling configuration and are now seeing recovery.
High capacity utilization https://replicatestatus.com/incident/515287 Wed, 19 Feb 2025 04:50:00 -0000 https://replicatestatus.com/incident/515287#6a4c7712585d32f7ef249f2bc2dad3b2e52dc81c8ad89fc837605f1cd88cf4ba We are continuing the investigation into the source of the incorrect increased demand for hardware. At this time we expect all instances to take significant time to boot. High capacity utilization https://replicatestatus.com/incident/515287 Wed, 19 Feb 2025 04:07:00 -0000 https://replicatestatus.com/incident/515287#c6b280c6c07c9c5b0ad372457b91d3be9e477b9f4daa2086ccad596fb16baad3 All hardware types are seeing unexpectedly high demand. We are investigating the root cause of this excessive demand. Some models failing to setup on A100 hardware https://replicatestatus.com/incident/515215 Wed, 19 Feb 2025 01:49:00 -0000 https://replicatestatus.com/incident/515215#e0d2b7f09da66cafa64041e57db25e0b1f0dd636309b5b9eec38012576022734 The rollback of the suspected misconfiguration is complete and all queues have recovered. Thanks for your patience!
Some models failing to setup on A100 hardware https://replicatestatus.com/incident/515215 Wed, 19 Feb 2025 01:37:00 -0000 https://replicatestatus.com/incident/515215#2fec75079501e78c7e4d90d0fd271ed1268d869219565f934b40db226a8f53cf We are seeing escalated setup failures for a subset of models on the A100 hardware type that are using a "fast boot" load method that is failing due to an access misconfiguration. We are rolling back the suspected change and we are watching closely. Homepage recovered https://replicatestatus.com/ Tue, 11 Feb 2025 12:21:02 +0000 https://replicatestatus.com/#d85d3aab8867bcdca3041d8c6044162d03018660356a2f2bc3bfdb3a50c1a296 Homepage recovered Homepage went down https://replicatestatus.com/ Tue, 11 Feb 2025 12:19:06 +0000 https://replicatestatus.com/#d85d3aab8867bcdca3041d8c6044162d03018660356a2f2bc3bfdb3a50c1a296 Homepage went down Predictions degraded for L40S, H100, and CPU hardware types https://replicatestatus.com/incident/509110 Fri, 07 Feb 2025 18:55:00 -0000 https://replicatestatus.com/incident/509110#50e3e2fc2d6f7421f819c1d46b487d44f342ed88b583ec04ebd7e2c06586af72 We are now caught up and running below capacity. Thanks for your patience! Predictions degraded for L40S, H100, and CPU hardware types https://replicatestatus.com/incident/509110 Fri, 07 Feb 2025 18:30:00 -0000 https://replicatestatus.com/incident/509110#2a3959c56a4a4a373a2b6d90d5367d593790b5991c820a69a222aee2e9597da0 We are currently running at capacity. Most queues have caught up, but the possibility of delays still exists, so we will keep this incident open in a "degraded" state. Predictions degraded for L40S, H100, and CPU hardware types https://replicatestatus.com/incident/509110 Fri, 07 Feb 2025 18:07:00 -0000 https://replicatestatus.com/incident/509110#b25d0e83f6dc56c2c1f58199f1e2760641433a04aba1237375057ec67d17f930 We have cleaned up all of the models that were crashing or locked up, and we are now scaled out to max capacity while working through queue backlogs. 
Predictions degraded for L40S, H100, and CPU hardware types https://replicatestatus.com/incident/509110 Fri, 07 Feb 2025 17:40:00 -0000 https://replicatestatus.com/incident/509110#cf422c71c7fb590a946e76cb9e5db723f11e76cae71dfe4c502cc1f755c092f8 The majority of the delays we are seeing right now are due to models not setting up, which is likely due to a combination of configuration changes that clearly are not working as intended. We reverted the configuration changes and now we are in the process of cleaning up models that are crash looping or locked up, and starting to see capacity recover. API instability for L40S, H100 and CPU workloads. https://replicatestatus.com/incident/508398 Thu, 06 Feb 2025 15:20:00 -0000 https://replicatestatus.com/incident/508398#765b8c6a00ae4479910b3ccd6073734a2a8a837f6f07af0c65f7e58ca5339acf This issue appears to have been a result of another bandwidth spike partly as a result of [our incident earlier today](https://replicatestatus.com/incident/508200). The issue has now been resolved. We are going to be working to prevent incidents of this kind from recurring. API instability for L40S, H100 and CPU workloads. https://replicatestatus.com/incident/508398 Thu, 06 Feb 2025 15:05:00 -0000 https://replicatestatus.com/incident/508398#3880f4e19cc44c3f3d38c7f00cbc2625abc257a496886edd268f561b141829d0 Requests for predictions made on models and deployments running on H100 and L40S instances are taking a long time to respond and sometimes timing out. We're trying to establish the cause of the issues. Setup failures on L40S and H100 hardware https://replicatestatus.com/incident/508200 Thu, 06 Feb 2025 10:45:00 -0000 https://replicatestatus.com/incident/508200#34b6f32ea26cc70bd5209018527ec226315842b945af0fd41287caf5ec01dbac This incident is now resolved. 
Setup failures on L40S and H100 hardware https://replicatestatus.com/incident/508200 Thu, 06 Feb 2025 10:27:00 -0000 https://replicatestatus.com/incident/508200#0367d6464b279b98937bb3c4fa261e37be1cff2115bf5feeb7eccc045406c314 Most systems are now operating normally again. We are continuing to monitor the situation. Setup failures on L40S and H100 hardware https://replicatestatus.com/incident/508200 Thu, 06 Feb 2025 10:06:00 -0000 https://replicatestatus.com/incident/508200#a883edf56fc0ae85ae24711d9bb838a5bb0844cad8e462b3767d55adc7b74197 As some of you may have noticed, things got worse before they got better. When the upstream storage provider restored service, models pending setup resulted in a large bandwidth surge. We're currently managing the effects of that surge, which has affected the speed of predictions and prediction webhook delivery. Setup failures on L40S and H100 hardware https://replicatestatus.com/incident/508200 Thu, 06 Feb 2025 09:09:00 -0000 https://replicatestatus.com/incident/508200#01dc49fe4a94b5f60bc1aa2af445be645c29e47cf77c454da82982d2d242f358 We've identified the underlying problem -- a storage outage at an upstream provider -- and are investigating paths to mitigate the impact of the upstream outage. Setup failures on L40S and H100 hardware https://replicatestatus.com/incident/508200 Thu, 06 Feb 2025 08:58:00 -0000 https://replicatestatus.com/incident/508200#102b6819b827d0808612810dea7f146cb79176b25708a4cc9f36de9565f58e36 We're investigating an issue that's preventing some models running on L40S hardware from successfully completing setup. We'll update when we have more information. 
Prediction creation unavailable for L40S and H100 hardware https://replicatestatus.com/incident/506674 Mon, 03 Feb 2025 06:30:00 -0000 https://replicatestatus.com/incident/506674#8a21a3d79f00eb748e79e748891db3182e9ddef2b008aca99e530c5c08314744 The cache used by the API for predictions was misconfigured for about 20 minutes, from 20:34 UTC until a rollback completed at 20:56 UTC. Models using the L40S and H100 hardware types were affected. During the period of misconfiguration, prediction creation was severely limited, resulting in many API responses with status 503. Instability and delays for H100 and L40S https://replicatestatus.com/incident/504337 Thu, 30 Jan 2025 09:56:00 -0000 https://replicatestatus.com/incident/504337#3a6dcaf9bee72f2ac32306320bd9f9d1955cc8202ddb2e0ae8139da0f7e4e26b The networking issue with our provider was resolved at 09:40 UTC, and all requests have been running normally since then. Instability and delays for H100 and L40S https://replicatestatus.com/incident/504337 Thu, 30 Jan 2025 09:24:00 -0000 https://replicatestatus.com/incident/504337#6ae9fd938076e4a53d2f9039a11e225fee0a80e3dd2434005eb51d14a04f1ad1 This appears to be an issue with the provider for our H100 and L40S cluster. We're working with our provider to resolve it. Other hardware types are unaffected.
Instability and delays for H100 and L40S https://replicatestatus.com/incident/504337 Thu, 30 Jan 2025 09:16:00 -0000 https://replicatestatus.com/incident/504337#6f956e9efe80ad68b6822509f80e5ed0e5f55a8825e22ee617ddfb17030a0671 Requests for predictions made on models and deployments running on H100 and L40S instances are taking a long time to respond and sometimes timing out. We're trying to establish the cause of the issues. Billing and metric delays https://replicatestatus.com/incident/496331 Wed, 15 Jan 2025 16:29:00 -0000 https://replicatestatus.com/incident/496331#94051b82dcdb502920a332f930c2a46682c06a8001bb95095ffa114b7bcbf5ce The background jobs are running again, and we've caught up to present as of about 16:09 UTC (20 minutes ago). Billing and metric delays https://replicatestatus.com/incident/496331 Wed, 15 Jan 2025 15:25:00 -0000 https://replicatestatus.com/incident/496331#cd0a3f20876e1de1646787edaad9d465bf09bff03ca1e61ad21aeac5dde506e6 Most of our background jobs – including billing updates, metric ingestion and automatic deletion of prediction data – stopped at about 12:17 UTC (a little over 3 hours ago). We're working on getting them running again. Prediction workloads should be unaffected.
Homepage recovered https://replicatestatus.com/ Wed, 15 Jan 2025 13:30:45 +0000 https://replicatestatus.com/#c29c3e9e318da752f2fd37cf9911f8c66fca67bf23a985478673799a29a58c7d Homepage recovered Homepage went down https://replicatestatus.com/ Wed, 15 Jan 2025 13:18:04 +0000 https://replicatestatus.com/#c29c3e9e318da752f2fd37cf9911f8c66fca67bf23a985478673799a29a58c7d Homepage went down Dashboard inaccessible due to redirect https://replicatestatus.com/incident/493358 Thu, 09 Jan 2025 16:37:00 -0000 https://replicatestatus.com/incident/493358#3ac92a1bda2eff0bc26c40c59bb3b43a936b6ed6bb4993b5e390a81dca8c9468 The redirect has been reverted and the dashboard should be accessible again. Dashboard inaccessible due to redirect https://replicatestatus.com/incident/493358 Thu, 09 Jan 2025 16:29:00 -0000 https://replicatestatus.com/incident/493358#03e145d3eaea8627ad8f93afe44a8e369b182236218a18ab11849ab4a650c0ea We are seeing consistent redirects from the dashboard to the homepage and are working to roll out a fix. Homepage recovered https://replicatestatus.com/ Thu, 09 Jan 2025 16:17:46 +0000 https://replicatestatus.com/#4b01ed079795574de1f024b4604260e304203ed31f11f2e0820aedc8c9fac55b Homepage recovered Homepage went down https://replicatestatus.com/ Thu, 09 Jan 2025 16:16:48 +0000 https://replicatestatus.com/#4b01ed079795574de1f024b4604260e304203ed31f11f2e0820aedc8c9fac55b Homepage went down Data deletion delayed https://replicatestatus.com/incident/476618 Sat, 14 Dec 2024 13:28:00 -0000 https://replicatestatus.com/incident/476618#bafac6d3687724b8fd4290eab6051eabc8d988a893fd2ca9f754c5dc7e03c3f3 We've caught up with prediction deletion, and our system is once again deleting predictions on time. 
L40s temporary stock out https://replicatestatus.com/incident/477418 Sat, 14 Dec 2024 00:41:00 -0000 https://replicatestatus.com/incident/477418#31b5738fa12cf3f3ef10d322a572377cb6a9e016fcdee2a320ac8471d44f35dc At 22:15 UTC Jan 13 an issue forced us to shift some GPU workloads, which caused stock outs leading to increased wait times to spin up new model instances using L40s. The work has completed and GPUs are now available as normal as of 00:15 UTC Jan 15th. Data deletion delayed https://replicatestatus.com/incident/476618 Thu, 12 Dec 2024 14:14:00 -0000 https://replicatestatus.com/incident/476618#e0e542b71dd4590a02996385353224f331c25b17e4dbc25909a4874074534857 Automatic deletion of data for predictions was stopped for predictions created after 2024-12-05. We've fixed the issue and are deleting data again, but it'll take some time for us to process all the predictions that are awaiting deletion. T4 predictions unavailable https://replicatestatus.com/incident/476191 Wed, 11 Dec 2024 20:27:00 -0000 https://replicatestatus.com/incident/476191#359b2317ab266d2651fcbedf15160af8377e3a2081e20e7cdcc952707abb3871 T4 predictions were unavailable between approximately 18:00 and 20:27 UTC. We found an issue with the NVIDIA driver installation on our T4 hardware targets. This only affected predictions running against the T4 hardware. We have deployed a fix and are backfilling the outstanding predictions.
A40 Hardware Network Maintenance https://replicatestatus.com/incident/472508 Thu, 05 Dec 2024 22:00:46 +0000 https://replicatestatus.com/incident/472508#79160327a833096390c8282867c6616c6ea820545aba798f041fbd171d12aa10 Maintenance completed A40 Hardware Network Maintenance https://replicatestatus.com/incident/472508 Thu, 05 Dec 2024 14:00:46 -0000 https://replicatestatus.com/incident/472508#888ae65956e5664798abb671fe48255d3b1995b90b7acdc052d469b683e7bcfc The region hosting the A40 class hardware will have network maintenance performed starting December 5th, 2024 at 2:00PM UTC with an estimated completion of December 5th, 2024 at 10:00PM UTC. During this maintenance window we are expecting intermittent disruption of network connectivity. This only impacts the A40 class hardware. We appreciate your patience and understanding. API errors and request delays https://replicatestatus.com/incident/468016 Tue, 26 Nov 2024 19:16:00 -0000 https://replicatestatus.com/incident/468016#8ac70a22e4825701966153d8ca44c80d55be7e82fb104049d7c70a14b4d0772b We're seeing healthy behavior since our upstream provider applied further fixes in the last hour. We will be sharing further details of how this happened once they are available. API errors and request delays https://replicatestatus.com/incident/468016 Tue, 26 Nov 2024 17:16:00 -0000 https://replicatestatus.com/incident/468016#d747b69e716aee9d91b5fe8f41a501b8b96bf5471f88d8869241f90a4a131ffa The partner we're working with on this issue has shared with us that they are struggling to manage extremely high bandwidth to some of their systems and this is causing the impact which is affecting Replicate and our customers. If you're affected and can change your models or deployments to run on other hardware (such as our newly-added L40S GPUs) that will mitigate the impact you're seeing, as this only impacts A100 GPUs. 
API errors and request delays https://replicatestatus.com/incident/468016 Tue, 26 Nov 2024 16:30:00 -0000 https://replicatestatus.com/incident/468016#79d3c62849a23a09f3308f08e0365b394dedffd9d32b02f82b3ebacc5a0be806 We've noticed that the fix previously applied appears to have regressed. We've escalated this issue and will provide an update as soon as we have one. API errors and request delays https://replicatestatus.com/incident/468016 Tue, 26 Nov 2024 15:57:00 -0000 https://replicatestatus.com/incident/468016#092b9096f27728fd4248f598ceb51386f09b0ceb9d5ecde17f5843c6901c2a3c As of a few minutes ago we believe the underlying issues here have been resolved. We don't fully understand the nature of the problem yet but will be following up with our partners to make sure we (and they) do. API errors and request delays https://replicatestatus.com/incident/468016 Tue, 26 Nov 2024 14:32:00 -0000 https://replicatestatus.com/incident/468016#a89c605cabd33094583c38d5227860768b5826cde392dae3b0bc01bbca30e7a0 We're continuing to investigate this issue, and are aware of the inconvenience this may be causing. We ask for your patience as we work with our infrastructure providers to identify the source of the disruption. API errors and request delays https://replicatestatus.com/incident/468016 Tue, 26 Nov 2024 12:28:00 -0000 https://replicatestatus.com/incident/468016#b1f79f5427da179791c1611ec9535afda456a06656703a6cb0fd6468ad4b3216 We're aware of an issue affecting A100 hardware types which is causing delays and error responses from our API. We are investigating the issue and will provide an update when we have more information. 
Emergency Maintenance Streaming API https://replicatestatus.com/incident/464309 Tue, 19 Nov 2024 20:05:47 +0000 https://replicatestatus.com/incident/464309#825e963f4f6da08a196f0b4ef19e5f7b8e390b74b82e4aa3d86ebe7ac1c153f3 Maintenance completed Emergency Maintenance Streaming API https://replicatestatus.com/incident/464309 Tue, 19 Nov 2024 20:05:47 -0000 https://replicatestatus.com/incident/464309#ad25b7927d376861be256555266fe2fc5ca159aa0fd83bbbcc99eb8387db75bd We have identified an issue with the streaming api endpoints and will be releasing an emergency fix. This will cause a disruption in streaming API services. We expect the disruption to be relatively short, but it could be as much as 30 minutes. Updates will be provided during this maintenance and when it completes. This impacts streaming endpoints for H100, A40, and A100 hardware types. Emergency Maintenance Streaming API https://replicatestatus.com/incident/464309 Tue, 19 Nov 2024 20:05:47 +0000 https://replicatestatus.com/incident/464309#825e963f4f6da08a196f0b4ef19e5f7b8e390b74b82e4aa3d86ebe7ac1c153f3 Maintenance completed Emergency Maintenance Streaming API https://replicatestatus.com/incident/464309 Tue, 19 Nov 2024 20:05:47 -0000 https://replicatestatus.com/incident/464309#ad25b7927d376861be256555266fe2fc5ca159aa0fd83bbbcc99eb8387db75bd We have identified an issue with the streaming api endpoints and will be releasing an emergency fix. This will cause a disruption in streaming API services. We expect the disruption to be relatively short, but it could be as much as 30 minutes. Updates will be provided during this maintenance and when it completes. This impacts streaming endpoints for H100, A40, and A100 hardware types. 
Emergency Maintenance Streaming API https://replicatestatus.com/incident/464309 Tue, 19 Nov 2024 20:00:46 -0000 https://replicatestatus.com/incident/464309#ad25b7927d376861be256555266fe2fc5ca159aa0fd83bbbcc99eb8387db75bd We have identified an issue with the streaming api endpoints and will be releasing an emergency fix. This will cause a disruption in streaming API services. We expect the disruption to be relatively short, but it could be as much as 30 minutes. Updates will be provided during this maintenance and when it completes. This impacts streaming endpoints for H100, A40, and A100 hardware types. Emergency Maintenance Streaming API https://replicatestatus.com/incident/464309 Tue, 19 Nov 2024 20:00:17 -0000 https://replicatestatus.com/incident/464309#ad25b7927d376861be256555266fe2fc5ca159aa0fd83bbbcc99eb8387db75bd We have identified an issue with the streaming api endpoints and will be releasing an emergency fix. This will cause a disruption in streaming API services. We expect the disruption to be relatively short, but it could be as much as 30 minutes. Updates will be provided during this maintenance and when it completes. This impacts streaming endpoints for H100, A40, and A100 hardware types. 
Flux Dev Inference Delays https://replicatestatus.com/incident/463752 Tue, 19 Nov 2024 01:11:00 -0000 https://replicatestatus.com/incident/463752#79851d5240581589ea7c4f4ad26627d507daec82187ce717527711359d57b098 The backlog of predictions has been worked through and we are seeing normal prediction times return. Thank you for your patience. Flux Dev Inference Delays https://replicatestatus.com/incident/463752 Tue, 19 Nov 2024 01:06:00 -0000 https://replicatestatus.com/incident/463752#d2f8ab0434980cc6f9cd59d0ac479464b5a86b2e6048916d6755c8328f2f8086 The flux dev instances have been restarted and are seeing prediction processing increase. The backlog of queued predictions is being processed. We expect prediction times to return to normal within a relatively short window. Flux Dev Inference Delays https://replicatestatus.com/incident/463752 Tue, 19 Nov 2024 00:30:00 -0000 https://replicatestatus.com/incident/463752#c78958ee8e59cef73f86b32bc5e552951d6c0a7c309f5753b16201f272e50bae We are aware of a significant delay in pickup of Flux Dev Predictions. Our engineering team is looking into the issue. We will provide updates as they become available. 
Flux Dev Prediction Delays https://replicatestatus.com/incident/462887 Sun, 17 Nov 2024 02:35:00 -0000 https://replicatestatus.com/incident/462887#d2f684870dcc04c91945d2c0b0749b78d1ed6d241b0541bfd080e10a21453a9b As of 0223 UTC Nov 17th the backlog has been processed and Flux Dev is handling requests as expected. Flux Dev Prediction Delays https://replicatestatus.com/incident/462887 Sun, 17 Nov 2024 02:05:00 -0000 https://replicatestatus.com/incident/462887#7c5b51fcb458aebdba02580f0113562705867c1867dfee15124afa5943f0042e Starting at about 19:42 UTC on November 16 2024 the Flux Dev model started to have significant delays in processing predictions. Over the course of ~5 hours the predictions showed a longer delay in being picked up for processing. At 0159 UTC on November 17 2024, the instances were administratively restarted to provide relief for the building queue. At this time we have added ~20 instances and the queue backlog is rapidly being processed. Only the Flux Dev model is impacted by this issue. Predictions failing for H100 hardware https://replicatestatus.com/incident/461186 Thu, 14 Nov 2024 03:19:00 -0000 https://replicatestatus.com/incident/461186#0e0f210ff7ff3e6c60e64002932520e2b5edbf12932183f8234fc9b507688bfa We have identified a hardware failure and have isolated the affected node(s). We are seeing a return to normal service for H100-targeted predictions and trainings. Predictions failing for H100 hardware https://replicatestatus.com/incident/461186 Thu, 14 Nov 2024 03:13:00 -0000 https://replicatestatus.com/incident/461186#77bf92978f6f6ae3819bc0d68bb489a63ffa2d85e660889cfbee913e7a48e340 Predictions and trainings targeting h100 hardware are currently failing to create. Our engineers are working on identifying the source of these failures and will provide updates as information becomes available. This incident impacts all h100-class hardware targets. 
Flux Pro, Recraft, and Ideogram failed predictions https://replicatestatus.com/incident/459589 Mon, 11 Nov 2024 17:10:00 -0000 https://replicatestatus.com/incident/459589#3a2929ae0be5b2eee378244554b5274ff5b0ae068b31693384f62d355c752707 We identified an internal component that caused errors with Flux Pro, Recraft, and Ideogram models. The errors occurred between approximately 1540 UTC and 1709 UTC on November 11, 2024. As of the time this status update is published, the internal component has been rolled back and we are seeing normal prediction handling for impacted models. Prediction delays for black-forest-labs/flux-1.1-pro and meta/meta-llama-3-70b-instruct https://replicatestatus.com/incident/458266 Fri, 08 Nov 2024 19:05:00 -0000 https://replicatestatus.com/incident/458266#50268e23eb3fd448fd3279dbc018cb96f5311cb21033d33037abef78ac0d65ee Between the hours of 16:30 - 19:00 UTC, predictions sent to [flux-1.1-pro](https://replicate.com/black-forest-labs/flux-1.1-pro) and [meta-llama-3-70b-instruct](https://replicate.com/meta/meta-llama-3-70b-instruct) were delayed by up to 1 hour. This was the result of a rollout of an internal component that broke a small number of models which we then rolled back. Due to the high volume of predictions handled by these two models, the backlog grew fairly quickly, but we have now caught up with all queued predictions. H100 Hardware Queueing https://replicatestatus.com/incident/457107 Wed, 06 Nov 2024 20:34:00 -0000 https://replicatestatus.com/incident/457107#c7b42f376e4f3ddd8eda35f9df9ed8321a4f662c56221602393777383feddc95 After evaluation the queues impacted (predictions submitted prior to migration to the alternate region) are being truncated. You will not be billed for predictions that are dropped in this manner, however, the predictions may appear as "in process" or "queued" for a period of time until the platform automation identifies them as dropped. It is safe to cancel and/or resubmit predictions impacted in this manner. 
This truncation impacts a few thousand total predictions across all models targeting H100 hardware type. Additionally, Flux Fine Tune predictions are not being truncated in this manner and will continue to process the backlog. H100 Hardware Queueing https://replicatestatus.com/incident/457107 Wed, 06 Nov 2024 20:12:00 -0000 https://replicatestatus.com/incident/457107#ab6e02b52dcabd21f6b2ea73aac523d434c95ef24d20cf2d155b6d88c774c3ee All new predictions for H100-class hardware will now be routed to the alternate region. Past predictions that are impacted by this outage may see significant delays for processing. We are working to address the large queue buildup prior to moving to the new region. H100 Hardware Queueing https://replicatestatus.com/incident/457107 Wed, 06 Nov 2024 19:48:00 -0000 https://replicatestatus.com/incident/457107#53c974ab1bccb8dd2dfabdb877c3c85cdb621698d09cf2272c4e25b843837e1a Approximately 50% of all h100 traffic has been redirected to our alternate region. We are working to shift the rest of the prediction workloads as quickly as possible. H100 Hardware Queueing https://replicatestatus.com/incident/457107 Wed, 06 Nov 2024 19:29:00 -0000 https://replicatestatus.com/incident/457107#59566e58d311ead7f407b60953c835b1c6645674b4ae21bb147b4a74219ce47c The impact of this incident encompasses all workloads targeting h100 hardware classes: Flux Fine Tunes, Flux Dev (migrated; new predictions not impacted), Flux Schnell (migrated; new predictions not impacted), Stable Diffusion 3.5 (all variants), and bytedance/hyper-flux-16step (the list above is not all inclusive). We are actively migrating all workloads to additional capacity to alleviate the problems. Updates will be provided as each model is migrated. 
H100 Hardware Queueing https://replicatestatus.com/incident/457107 Wed, 06 Nov 2024 19:11:00 -0000 https://replicatestatus.com/incident/457107#b5a78ba400160364f164aff8128537c5c27483d662cfa7ff0ce8f1b6c00f140a We have moved new traffic for flux schnell to additional capacity in another region. New predictions for flux schnell will be processed within the expected timeframes. The backlog of predictions will continue to be processed. Flux Dev traffic will be migrated soon. H100 Hardware Queueing https://replicatestatus.com/incident/457107 Wed, 06 Nov 2024 18:48:00 -0000 https://replicatestatus.com/incident/457107#90687a4531afb6cacd5a4b36e7398944837fd8f8048ecf271b907c373a36687f We are seeing a rapid buildup of queued predictions to flux-dev and flux-schnell models. File streaming not working https://replicatestatus.com/incident/456919 Wed, 06 Nov 2024 11:27:00 -0000 https://replicatestatus.com/incident/456919#3a1ddb31ab03a13484f765b259fee820ef8e37faa05d5bc8fb79227569aa6e01 Our fix has rolled out and file output streaming is now working again. File streaming not working https://replicatestatus.com/incident/456919 Wed, 06 Nov 2024 11:20:00 -0000 https://replicatestatus.com/incident/456919#6a54087822be4608d648f33c5056050cda9c78e9bbeb72998943b53eff87dc19 We're aware of a problem impacting streaming file outputs for models that support it. This will also affect file outputs returned by the synchronous API. We believe we understand the problem and are in the process of rolling out a fix. Autoscaling Impacted for A40 and A100 class GPUs https://replicatestatus.com/incident/452760 Tue, 29 Oct 2024 22:05:00 -0000 https://replicatestatus.com/incident/452760#65fa08add7dbc11978a890a876f69ebd1091c87c5d8f86ab6315267cc3299367 A40 scaling has returned to normal time frames. 
Autoscaling Impacted for A40 and A100 class GPUs https://replicatestatus.com/incident/452760 Tue, 29 Oct 2024 21:15:00 -0000 https://replicatestatus.com/incident/452760#aaebe09c04cb1ab8f4dd34671d766a2782c37cd120c2b87384f06a5fbcbb8c13 A100 class GPUs are no longer affected. We are working to restore normal scaling timeframes for A40-class GPUs Autoscaling Impacted for A40 and A100 class GPUs https://replicatestatus.com/incident/452760 Tue, 29 Oct 2024 21:05:00 -0000 https://replicatestatus.com/incident/452760#6fee3b8f26e379ae0ae8cd08433008791ae1630f230a4d67bb1bcdbca80d37f7 The source of the scaling delays has been identified and a fix is being rolled out. We will provide an update as progress for the rollout continues. Autoscaling Impacted for A40 and A100 class GPUs https://replicatestatus.com/incident/452760 Tue, 29 Oct 2024 20:45:00 -0000 https://replicatestatus.com/incident/452760#0f1faf423bd504c19829fc490e58e2900f5509bef4ef092b5966e3f6a0f5643b We are aware of an issue causing scaling to be significantly delayed for all a40 and A100 class GPU models. Analysis of the source of the impact is under way and we will provide updates as information becomes available. H100 model serving down https://replicatestatus.com/incident/452646 Tue, 29 Oct 2024 17:46:00 -0000 https://replicatestatus.com/incident/452646#1bc321b453cba8d374a779228328a0ebd7b9ab2c8370fffe9a220169e528464f We have moved traffic for the models impacted (H100 target hardware) back to the H100 class GPUs. Predictions and trainings targeting H100 class GPUs have returned to normal. 
H100 model serving down https://replicatestatus.com/incident/452646 Tue, 29 Oct 2024 16:10:00 -0000 https://replicatestatus.com/incident/452646#80003e98a7d0ff0149b3fadfa7594237c648ee248a15bbd351a6e510ffa08e63 Predictions on flux-dev are now also running in a different cluster. replicate.com database maintenance https://replicatestatus.com/incident/452538 Tue, 29 Oct 2024 16:00:08 +0000 https://replicatestatus.com/incident/452538#59cab564ac717bb32905df1b5042f701d468e742a4d26ed6560854c27ddce132 Maintenance completed H100 model serving down https://replicatestatus.com/incident/452646 Tue, 29 Oct 2024 15:45:00 -0000 https://replicatestatus.com/incident/452646#ae722b42d44d1f95f880164bc76c155e0a68697ad0c7349a419d4dfb9012de8c Predictions on flux-schnell and flux fine tunes are successfully running in another cluster. Predictions on flux-dev are still not working. H100 model serving down https://replicatestatus.com/incident/452646 Tue, 29 Oct 2024 15:37:00 -0000 https://replicatestatus.com/incident/452646#997ff3fa9ae27f26d8e735b3b0abab845a19d890eec644ec89708019bdf5ea0e We've moved flux models and fine tunes to run in a different cluster until we can get this cluster back online. 
H100 model serving down https://replicatestatus.com/incident/452646 Tue, 29 Oct 2024 15:28:00 -0000 https://replicatestatus.com/incident/452646#e8035f0904b3b4e0dbcc6e6a56bbb3ff6a98fe33f6b7afe5a72630b8f4940f73 One of our clusters is currently down. We know the immediate cause, and are working on fixing it. This is the cluster that runs our H100s, so all H100 models are currently down, including flux and flux fine tunes. replicate.com database maintenance https://replicatestatus.com/incident/452538 Tue, 29 Oct 2024 10:00:08 -0000 https://replicatestatus.com/incident/452538#cbf868ad25b64e8fbaaf2b31fb693033fc2ade1a3e6386730c186b12905ca0ff We're performing some database maintenance on the replicate.com website. This may cause errors or slow responses for some requests for up to 30 minutes. Most requests should be unaffected. Instability and delays for H100s https://replicatestatus.com/incident/451288 Sun, 27 Oct 2024 06:52:00 -0000 https://replicatestatus.com/incident/451288#6156eb6b1ae149702bdc11890af0e924315d5ea362b0dd5f867e8e58039dcaf8 We're processing work without problems once again (as of about 06:30UTC) but are continuing to investigate the source of instability. 
Instability and delays for H100s https://replicatestatus.com/incident/451288 Sun, 27 Oct 2024 06:31:00 -0000 https://replicatestatus.com/incident/451288#82b3131024bc9c34074468abf28ad3c02937c60fbb3df6ff048977a2cfd83f48 We're investigating some signals of instability and delays processing work for H100 hardware. Some models failing to run predictions https://replicatestatus.com/incident/449862 Thu, 24 Oct 2024 07:00:00 -0000 https://replicatestatus.com/incident/449862#48eaf852e9bb1ca8bd84f6c02d2bb02477ab71f3c116139e1357f24345807841 [This is a retrospective status update published at 08:00 UTC] We identified the problem -- we had rolled out a version of cog to our serving cluster that reintroduced a bug we'd previously fixed -- and have now rolled back that change. Some models failing to run predictions https://replicatestatus.com/incident/449862 Thu, 24 Oct 2024 00:00:00 -0000 https://replicatestatus.com/incident/449862#7f055dc092b20f011e1cbc1f79993020713ef417be3e8f5f6ea1d188e18fc6b0 [This is a retrospective status update published at 08:00 UTC] Between about 00:00 UTC and 07:00 UTC on 24 Oct 2024, a small number of models will have stopped working. Predictions on these models may have errored with "Prediction timed out" or other generic errors. We rolled out a version of cog to our serving cluster that reintroduced a bug we'd previously fixed. This change has been rolled back. We'd like to acknowledge that this is not the first time a cog update has broken some subset of models running on Replicate. We know this isn't acceptable and we will be working to change how these rollouts work. Thank you for your patience and understanding. Autoscaling impacted for A100 class hardware https://replicatestatus.com/incident/448821 Tue, 22 Oct 2024 16:18:00 -0000 https://replicatestatus.com/incident/448821#6b5fd939c3ce52fdb106cdc7b617dee9da27cacabe387ca8475d4e947417759c The underlying platform disruption has been resolved. Scaling for A100 Hardware has returned to normal. 
Autoscaling impacted for A100 class hardware https://replicatestatus.com/incident/448821 Tue, 22 Oct 2024 16:00:00 -0000 https://replicatestatus.com/incident/448821#349ae6c47dea78406f0ea5e1b1515bf4e698aeb2efcf40f356fcd192070e8d04 We have been alerted to an issue impacting the ability to scale instances for the A100 class hardware. The engineering team is monitoring and working to minimize the impact. Not all scaling events will be impacted, but it is expected that there will be a delay in scaling instances on A100 hardware. 503s on Replicate Files API https://replicatestatus.com/incident/446970 Fri, 18 Oct 2024 20:13:00 -0000 https://replicatestatus.com/incident/446970#6d300dcedec74a356ce7045f1db37505f62bf50daa00a2edf51e86d0e96d54ac We have deployed a permanent fix for the 503s for the `files-api` 503s on Replicate Files API https://replicatestatus.com/incident/446970 Fri, 18 Oct 2024 19:26:00 -0000 https://replicatestatus.com/incident/446970#128c9a2d0b86156fc1bb3aa9b8bd509b4f4bf27dff26c7701ef5ee346f2cead2 The source of the 503 errors has been identified and a temporary fix has been put into place. We expect to have a permanent solution shortly. At this time files-api uploads should resume working. We are monitoring the situation and will provide an update once a permanent fix is in place. 503s on Replicate Files API https://replicatestatus.com/incident/446970 Fri, 18 Oct 2024 19:21:00 -0000 https://replicatestatus.com/incident/446970#70a86a331fcca55aa223da9263c42f33501ea11331ca55d6f216d8849da730c8 We have been made aware of failures with the replicate-files-api. This is resulting in 503 errors on file uploads. Engineers are working to identify the source of the 503 errors. We will provide an update as soon as information becomes available. 
Prediction failures for A100 class hardware https://replicatestatus.com/incident/445919 Thu, 17 Oct 2024 01:51:00 -0000 https://replicatestatus.com/incident/445919#fb9f8503e6c966ccc5983a1b69d892b6b46ae0a8a073facff8e3efe6d37a01a4 We have rolled out a change that resolved the failure cases. Prediction failures for A100 class hardware https://replicatestatus.com/incident/445919 Thu, 17 Oct 2024 01:48:00 -0000 https://replicatestatus.com/incident/445919#70a3217478c8c4cf416160f74cc8dabcb70498d88832f3e440bd96d5005b1aea We have identified an error that has caused a short window of failures in creation and updates for predictions on a100 class hardware. This brief outage lasted for about 5 minutes between October 17th 0140 UTC and 0145 UTC Delay for predictions targeting A100 GPUs https://replicatestatus.com/incident/445249 Wed, 16 Oct 2024 03:27:00 -0000 https://replicatestatus.com/incident/445249#7520271e659a5eb9c62a46d49187cbe27b1b89552280291d592be4aabcc65c3a Our infrastructure provider is continuing to work on a network problem, but mitigations appear to be allowing workloads to flow normally. Delay for predictions targeting A100 GPUs https://replicatestatus.com/incident/445249 Wed, 16 Oct 2024 03:09:00 -0000 https://replicatestatus.com/incident/445249#993cc4d9a52e0caf0fae44ab1376e2245cc9655e1aea0ef159db679b4ed1979a We are seeing improved network performance which is having a positive impact on scaling speed and queue wait times. Delay for predictions targeting A100 GPUs https://replicatestatus.com/incident/445249 Wed, 16 Oct 2024 02:00:00 -0000 https://replicatestatus.com/incident/445249#1c2a2ef91d37bae90ac5272655aa13f85eff356caa028491560fabdadcf82e1a Our provider has made some changes to help mitigate the slow boot times from the A100 hardware instances. Some of these instances have successfully booted and handled predictions. We continue to work towards identifying the underlying issue and correcting it for all A100-class instances. 
Delay for predictions targeting A100 GPUs https://replicatestatus.com/incident/445249 Wed, 16 Oct 2024 01:18:00 -0000 https://replicatestatus.com/incident/445249#60a9f547a78001edb817d3e182513effea2487072a0dc88bd7a94f51476d32ee We are continuing to work with our infrastructure provider, as we are still seeing much longer delays than expected in resource freeing. We are also investigating a drop in network throughput. Delay for predictions targeting A100 GPUs https://replicatestatus.com/incident/445249 Wed, 16 Oct 2024 00:35:00 -0000 https://replicatestatus.com/incident/445249#d3282435a84600212bc30f522540d6c5cc0f2687e5a0fb21d4a8aa2dbc788922 We are continuing to work with our infrastructure provider to free up resources, and we are beginning to see workloads scale out correctly. Delay for predictions targeting A100 GPUs https://replicatestatus.com/incident/445249 Tue, 15 Oct 2024 23:47:00 -0000 https://replicatestatus.com/incident/445249#c9e5e1bffd4e3b7ab53876566d3d25d7ecb70bbd5fbaf7e7b34d1dd122eeb7f6 Predictions targeting A100 GPUs are delayed, and we are investigating issues with hovering close to capacity due to delayed resource freeing for completed work. Predictions degraded between two data centers https://replicatestatus.com/incident/444589 Tue, 15 Oct 2024 00:38:00 -0000 https://replicatestatus.com/incident/444589#d71a8739f8d7de043edae670f2ac067d0203be3c15d7e5cc8ee4e075e193cd6b We have removed an upstream from our API load balancer that seems to be accounting for all of the timeout errors and we are seeing an immediate improvement. Predictions degraded between two data centers https://replicatestatus.com/incident/444589 Tue, 15 Oct 2024 00:03:00 -0000 https://replicatestatus.com/incident/444589#ca98c1cab132df0ab5af3a29d336589366321f4cf37b516ba453d17b5e2e1775 We are investigating a network degradation between two data centers that is affecting a subset of predictions. 
Flux-dev fine-tuning outage https://replicatestatus.com/incident/443799 Sun, 13 Oct 2024 12:25:00 -0000 https://replicatestatus.com/incident/443799#c57c2ed5e315427cf3c5ed0c8270ee8744cc728fa16a75a1706396266575a9eb We truncated the queues, so in-flight fine-tunes will not complete and you will not be charged for them. Fine-tunes are now working normally. Flux-dev fine-tuning outage https://replicatestatus.com/incident/443799 Sun, 13 Oct 2024 12:10:00 -0000 https://replicatestatus.com/incident/443799#bd7a110ff9d7a21819b4759bf5da384dd4772f1c962c2f1d20f031675a5d79cb We have resolved the problem with the storage layer. However, there is a significant queue built up for fine-tunes so it may take time to get through the backlog of work. Flux-dev fine-tuning outage https://replicatestatus.com/incident/443799 Sun, 13 Oct 2024 11:36:00 -0000 https://replicatestatus.com/incident/443799#a912ed1f377f2cd0094ef8d59382b5106aa43825ace3c32fa8d64ea8d79e55ca We are having issues with fine-tuning of flux-dev models. There was a problem with the storage layer where fine-tuned weights were cached. We are working on fixing it. Issues booting models on A40 hardware https://replicatestatus.com/incident/441880 Wed, 09 Oct 2024 16:34:00 -0000 https://replicatestatus.com/incident/441880#c26ab6a17f7f199e6530f892fa861a53ed5f59ecbe207022af465013b8b1c0c5 At this time all but a handful of instances have recovered and prediction serving should be normal for the A40 hardware type. We expect the remaining (low single digit) number of instances to be running within the next few minutes. Issues booting models on A40 hardware https://replicatestatus.com/incident/441880 Wed, 09 Oct 2024 16:18:00 -0000 https://replicatestatus.com/incident/441880#14e944c6959f1e310b0186861be49b44cff66623bfb72b04c61d4d19d243578b We have made a configuration change to circumvent the identified networking issue. We are seeing improvements with A40 boots and working through the backlog of predictions and instance boots. 
Issues booting models on A40 hardware https://replicatestatus.com/incident/441880 Wed, 09 Oct 2024 15:56:00 -0000 https://replicatestatus.com/incident/441880#e1d08d17804dcedfd268a0edcfe707017ccff9f60955d64c8ebfd95f9b1d3ffa We're still working with our networking provider to identify the root cause. We will continue to update as we learn more. Flux-schnell degraded https://replicatestatus.com/incident/441881 Wed, 09 Oct 2024 15:28:00 -0000 https://replicatestatus.com/incident/441881#112a9765e79ddafebe45bdfa85f9aa22c9ea2230b51e26e624184f1c0e9b5307 The queue is now drained and performance is back to normal levels. Flux-schnell degraded https://replicatestatus.com/incident/441881 Wed, 09 Oct 2024 15:24:00 -0000 https://replicatestatus.com/incident/441881#4cda192f9c40e3fd7421d79a398bfa30c050d8e02243dad561f8ec8eb12a8144 We're burning through the queue rapidly now and expect to clear the backlog within the next 5-10 minutes. Flux-schnell degraded https://replicatestatus.com/incident/441881 Wed, 09 Oct 2024 15:12:00 -0000 https://replicatestatus.com/incident/441881#5b070d22c203c343b9db65cb0f2e1e8be35883450a2b85713f3c11226780295d We have identified the cause and queues are starting to come down again. Queueing delay remains high while we work through the backlog. Issues booting models on A40 hardware https://replicatestatus.com/incident/441880 Wed, 09 Oct 2024 15:10:00 -0000 https://replicatestatus.com/incident/441880#c6a8dcbd0e8beadfa71a36dab3963cce9240fdf087d141455455f8da0810e6e2 We've tracked this down to a networking issue and we've escalated to our networking provider. We will update as we progress. Flux-schnell degraded https://replicatestatus.com/incident/441881 Wed, 09 Oct 2024 14:45:00 -0000 https://replicatestatus.com/incident/441881#f019f96d40dab38bd4a51e20f1ad185abcaa0119a5f3bb7d0b39524afad8ad58 Performance of https://replicate.com/deployments/replicate/official-model-flux-schnell is currently degraded, and significant queues have built up. We are investigating. 
Issues booting models on A40 hardware https://replicatestatus.com/incident/441880 Wed, 09 Oct 2024 14:45:00 -0000 https://replicatestatus.com/incident/441880#a5a4bc2399b4eaca0c27794b932785586121660aa59f4e7f8b89690b4ffa21af We are seeing issues with pulling model images from our A40 cluster. We are investigating. A40 workloads disrupted https://replicatestatus.com/incident/441608 Wed, 09 Oct 2024 05:45:00 -0000 https://replicatestatus.com/incident/441608#4925607855d74f6b93fd5aa727f0f5bbe10b07584b41fe9fed435d5d506c8787 We've deployed a fix. Things look to be stabilizing but we are continuing to monitor. A40 workloads disrupted https://replicatestatus.com/incident/441608 Wed, 09 Oct 2024 05:34:00 -0000 https://replicatestatus.com/incident/441608#e12334dfa2622742a9a99ab999bd7ac2079b2292cc2045393e7518b12385f256 We are seeing ongoing issues with models running on our A40 hardware type. We are investigating. Prediction serving degraded https://replicatestatus.com/incident/435409 Fri, 27 Sep 2024 00:25:00 -0000 https://replicatestatus.com/incident/435409#7ad7107e21b77414c511b360b1c5ad3b1ed4e3786e639f200de7306a0362062e Upon further investigation we were unable to work through the backlog of predictions. Backlogged predictions have been cancelled. It will take some time before the prediction IDs report failed in the replicate web console. 
Users may resubmit any of these cancelled predictions. All predictions that have been submitted since the last update at Sep 26 2024 at 11:52pm UTC are unaffected by this cancellation of predictions. Prediction serving degraded https://replicatestatus.com/incident/435409 Thu, 26 Sep 2024 23:52:00 -0000 https://replicatestatus.com/incident/435409#bffd75151ad93934ed46d3a4aee8860a5371e8501f2cd601c3ad2c1af927380c We have shifted new flux fine-tune predictions to another region while we look into the root of the slow prediction processing. At this time previous predictions submitted for Flux fine tunes can be cancelled and resubmitted. Predictions that are not cancelled will eventually process as we work through the backlog. We will provide further information as details become available. Prediction serving degraded https://replicatestatus.com/incident/435409 Thu, 26 Sep 2024 23:39:00 -0000 https://replicatestatus.com/incident/435409#9cbf0c98546cad26cf2a45e53382c34bf43263f09bc74a50a33c6b30d1962a96 We are seeing large queues for Flux fine tunes, resulting in longer than anticipated inference times. We are investigating the source and will provide updates. This only affects Flux fine tunes. Website availability problems https://replicatestatus.com/incident/435245 Thu, 26 Sep 2024 17:15:00 -0000 https://replicatestatus.com/incident/435245#9725a9fed25901a0dcff780b0e168bdbcaab607430887010cafe87783ce8b515 Queues for predictions remain fairly high for [black-forest-labs/flux-schnell](https://replicate.com/black-forest-labs/flux-schnell) and [meta/meta-llama-3.1-405b-instruct](https://replicate.com/meta/meta-llama-3.1-405b-instruct). All other models should be behaving normally. 
Website availability problems https://replicatestatus.com/incident/435245 Thu, 26 Sep 2024 16:28:00 -0000 https://replicatestatus.com/incident/435245#a7e0de0232c429b7ead34a191f477772d13b17f3df6cedf0a25fdf3c252f955b We performed a rollback shortly after identifying the problem and we are seeing most services recovering, however much more slowly than expected. We are also investigating a potential problem with one of our caching layers. 
Website availability problems https://replicatestatus.com/incident/435245 Thu, 26 Sep 2024 15:48:00 -0000 https://replicatestatus.com/incident/435245#ed2497e2385b76f763f0d962c17fa4730c474481744ca62433f398b37223b817 We are investigating alerts for website availability replicate.com database maintenance https://replicatestatus.com/incident/433623 Tue, 24 Sep 2024 15:19:47 +0000 https://replicatestatus.com/incident/433623#0da04cb5bff5b329d76df86f18993b21609242a17825e6bdf1333503e91d333a Maintenance completed replicate.com database maintenance https://replicatestatus.com/incident/433623 Tue, 24 Sep 2024 15:19:47 -0000 https://replicatestatus.com/incident/433623#385da0fee768d813489a3ca5f981e0c458d33d3236ca4eb2430432d8977a6711 We're performing some database maintenance on the replicate.com website to upgrade the cluster version. This may cause errors or slow responses for a few minutes, but we expect the total impact to be no more than 5 minutes.
replicate.com database maintenance https://replicatestatus.com/incident/433623 Tue, 24 Sep 2024 15:00:36 -0000 https://replicatestatus.com/incident/433623#385da0fee768d813489a3ca5f981e0c458d33d3236ca4eb2430432d8977a6711 We're performing some database maintenance on the replicate.com website to upgrade the cluster version. This may cause errors or slow responses for a few minutes, but we expect the total impact to be no more than 5 minutes.
replicate.com database maintenance https://replicatestatus.com/incident/430088 Tue, 17 Sep 2024 13:25:43 +0000 https://replicatestatus.com/incident/430088#5251b4815778790335cbcae0ec8c51379f9b1e2bad3a6a41f839339d77270a01 Maintenance completed replicate.com database maintenance https://replicatestatus.com/incident/430088 Tue, 17 Sep 2024 13:25:43 -0000 https://replicatestatus.com/incident/430088#188b6cfd22ef0ee6c1ab82b097d0bd9ffd86bd72bb7467cc0ef4694b9aa85e78 We're performing some database maintenance on the replicate.com website. This may cause errors or slow responses for a few minutes. We expect the total impact to be no more than 5 minutes and will update this page if anything goes wrong. replicate.com database maintenance https://replicatestatus.com/incident/430088 Tue, 17 Sep 2024 12:30:46 -0000 https://replicatestatus.com/incident/430088#188b6cfd22ef0ee6c1ab82b097d0bd9ffd86bd72bb7467cc0ef4694b9aa85e78 We're performing some database maintenance on the replicate.com website. This may cause errors or slow responses for a few minutes.
We expect the total impact to be no more than 5 minutes and will update this page if anything goes wrong. Website unavailable https://replicatestatus.com/incident/428744 Fri, 13 Sep 2024 15:18:00 -0000 https://replicatestatus.com/incident/428744#e8712539b04ebdd512c6dd6d88b7f280598aee7315023ac84422845d7551364b Things have been running normally for at least the last 10 minutes. This incident was -- ironically -- triggered by work we're doing to improve the overall performance and reliability of our primary database. We apologise for the disruption. Website unavailable https://replicatestatus.com/incident/428744 Fri, 13 Sep 2024 14:55:00 -0000 https://replicatestatus.com/incident/428744#8ccb56e931049ed7c46206b3e829a210614b728f5ad28700205ded9fe0315ba7 We're seeing things start to return to normal now, and will continue monitoring the system. Website unavailable https://replicatestatus.com/incident/428744 Fri, 13 Sep 2024 14:44:00 -0000 https://replicatestatus.com/incident/428744#5edddc9d7516cf39d63eb16ea8b19899b5bb977fac3263bb41201dfc13dfc11b We've identified what we think is the problem (we made an important database query far too slow) and are rolling out a fix for the issue.
Website unavailable https://replicatestatus.com/incident/428744 Fri, 13 Sep 2024 14:37:00 -0000 https://replicatestatus.com/incident/428744#75324fa43e3c94f135d4aa7828d510f676b07fa79ca7ce19c85ddcb84802bba3 We're looking into problems with the replicate.com website right now. We'll provide an update as soon as we know what's happening. Prediction Service Normal https://replicatestatus.com/incident/422680 Sun, 01 Sep 2024 19:41:00 -0000 https://replicatestatus.com/incident/422680#e34547e128ad4f01766eccf46625c860d0420d7c206ec6cdbdd68f3a819b40ae We were alerted to a potential issue with prediction serving. Upon investigation, we found that one of the providers we use for monitoring is experiencing an outage that is impacting some of our automated monitoring. We've taken steps to isolate the problematic monitors while our provider works to resolve the issue. Predictions not running on A40s https://replicatestatus.com/incident/420494 Wed, 28 Aug 2024 06:11:00 -0000 https://replicatestatus.com/incident/420494#810678ad49b39a1b5da4f952a535713f3efe6fcc1fc664e1cab6937afa4934a7 A40 workloads are running again. We're continuing to monitor and investigate the underlying cause.
Predictions not running on A40s https://replicatestatus.com/incident/420494 Wed, 28 Aug 2024 05:53:00 -0000 https://replicatestatus.com/incident/420494#50b0445851ad3d195d54105c4f1fd71b37c79837a6de45839fc385f2124abab2 Workloads running on A40s are unavailable. We know the cause and are working to resolve it. Streaming service degraded for A100s https://replicatestatus.com/incident/416996 Wed, 21 Aug 2024 11:04:00 -0000 https://replicatestatus.com/incident/416996#3029922e41c352f2334ef32a07ff597bdb8bd6d12041445527fbc1d3005f4ea5 We believe these problems have now been resolved. Please contact us if you are still seeing issues with streaming from Europe. Streaming service degraded for A100s https://replicatestatus.com/incident/416996 Wed, 21 Aug 2024 09:59:00 -0000 https://replicatestatus.com/incident/416996#96f1e07f181744faacdd782cf8702dcb15cc4eea8fcd662a6828d5abc5578075 After some more digging, we've established that this is likely only affecting customers in Europe, and are looking into whether there are routing problems we are able to address. Streaming service degraded for A100s https://replicatestatus.com/incident/416996 Wed, 21 Aug 2024 09:46:00 -0000 https://replicatestatus.com/incident/416996#fd74ba37afce97b2f26cd8648818d6f07eafcb325aa0b2b7adea446c0252973d We're aware of a degradation in service affecting models that use streaming on A100s. This will look like slow streaming or failures to connect to the streaming service. We're digging into the issue now and will post an update when we have one. A40s degraded https://replicatestatus.com/incident/411439 Fri, 09 Aug 2024 15:58:00 -0000 https://replicatestatus.com/incident/411439#d4016ada59e56891481ed4f30fa04707afccd38aff4f0a1a01bd6be000e4fa72 A40 behavior has been stable for some time now. All systems are green. 
A40s degraded https://replicatestatus.com/incident/411439 Fri, 09 Aug 2024 15:22:00 -0000 https://replicatestatus.com/incident/411439#74268daeb66a37319eeb4556cf8a16d53daecb93ee3f112ba1acea26e0ee3b3e Models that use A40 hardware are experiencing degraded performance. We identified a problem with our networking configuration and have deployed a fix. We're continuing to monitor. Llama3-70b-chat Delays https://replicatestatus.com/incident/403745 Thu, 25 Jul 2024 23:44:00 -0000 https://replicatestatus.com/incident/403745#0fe25595b0393efe30b5a03d44e8cde8563dd11b51ef2cf0213f40b82f9f84dd This has been resolved and predictions should be handled normally. Llama3-70b-chat Delays https://replicatestatus.com/incident/403745 Thu, 25 Jul 2024 22:06:00 -0000 https://replicatestatus.com/incident/403745#8d548adabc6801300613791fa5b05ef74a146ce82c3f596670c7732ee2e2fa9b We have significantly increased capacity for the Llama3-70b-chat model. All new predictions should be served in expected time frames. We will continue to handle our backlog of predictions before the load spike. We will monitor to ensure there are no further processing spikes in prediction handling. Llama3-70b-chat Delays https://replicatestatus.com/incident/403745 Thu, 25 Jul 2024 21:48:00 -0000 https://replicatestatus.com/incident/403745#5ad4922ce08ba04347e53b1bc939af460963eeb2509c0aa5337e3d11d9c0a343 We have identified a delay in processing predictions for llama3-70b-chat. We are working on expanding capacity to handle the increased load. This only impacts the official llama3-70b-chat model. Predictions on trained versions not starting https://replicatestatus.com/incident/399582 Wed, 17 Jul 2024 16:36:00 -0000 https://replicatestatus.com/incident/399582#2dcc0ef584da6ac47eeb589c8671ae90bc9a9a06e7e6bcbc9eba5d7a5d20d119 We've fixed the issue and predictions on trained versions are running again.
Predictions on trained versions not starting https://replicatestatus.com/incident/399582 Wed, 17 Jul 2024 16:21:00 -0000 https://replicatestatus.com/incident/399582#ebf724e47469d913a72d3e069b68bca261fb881623b548691e87c02a0e45ffa1 We're aware of an issue preventing predictions running on trained versions, and we're working on a fix. Intermittent issues affecting some hardware types https://replicatestatus.com/incident/399075 Tue, 16 Jul 2024 20:16:00 -0000 https://replicatestatus.com/incident/399075#1ed75821a49f23bc46ce5703fd7fc9c045a851055604db7942529a1fe1d1f445 Things are running normally as of about 15 minutes ago. Intermittent issues affecting some hardware types https://replicatestatus.com/incident/399075 Tue, 16 Jul 2024 19:55:00 -0000 https://replicatestatus.com/incident/399075#90cf1e6cd178b5edf0bffe682cb1ede189ef8993bbbf44ce9ed2bbdf93166a5f We know what the problem is and will be deploying fixes within the next few minutes. Intermittent issues affecting some hardware types https://replicatestatus.com/incident/399075 Tue, 16 Jul 2024 19:31:00 -0000 https://replicatestatus.com/incident/399075#eb6c923bc6a2ae5cd8ecfa515d80c4f5ba57415382919c5c2701922a3766aff2 We're aware of intermittent issues that may be causing slow or failed predictions for some hardware types. We're investigating. API degradation https://replicatestatus.com/incident/395660 Tue, 09 Jul 2024 12:15:00 -0000 https://replicatestatus.com/incident/395660#c048c6c40e2e021539186565cf72ae3a9e7a135daa16c6634e161d0f0b65ef5b Service has been restored. Thanks for your patience! API degradation https://replicatestatus.com/incident/395660 Tue, 09 Jul 2024 12:10:00 -0000 https://replicatestatus.com/incident/395660#5c6f24868158921e216d4f9a05e8e9029199707d7e22df1dff9ca99c62752397 We believe that we're back up and running. Thank you for bearing with us. A hardware failure left some stateful services unhappy and manual intervention was needed to bring things back online. 
We'll continue monitoring for the next few minutes. API degradation https://replicatestatus.com/incident/395660 Tue, 09 Jul 2024 11:59:00 -0000 https://replicatestatus.com/incident/395660#8533534e91c9303bf2e3e1ee980994c6bd78795934619d3d18df5c9680961564 We're aware that our API is serving some slow responses and prediction updates are delayed. We know what's causing this and are working on fixing the problem. Llama 3 70b instruct model not processing predictions https://replicatestatus.com/incident/393015 Wed, 03 Jul 2024 11:11:00 -0000 https://replicatestatus.com/incident/393015#f1fbfbaf76b5a92b3f527651b989821e053fd345473d1c0686de5681d9349f37 The model is processing predictions properly again, and the queue is empty. Llama 3 70b instruct model not processing predictions https://replicatestatus.com/incident/393015 Wed, 03 Jul 2024 10:23:00 -0000 https://replicatestatus.com/incident/393015#1ce3bc82c3d70e409723963666cf3fb5c125769331fe8b2f013f900f61ed88ce We're working to fix an issue that's preventing the llama 3 70b instruct model from processing predictions. Some models unavailable https://replicatestatus.com/incident/387590 Fri, 21 Jun 2024 15:40:00 -0000 https://replicatestatus.com/incident/387590#c4b12b2726256ada626b4aa696dd69a990832d52e2420cf2d3e04a6ba132f49f Service has been restored as of a few minutes ago. Some models unavailable https://replicatestatus.com/incident/387590 Fri, 21 Jun 2024 15:15:00 -0000 https://replicatestatus.com/incident/387590#841a06b11c65355233df7a337def972f41bb069c69b7ee19f803e87ff5fbaec9 We're aware of an issue with an upstream provider that means a small number of language models (Arctic, Llama 70B Instruct) are not currently available. We're working on mitigating this issue as soon as we can.
Errors publishing model versions https://replicatestatus.com/incident/387221 Thu, 20 Jun 2024 22:41:00 -0000 https://replicatestatus.com/incident/387221#d0eb6879c22d819dad752586ca392306103532d2ec3ddebb1b4ca6eb0afdc5ee Model version publishing is now working as expected. Errors publishing model versions https://replicatestatus.com/incident/387221 Thu, 20 Jun 2024 22:12:00 -0000 https://replicatestatus.com/incident/387221#f7f77ef5d0e507a676b1dc26b8977fb93933e9f45ed102bd12cb87ce9de53fea Due to an outage with an upstream provider, errors are occurring when publishing new model versions. We are monitoring and will provide updates as information becomes available. Inference and general platform usage are not currently affected. Errors with inference https://replicatestatus.com/incident/378983 Tue, 04 Jun 2024 00:36:00 -0000 https://replicatestatus.com/incident/378983#1179f61736fe87a7ed75707282d822a9fe2cdae8c13bad2df8c0f0cac017265d The issue with inference was limited to select LLM models. The problematic code has been rolled back and all inference should now be operating normally. Errors with inference https://replicatestatus.com/incident/378983 Tue, 04 Jun 2024 00:30:00 -0000 https://replicatestatus.com/incident/378983#ece07c053334ae5325930dcebc0450a74aafca6e4a8728da6f12071c1fdc32da We are aware of a series of errors occurring when running predictions on the Replicate platform. We have started a rollback of the affected code. Problems booting models in one region https://replicatestatus.com/incident/377073 Thu, 30 May 2024 19:03:00 -0000 https://replicatestatus.com/incident/377073#ec512bdd5ef080460901fb7d5156033f924f3f35f846cef5ab80aaedce23db65 All outstanding issues have been resolved. Model boots and setups should be functioning normally again.
Problems booting models in one region https://replicatestatus.com/incident/377073 Thu, 30 May 2024 18:08:00 -0000 https://replicatestatus.com/incident/377073#ae5da441b4d7fbf76d522f572760efba1d4ebdfd92859bb42823e65b46fd024f Models should be booting correctly once more. We'll be making a few more tweaks to improve things and will provide an update here again when we're fully back to normal. Problems booting models in one region https://replicatestatus.com/incident/377073 Thu, 30 May 2024 17:34:00 -0000 https://replicatestatus.com/incident/377073#62a8dfb7a5bd1ab3542c73c98c5e16d80353eb11282a164e848d563e4c3d67fb We're aware of issues causing model setup and weights downloads to fail in one of our clusters. We know what the problem is and are working on mitigation now. Workloads in one of our clusters are backed up https://replicatestatus.com/incident/376446 Wed, 29 May 2024 16:06:00 -0000 https://replicatestatus.com/incident/376446#bbb9ae8d5e4d225110b932d20ca380d239095baf18d4c5057202a30e26aefff1 All queues have been dealt with, and predictions and trainings are running smoothly once again. Workloads in one of our clusters are backed up https://replicatestatus.com/incident/376446 Wed, 29 May 2024 15:25:00 -0000 https://replicatestatus.com/incident/376446#b9bc8d0cff1402d32699aa62d0e148353aff4a95ab093c116755fababd348efb Predictions and trainings are running again. There are still some substantial queues, so it will take a while for the autoscaler to get everything processed. We'll monitor it until it's fully recovered. Workloads in one of our clusters are backed up https://replicatestatus.com/incident/376446 Wed, 29 May 2024 15:22:00 -0000 https://replicatestatus.com/incident/376446#4c161a5bc087e1fb06ef1fc3d4ad608c5997866bf9f3b773ee593f293dfed2ce The majority of predictions and trainings are failing to start in one of our clusters. All A40 workloads and most A100 workloads are affected. The upstream provider is investigating the issue. 
Workloads in one of our clusters are backed up https://replicatestatus.com/incident/376446 Wed, 29 May 2024 15:17:00 -0000 https://replicatestatus.com/incident/376446#6a6d3ff859a4ee870646300430ad23de043201dd3446ed01dea83cc6411e7ea0 We're investigating an issue with predictions and trainings in one of our clusters, due to an incident with one of our providers. Workloads running on A40s and A100s are affected. Degraded autoscaling performance https://replicatestatus.com/incident/373095 Wed, 22 May 2024 14:05:00 -0000 https://replicatestatus.com/incident/373095#607bd4f2f8edb56388e6a24ae5288e1df6d44f96ec82c4ae3e12379ccd2345f9 Backlogs have been cleared and all models are now running smoothly. Degraded autoscaling performance https://replicatestatus.com/incident/373095 Wed, 22 May 2024 12:07:00 -0000 https://replicatestatus.com/incident/373095#0915e20a586e4cfa2eb6d9f748cf06149a0a07dfccbf9a9b305d450bb5db1985 The original issue has been resolved, but we will have elevated contention for a while as workloads that built up during the outage are processed. Degraded autoscaling performance https://replicatestatus.com/incident/373095 Wed, 22 May 2024 11:05:00 -0000 https://replicatestatus.com/incident/373095#f4fd326f75ddb4e35da398354e276f74154b8f8a1983dec88c24303ade77e6f6 Models that run on A40 or A100 hardware are currently unable to boot up or scale out. Furthermore existing instances are suffering significant degradation and not all predictions are completing successfully in a timely manner. We are actively monitoring the system and working with upstream providers to resolve the issue. Degraded autoscaling performance https://replicatestatus.com/incident/373095 Wed, 22 May 2024 10:40:00 -0000 https://replicatestatus.com/incident/373095#d4c4cc892d544cc7f99848f70693c1069e6f4a139d5955bac3eb597ca1601500 Models that run on A40 or A100 hardware are currently unable to boot up or scale out. Instances that are already running will continue to process predictions as normal. 
We are actively monitoring the system and working with upstream providers to resolve the issue. 5XX and slow responses https://replicatestatus.com/incident/370561 Fri, 17 May 2024 00:09:00 -0000 https://replicatestatus.com/incident/370561#52ff0f65f4ab6ad27847fb5c49ba78cc5e3bc096a64d3ab5d25973dd1d117c33 The source of the problem appears to have been that our API was unable to connect to one of its underlying data stores, most likely due to a networking interruption. This has recovered as of 00:02 UTC and traffic is being served normally once again. We will continue to monitor. 5XX and slow responses https://replicatestatus.com/incident/370561 Fri, 17 May 2024 00:00:00 -0000 https://replicatestatus.com/incident/370561#828957966f0cc009a191ce33d0f1016d67c834e79285b97ed291f891503b0b36 We're aware that our API is currently serving slow responses and 5XX-series responses to some traffic and are investigating. Webhooks not sending for Dreambooth trainings https://replicatestatus.com/incident/367167 Thu, 09 May 2024 19:11:00 -0000 https://replicatestatus.com/incident/367167#f50f1665205186a6b4db27103ab116e9bbb4b1c6afae9dd3f08a4f8d25409915 Webhooks for Dreambooth trainings are working again. Webhooks not sending for Dreambooth trainings https://replicatestatus.com/incident/367167 Thu, 09 May 2024 18:19:00 -0000 https://replicatestatus.com/incident/367167#02458053d094cceaacd6b1811922467dbc91bdb99643f49302e50900b2ec042b Webhooks aren't being sent for trainings made against the Dreambooth API. We know the cause and are working on a fix. Degraded Service https://replicatestatus.com/incident/354366 Fri, 12 Apr 2024 05:20:00 -0000 https://replicatestatus.com/incident/354366#5f4b84f1d0bac78446887dc7c8b910fd14c505b5c3e3809e5e2ea3fd92fbc4e3 At this time service has been restored. All inference (prediction serving) and model instance starts have returned to normal. 
Degraded Service https://replicatestatus.com/incident/354366 Fri, 12 Apr 2024 04:26:00 -0000 https://replicatestatus.com/incident/354366#ca470d61e690c7aaa618a29325c3706417f9949fb1b581e90095aa90ba3fc6cd This incident is causing disruption of prediction serving and model instance start. The root cause has been identified as being related to a maintenance on our provider's networking core. We are awaiting service restoration with our provider. This incident impacts A40, A100-80, and a subset of the A100-40G hardware types. Degraded Service https://replicatestatus.com/incident/354366 Fri, 12 Apr 2024 04:15:00 -0000 https://replicatestatus.com/incident/354366#b4b25073b7d7c9a60f52269617fd614cf88303cd151c78ac926c83e46b481773 We are aware of an issue within one of our providers that is causing degraded performance with regards to prediction serving. We are monitoring the situation and will provide updates as information becomes available. Degraded service https://replicatestatus.com/incident/354087 Thu, 11 Apr 2024 14:05:00 -0000 https://replicatestatus.com/incident/354087#b0706aa236824cd279080f019e56399d61c7d4d9f1f2cf36da97e934d26d360c Our systems indicate that the problem has been resolved. We will continue monitoring the situation. Degraded service https://replicatestatus.com/incident/354087 Thu, 11 Apr 2024 13:50:00 -0000 https://replicatestatus.com/incident/354087#598913a7904e2a3350356ff4818378a38506e67dd016fd73eae0415ea0a2e9e6 An upstream provider failure is affecting predictions on a subset of models. We're working with the provider now to get things back online. Degraded service https://replicatestatus.com/incident/354087 Thu, 11 Apr 2024 13:43:00 -0000 https://replicatestatus.com/incident/354087#27f5601daee75b07eb7a1cd0b50f9cc839a5528a8ab5203cc729b86c3f26784e We're aware of degraded service for prediction execution. We're investigating the problem now and will update this incident shortly. 
A40 models scaling slowly https://replicatestatus.com/incident/340565 Thu, 14 Mar 2024 09:19:00 -0000 https://replicatestatus.com/incident/340565#c9d85c799e51263c31f137d1510615c1c5f8d0d1710e95f802c899a199e49158 All but a very small slice of our A40 hardware is back online, and Replicate workloads are processing normally. We again thank you for your patience. A40 models scaling slowly https://replicatestatus.com/incident/340565 Thu, 14 Mar 2024 07:33:00 -0000 https://replicatestatus.com/incident/340565#0775da052316495e3928e046b11b5752f36825e7e386a7c05f3dbae944154567 We're still working with our provider to get the remaining A40 hardware back online. Meanwhile almost all A40 workloads are running correctly on Replicate. We'll provide an update when we're back to 100% service levels. Thank you for your patience. A40 models scaling slowly https://replicatestatus.com/incident/340565 Thu, 14 Mar 2024 06:30:00 -0000 https://replicatestatus.com/incident/340565#856ca1601cd69cdcfffa9ca95f4c3ba8c7c2df0ba618f37986d7ed87b2f03337 While most A40 hardware is scheduling, we are continuing to see some scaling delays for some models. We're working with the upstream provider to resolve the residual problems. Thank you for your patience. A40 models scaling slowly https://replicatestatus.com/incident/340565 Thu, 14 Mar 2024 05:12:00 -0000 https://replicatestatus.com/incident/340565#3230650e4308007927b4b5dfafe932d8f8736accda5c41b118f11dfe84c2bd07 Models running on A40 hardware are starting to recover. We are monitoring the situation. Replicate systems will automatically process any backlog of work. All other hardware types remain fully functional. A40 models scaling slowly https://replicatestatus.com/incident/340565 Thu, 14 Mar 2024 03:11:00 -0000 https://replicatestatus.com/incident/340565#5518ff1db9b8d770012889202ea7ca5d522b7e639dd195d6c1dfa4c0b6253a98 Our engineers have confirmed the issue is isolated to the A40 hardware type.
We are working with an upstream hardware provider to restore service. A40 models scaling slowly https://replicatestatus.com/incident/340565 Thu, 14 Mar 2024 02:52:00 -0000 https://replicatestatus.com/incident/340565#ac549e890d8ad4a50c7fae656c592d41dc6f8460b52240f7f6a7b4e76c90dc20 Models running on A40 hardware are currently scaling slowly, leading to delays in handling predictions. We are working to identify what's happening here, and will give an update as soon as we know more. Errors within one region https://replicatestatus.com/incident/336843 Wed, 06 Mar 2024 04:06:00 -0000 https://replicatestatus.com/incident/336843#b528171e81503fa24f9510d0671d7d4e16ca643f590288ad710fc6f63d3a2e74 Workloads across all regions are now running normally. We apologise for the disruption, and will be working to improve our ability to shift load between providers in situations like this one. Errors within one region https://replicatestatus.com/incident/336843 Wed, 06 Mar 2024 03:05:00 -0000 https://replicatestatus.com/incident/336843#b749ea53058d29430484314a6385c4172f8372eacef224b1c8cbeb410f6a3b4c Things remain in a degraded state but work is starting to flow again. We will continue monitoring and update when the service is fully recovered.
Errors within one region https://replicatestatus.com/incident/336843 Wed, 06 Mar 2024 02:33:00 -0000 https://replicatestatus.com/incident/336843#0b5ca2f5846859b3d815f3ae04d53be7f0f265be4d9389b2dde3047ee6a6ce11 We're continuing to work with our provider, as one of our regions is currently unable to handle traffic. Workloads running on A40 and A100 (80GB) hardware are particularly affected. Errors within one region https://replicatestatus.com/incident/336843 Wed, 06 Mar 2024 01:58:00 -0000 https://replicatestatus.com/incident/336843#0668052c411ec113511def619e26de5dd8a2f66f46bae9ed8eeb5641416c3220 The incident involves network services within one of our providers. As the situation evolves we'll provide further updates. We apologize for the inconvenience and thank you for your patience during this time. Errors within one region https://replicatestatus.com/incident/336843 Wed, 06 Mar 2024 01:45:00 -0000 https://replicatestatus.com/incident/336843#dbb2b2b9053ec4467a38dee61f389cdffc9a80540ad835be2f53b648b3c77da0 One of our regions is seeing elevated error rates for inference and training. We are working with our provider to determine the root cause and remediate the issue. This impacts A40, A100-80G, and a subset of A100-40G hardware types and can impact some language models (token-based pricing).
API, Inference, Web Page https://replicatestatus.com/incident/336368 Tue, 05 Mar 2024 06:54:00 -0000 https://replicatestatus.com/incident/336368#8b02b8587f44f8972fc69d175110572f78f4f84afe60c35cd3347ca0cdded27e We identified excessive load on our database. Shortly after the root cause was isolated, our engineering team disabled the problematic code paths. Degradation of database responsiveness resulted in the general service outage beginning at approximately 6:23UTC. At this time API, Inference, and Web are fully functional. Any predictions that resulted in errors or were marked as failed during this window can be safely re-run.
API, Inference, Web Page https://replicatestatus.com/incident/336368 Tue, 05 Mar 2024 06:50:00 -0000 https://replicatestatus.com/incident/336368#526ba8961e9d2fac46ca9a1d3066803095417708a112fdf94f0602bd88ae0a89 We are aware of an incident that is impacting all aspects of the Replicate service. Engineers are working on remediating the issue to bring everything back online as quickly as possible. Models Affected by Hugging Face Hub Outage https://replicatestatus.com/incident/333423 Thu, 29 Feb 2024 01:57:00 -0000 https://replicatestatus.com/incident/333423#5ecc222dd40af95357da9b9acbba5acbeec0c49df3dc2cca8b23d1b711001161 We are seeing HF Hub return to full functionality. The backlog of models blocked on interacting (downloading or otherwise) with HuggingFace has recovered. Models Affected by Hugging Face Hub Outage https://replicatestatus.com/incident/333423 Wed, 28 Feb 2024 23:52:00 -0000 https://replicatestatus.com/incident/333423#237ce1261ab9903401fd8f76e927f77e953fda5ffe52243273d77875f33a6a12 We have seen a significant increase in errors when starting models that rely on the HuggingFace Hub. We are monitoring the situation and sending hugsops their way.
The Replicate team will work to ensure that once HF Hub is available, the models can be successfully unblocked and started. Create prediction/training API unavailable https://replicatestatus.com/incident/332135 Mon, 26 Feb 2024 13:13:00 -0000 https://replicatestatus.com/incident/332135#e9ba460d94992d9aa93357226b4802b5f43442bb2b37cbfc8cf67fbd43f1182c From approximately 13:05 to 13:11 UTC, our prediction and training creation endpoints were unavailable. Existing predictions and trainings were unaffected, but no new predictions or trainings could be created. The problem has since been resolved. Maintenance within one region https://replicatestatus.com/incident/329654 Fri, 23 Feb 2024 05:01:47 +0000 https://replicatestatus.com/incident/329654#164150ea27639c226043d499b26d982ec25e90b4495ebf24382a2521c5a5efae Maintenance completed Maintenance within one region https://replicatestatus.com/incident/329654 Fri, 23 Feb 2024 05:01:47 -0000 https://replicatestatus.com/incident/329654#c61e41b5770fdb83e2d98da6c64d7715eb45c4fc9b854e916ae3bc98c2ab4d18 We are working with one of our providers to implement mitigations based upon a series of recent outages. These mitigations will ensure that our accelerated data delivery and other associated services will be more resilient. During this maintenance we expect some slower-than-normal model boot times impacting the following hardware types: A40 and A100-80G. Additionally, a subset of A100-40G hardware types will be affected.
The slower-than-normal boot times should not be widespread. We will be making adjustments to the internal Content Delivery Systems to ensure retries (via a fallback-to-origin mechanism) are served normally. This maintenance notification will be updated if the scheduled time or other details change. Maintenance within one region https://replicatestatus.com/incident/329654 Fri, 23 Feb 2024 04:38:08 -0000 https://replicatestatus.com/incident/329654#c61e41b5770fdb83e2d98da6c64d7715eb45c4fc9b854e916ae3bc98c2ab4d18 We are working with one of our providers to implement mitigations based upon a series of recent outages. These mitigations will ensure that our accelerated data delivery and other associated services will be more resilient.
During this maintenance we expect some slower-than-normal model boot times impacting the following hardware types: A40 and A100-80G. Additionally, a subset of A100-40G hardware types will be affected. The slower-than-normal boot times should not be widespread. We will be making adjustments to the internal Content Delivery Systems to ensure retries (via a fallback-to-origin mechanism) are served normally. This maintenance notification will be updated if the scheduled time or other details change. Maintenance within one region https://replicatestatus.com/incident/329654 Fri, 23 Feb 2024 04:00:43 -0000 https://replicatestatus.com/incident/329654#c61e41b5770fdb83e2d98da6c64d7715eb45c4fc9b854e916ae3bc98c2ab4d18 We are working with one of our providers to implement mitigations based upon a series of recent outages. These mitigations will ensure that our accelerated data delivery and other associated services will be more resilient.
During this maintenance we expect some slower-than-normal model boot times impacting the following hardware types: A40 and A100-80G. Additionally, a subset of A100-40G hardware types will be affected. The slower-than-normal boot times should not be widespread. We will be making adjustments to the internal Content Delivery Systems to ensure retries (via a fallback-to-origin mechanism) are served normally. This maintenance notification will be updated if the scheduled time or other details change. Maintenance within one region https://replicatestatus.com/incident/329654 Fri, 23 Feb 2024 04:00:00 -0000 https://replicatestatus.com/incident/329654#c61e41b5770fdb83e2d98da6c64d7715eb45c4fc9b854e916ae3bc98c2ab4d18 We are working with one of our providers to implement mitigations based upon a series of recent outages. These mitigations will ensure that our accelerated data delivery and other associated services will be more resilient.
During this maintenance we expect some slower-than-normal model boot times impacting the following hardware types: A40 and A100-80G. Additionally, a subset of A100-40G hardware types will be affected. The slower-than-normal boot times should not be widespread. We will be making adjustments to the internal Content Delivery Systems to ensure retries (via a fallback-to-origin mechanism) are served normally. This maintenance notification will be updated if the scheduled time or other details change. Dropped predictions https://replicatestatus.com/incident/330603 Thu, 22 Feb 2024 17:46:00 -0000 https://replicatestatus.com/incident/330603#56f38c914231685ad8b73b13b0ef21f40763c9212fdd20bd391648e29777fcea This issue has been resolved and predictions are now flowing again.
Dropped predictions https://replicatestatus.com/incident/330603 Thu, 22 Feb 2024 17:43:00 -0000 https://replicatestatus.com/incident/330603#ef37da9d6ad02ce06ae51069723bce7a84db1e3f40f3a764bf02e7b04d7d4e24 We're aware of a problem where a small proportion of predictions are being dropped and will not complete. The fix is rolling out now. Dropped predictions will be marked as failed and will not be billed. You can safely retry these predictions. API errors https://replicatestatus.com/incident/330408 Thu, 22 Feb 2024 09:53:00 -0000 https://replicatestatus.com/incident/330408#ccb67c4d8271ee93042c52501f4cde1722f0dc24795a8c96c868ddea025bb913 The problems resolved automatically at 09:41 UTC. We are monitoring the situation. API errors https://replicatestatus.com/incident/330408 Thu, 22 Feb 2024 09:42:00 -0000 https://replicatestatus.com/incident/330408#61b6a95f03547c246c153896a6cdff286baeb809426a393b93552198a51cf6fe We're aware that for about 5 minutes from 09:36 to 09:41 UTC, the Replicate API returned HTTP 500 errors for many calls, including particularly for prediction creates. It appears that an essential database failed or was unreachable, and failover mechanisms don't appear to have worked correctly. We are looking into what happened. All workloads are currently functioning correctly. Models stuck booting https://replicatestatus.com/incident/329051 Mon, 19 Feb 2024 16:27:00 -0000 https://replicatestatus.com/incident/329051#97250529746175e490dccdea809a886c49e6470bc9a5c32291dba663b8cfafc9 The models stuck in booting have been fixed, and inference and trainings are both running normally again. Models stuck booting https://replicatestatus.com/incident/329051 Mon, 19 Feb 2024 16:16:00 -0000 https://replicatestatus.com/incident/329051#de8f91a8efca99c94bea0fc11d81058134e882f34d7a87e768a8936ba2d8b12e We're starting to see recovery of this issue, and will continue monitoring until things are back to normal. 
Models stuck booting https://replicatestatus.com/incident/329051 Mon, 19 Feb 2024 15:55:00 -0000 https://replicatestatus.com/incident/329051#65651d439ae970d95466fa621a0d3f147a9dc95c25494591f39e88088c7bc134 We're aware that some models are currently stuck booting. We know what the issue is and are in the process of fixing it. We expect normal service will resume in a few minutes. Model startup Errors / Runtime Download Errors https://replicatestatus.com/incident/327803 Fri, 16 Feb 2024 23:52:00 -0000 https://replicatestatus.com/incident/327803#b8a3aa60f863b21a592bdc37371308bf6902173b0ef03375065ce7e9e8ab07e7 Well, lots of things went wrong today. We've identified what we think are the last few things that were broken and fixed them. A newly rolled out internal queueing service didn't allow traffic from model pods, and that caused our prediction throughput to be far lower than normal, which was impeding recovery. For an incident of this magnitude we fully understand that many of our customers will want to know what happened. We're starting to piece it together and we'll have a proper write-up for you next week.
If you have immediate concerns please feel free to reach out to our customer team and we'll make it right. Model startup Errors / Runtime Download Errors https://replicatestatus.com/incident/327803 Fri, 16 Feb 2024 21:29:00 -0000 https://replicatestatus.com/incident/327803#9e464d93abc48c13cd1a20214679cbac89499104cd5344fa48fd10464b3202b2 Demand and backlog remain high for GPUs in one of our regions. We have rebalanced traffic and are working with our providers to further increase available GPUs to get the backlog worked through. Model startup Errors / Runtime Download Errors https://replicatestatus.com/incident/327803 Fri, 16 Feb 2024 19:45:00 -0000 https://replicatestatus.com/incident/327803#6cb898c77d420fe59094b4de8799658439812195359e40a45a89b085185349f2 We continue to see high demand and slow scheduling within one of our providers. Additionally we have drastically increased our GPU count to address the continued backlog. Engineers are working to rebalance traffic between providers to accelerate recovery.
Model startup Errors / Runtime Download Errors https://replicatestatus.com/incident/327803 Fri, 16 Feb 2024 18:10:00 -0000 https://replicatestatus.com/incident/327803#24dd5780914325ab96b939c83e3bba343d7b72915c76ca6aea1e95b23d9685e2 Scheduling of backlogged A100 models continues to be slow in one of our regions. We are working through the queue backlog and the scaling of replicas. Engineers are closely monitoring and will provide an update once the backlog has been cleared. Model startup Errors / Runtime Download Errors https://replicatestatus.com/incident/327803 Fri, 16 Feb 2024 17:59:00 -0000 https://replicatestatus.com/incident/327803#5e8ef6c843608bd87acae9f10470e1846abf0e2355292fbe0b5ab33750e7ddc3 We are seeing recovery start and many models are starting. In addition we are looking into associated services (content-delivery-acceleration, etc) to ensure all services are returning to normal working order.
Most models are now fully started and the backlog is minimal. An additional update will be provided as soon as all services have been verified. Model startup Errors / Runtime Download Errors https://replicatestatus.com/incident/327803 Fri, 16 Feb 2024 17:48:00 -0000 https://replicatestatus.com/incident/327803#94f321684984c0fc41d804c4b9cd75cf7b474c9abfb7fe8658a14f9c85280fcd A problem has been identified within one of our regions that is preventing model startups and downloads at runtime (inference). We are working with our providers and within the region to correct the problem. Updates will be provided as they become available. This impacts workloads on A40, some A100-40G hardware types, and A100-80G hardware types. Errors downloading weights on model startup https://replicatestatus.com/incident/325064 Sat, 10 Feb 2024 23:16:00 -0000 https://replicatestatus.com/incident/325064#1f83fbfa1542c9abad0918b8e39c5e18d8900fc2f98019128c3885b92943c0bb We've not seen any failures after 22:50 UTC, so we're calling this incident resolved. Our investigation revealed that internal DNS lookup failures put a storage cache subsystem into a broken state. Next week we'll be looking into how to make our systems more robust in situations like this one. Thank you for your patience.
Errors downloading weights on model startup https://replicatestatus.com/incident/325064 Sat, 10 Feb 2024 22:44:00 -0000 https://replicatestatus.com/incident/325064#027096bf72bd0cd0606e56cf4896762d72e557e292abd724e870b733613d1e0b As far as we can tell things are looking a lot better. We're continuing to monitor the situation for the time being. Errors downloading weights on model startup https://replicatestatus.com/incident/325064 Sat, 10 Feb 2024 22:08:00 -0000 https://replicatestatus.com/incident/325064#695fc079627e72a15a009c0fefb0a03d9b24da4f0a6b6b99b07bbc55f1d15011 We have identified the cause of this issue and are rolling out a fix. Errors downloading weights on model startup https://replicatestatus.com/incident/325064 Sat, 10 Feb 2024 21:52:00 -0000 https://replicatestatus.com/incident/325064#a911b800386a324764ac5eb186de4a9936ff62c469802aa26f21a7af2254257d We are seeing elevated incidences of weights failing to download on model startup. Errors downloading weights on model startup https://replicatestatus.com/incident/324264 Thu, 08 Feb 2024 22:00:00 -0000 https://replicatestatus.com/incident/324264#3258982f9d3dfa81f7cf4f1c85ff31165ce549b5ba5988ff4c49ffcdc8910a0a Mitigations are in place and now models are again downloading weights as expected. All model startups within the affected region have returned to normal. Errors downloading weights on model startup https://replicatestatus.com/incident/324264 Thu, 08 Feb 2024 21:53:00 -0000 https://replicatestatus.com/incident/324264#11dba2a6d51a1a8850a5d95208d49ddd16b637d122154b655279397b415c974f We are seeing elevated incidences of weights failing to download on model startup within one of our regions. A problematic bit of code has been identified and a fix is in progress. Additionally we are putting in place immediate mitigations to help limit the overall impact while the code fix is being worked on. 
Trained versions failing setup https://replicatestatus.com/incident/323144 Tue, 06 Feb 2024 18:30:00 -0000 https://replicatestatus.com/incident/323144#2c2c1d33803aad0e63cc5208a417c9b7a5c87883ddc0f7c205939147e3e7505e We've resolved this issue and have re-enabled all affected versions. Trained versions failing setup https://replicatestatus.com/incident/323144 Tue, 06 Feb 2024 17:36:00 -0000 https://replicatestatus.com/incident/323144#cec9ab0964b8eceea7d043e0d8b5ed629b209f25730c36db047f291808a4997e Some trained versions are currently failing to set up, and are being disabled. We're working to resolve the underlying issue, and will then re-enable all affected versions. Delayed Model Start https://replicatestatus.com/incident/319344 Mon, 29 Jan 2024 18:31:00 -0000 https://replicatestatus.com/incident/319344#ca97aa9a0194db43ff88b938ea825739217b6594d4a86cb1c9801d59fc271cd7 The backlog of models has been cleared. Model start time is back to expected times. Delayed Model Start https://replicatestatus.com/incident/319344 Mon, 29 Jan 2024 16:55:00 -0000 https://replicatestatus.com/incident/319344#0050aaa77a6e92716d1e684dff389e4fe911b3cf8698e6e981c0e42bc33fc1ca A fix has been rolled out and we are working through the backlog of model starts. We will provide an update once the backlog has been completed.
Delayed Model Start https://replicatestatus.com/incident/319344 Mon, 29 Jan 2024 16:44:00 -0000 https://replicatestatus.com/incident/319344#98bb1e22b0db849e88daf661fac911accd741a819410e603b78e19ddefcb8d1c A revert of an identified broken deployment is being rolled out. We will provide updates as the fix progresses. Delayed Model Start https://replicatestatus.com/incident/319344 Mon, 29 Jan 2024 16:42:00 -0000 https://replicatestatus.com/incident/319344#571d609d840ef49668007a34a23a3551e2514c53bb5ee9353c63ce99e528893c We have identified an issue within one of our regions preventing models from starting. A fix is being worked on to remediate the issue. Errors completing predictions https://replicatestatus.com/incident/316307 Mon, 22 Jan 2024 19:34:00 -0000 https://replicatestatus.com/incident/316307#038e887d5ba83e619d31bc4605ce3710e5d7c21634666e5d7f14fe5fbf5eb8cc A fix has been rolled out to the majority of models and the error rate has returned to normal levels. We will continue to monitor to address any more occurrences of the errors. Predictions affected by this incident (many on T4 GPUs, CPU, and a subset of A100s) will appear to be stuck in the starting phase for an extended period of time. These predictions can safely be cancelled and reattempted.
Errors completing predictions https://replicatestatus.com/incident/316307 Mon, 22 Jan 2024 19:12:00 -0000 https://replicatestatus.com/incident/316307#a2a0245b0ef77781aecc522c8d2bfe9590121a54cea379f17c7e8d9889921cd0 We have identified an error due to a failed deployment. A fix rollout is in progress. Models are recovering the ability to complete predictions as the rollout progresses. Errors completing predictions https://replicatestatus.com/incident/316307 Mon, 22 Jan 2024 18:31:00 -0000 https://replicatestatus.com/incident/316307#54b35913cb6c9d1b18a3d9d7adfe766cbc5a31cbc7387e1147d53fa4699eeb9f We believe we have identified the source of the failures and are working on an update so that a fix can be rolled out. Errors completing predictions https://replicatestatus.com/incident/316307 Mon, 22 Jan 2024 18:22:00 -0000 https://replicatestatus.com/incident/316307#d50f6dbdc5ed9d663772a485a10af802b37d4f670d12ee38042bc7cb13f5b2d8 We are seeing errors occurring in one of our regions. We are currently investigating the errors and will provide updates as they are available. Erroneous Alerts of Website Down https://replicatestatus.com/incident/314190 Thu, 18 Jan 2024 00:37:00 -0000 https://replicatestatus.com/incident/314190#4fb067e4e325b7104cd9ac59f70d9d1b32563fa2d43e08825b2359ebb4e21268 There have been a number of automated reports the Replicate website has gone down/returned to service. We are investigating the automated systems but do not see any current outages outside of the automated tooling. Model Start times https://replicatestatus.com/incident/310437 Tue, 09 Jan 2024 17:23:00 -0000 https://replicatestatus.com/incident/310437#9df994771462df8f1664b030d9a0e271bc81cb0103150c5fe02e90a8cadd7da0 The incident is in process of clearing and model start times have returned to normal. 
Model Start times https://replicatestatus.com/incident/310437 Tue, 09 Jan 2024 17:10:00 -0000 https://replicatestatus.com/incident/310437#4a838a0855e0f3782012b953f4625d155f2428753e1a3ad9e3ee82a4a88bb972 We have identified an issue within one of our regions that is noticeably increasing model start times. Investigation is under way to identify and remediate the problem. Slow Model Startup Time https://replicatestatus.com/incident/308962 Fri, 05 Jan 2024 19:13:00 -0000 https://replicatestatus.com/incident/308962#acc5452270d1deee5604b744757da5247b8ea0e241f59cb1ea3a56d9e39ae33c At this time all models in the backlog have finished startup. We will continue to monitor the situation closely. Slow Model Startup Time https://replicatestatus.com/incident/308962 Fri, 05 Jan 2024 19:07:00 -0000 https://replicatestatus.com/incident/308962#eefe1fc8ae2d9d81d3e4142b351e52dd9eef975b5b7af6ad33bf76a6a556bfbc We identified an issue with one of our regions that was causing significant delays in model setup time. We are working with one of our providers to address the issue. At this time most models have successfully set up and the limited backlog is being worked through. Startup time should be normal outside of a very limited subset in the backlog. Boot time issues for Models https://replicatestatus.com/incident/307592 Tue, 02 Jan 2024 16:46:00 -0000 https://replicatestatus.com/incident/307592#d2f4499aaded1d6c963cc85f9a6ff67530079b90f469b733b36dd4ea1d75b040 We are aware of an event in one of our regions that resulted in extended boot times of many models. At this time the incident has resolved. We are actively researching the root cause and will work to build remediations to limit impact of future such events.
Intermittent Failures due to networking https://replicatestatus.com/incident/305176 Tue, 26 Dec 2023 18:58:00 -0000 https://replicatestatus.com/incident/305176#beeedb304e0fa465b1200149f00a11c656f0fc7b9a858adc9e5ed6bebf8e1b2b The error rate has subsided and models are back to their previous startup and runtime behavior. We are working with our providers to mitigate the impact of future incidents like this. Intermittent Failures due to networking https://replicatestatus.com/incident/305176 Tue, 26 Dec 2023 17:05:00 -0000 https://replicatestatus.com/incident/305176#c5a13ecfd70d68fc3a841eb36d390cc348b3c9db1a2a696b1a25b852183b9f0d We are seeing a reduction in error rate. The root cause is still under investigation. Intermittent Failures due to networking https://replicatestatus.com/incident/305176 Tue, 26 Dec 2023 15:39:00 -0000 https://replicatestatus.com/incident/305176#490782735ddcfb27b86b6b08ba6c8c9c0738905d8fd4b81675b8c30e26772506 The issue has been narrowed down and the observed errors are only seen in a specific subset of infrastructure in a single region. We are continuing to investigate and will provide further updates as information becomes available. Intermittent Failures due to networking https://replicatestatus.com/incident/305176 Tue, 26 Dec 2023 15:33:00 -0000 https://replicatestatus.com/incident/305176#383709f30362ab0e09cf12f5ff5b7eeb1e2f6d5cc985cd7fc3bb8eae2deaf0fc We are seeing elevated errors within one of our regions relating to networking issues. This is under active investigation.
This is primarily presenting as failed model setup and errors when downloading weights. These events present as groupings for short windows and then cease. We will provide updates as more information becomes available. Models not starting https://replicatestatus.com/incident/303930 Fri, 22 Dec 2023 20:49:00 -0000 https://replicatestatus.com/incident/303930#3bbee92f445a48086bd5e0bd40df35ec26a8f0343ac00305f4e384dcc647d057 The fix has been deployed and all model starts should be back to normal. Models not starting https://replicatestatus.com/incident/303930 Fri, 22 Dec 2023 20:34:00 -0000 https://replicatestatus.com/incident/303930#f0f1c7f88272b89a52a7a24af13aa0f1332128877eeaabc4e366e27f314c18c5 We have identified the problematic change and are in the process of deploying a fix to remedy the problem. Most models saw no issues; a small subset may have been impacted. In almost all cases the models still successfully started but took slightly longer than normal. Models not starting https://replicatestatus.com/incident/303930 Fri, 22 Dec 2023 20:22:00 -0000 https://replicatestatus.com/incident/303930#73d2fa2a034f294b5997b139742e7abaf96b6e16abfbaf3712108b3f7f266d50 We have identified the underlying issue and are working to deploy a fix. Models not starting https://replicatestatus.com/incident/303930 Fri, 22 Dec 2023 19:55:00 -0000 https://replicatestatus.com/incident/303930#9017e8b9a18b24cd2f9349e857ce75fd5173792e910b3cc7dd50d3f44d98ba86 We have identified an issue within one of our providers that is causing a number of models to not start. We are working to identify the root cause of the issue. Model Setup Failures https://replicatestatus.com/incident/303582 Fri, 22 Dec 2023 02:15:00 -0000 https://replicatestatus.com/incident/303582#1a44ef9f5382250e9b36c8b4667d2e347d342cb050fbbc8b4a01bd2e3e00e493 All services are working as expected and all workarounds have been restored to normal behavior. 
Additionally, we have made improvements so that we can respond more quickly and apply mitigations to any future incidents of this kind. Model Setup Failures https://replicatestatus.com/incident/303582 Fri, 22 Dec 2023 01:49:00 -0000 https://replicatestatus.com/incident/303582#5e0d852589d3f08d1bb29bfdd11ebd9075d205f5760f65842ea443a3b7cfc15f We have confirmed our mitigations have been propagated and that our provider has resolved the underlying issue. We will be removing our change and restoring normal service shortly. Model Setup Failures https://replicatestatus.com/incident/303582 Fri, 22 Dec 2023 01:22:00 -0000 https://replicatestatus.com/incident/303582#a90375e80ea12f7afb917654142e2b46f3a6840c4c9c475879a4aeea9d9253a2 Due to an outage with one of our providers we are seeing elevated model setup failures. We have taken steps to limit the impact, but expect the error rates to remain elevated while our change propagates. Model setup failing https://replicatestatus.com/incident/303085 Thu, 21 Dec 2023 00:39:00 -0000 https://replicatestatus.com/incident/303085#3977a603f74ece80c831960d2fe72dc6bc329f4522ac811e6eb7e7d380a9ca37 Code has been rolled back and models are no longer failing setup due to this issue. Model setup failing https://replicatestatus.com/incident/303085 Thu, 21 Dec 2023 00:12:00 -0000 https://replicatestatus.com/incident/303085#e8946691f2fae162baa2230ff575af6f302ca5b2762ef912fce756fd22252512 We are aware of an issue causing some models to fail setup. We are rolling back the problematic code. The vast majority of models are working as expected. Models not booting https://replicatestatus.com/incident/302501 Tue, 19 Dec 2023 13:46:00 -0000 https://replicatestatus.com/incident/302501#567a813d2dde57a37fc78ba6c450da4998b1c797f7fed9fbb15cca18d09ad97a All queues have been processed and service should be back to normal. Sorry for the interruption, folks. 
Models not booting https://replicatestatus.com/incident/302501 Tue, 19 Dec 2023 13:28:00 -0000 https://replicatestatus.com/incident/302501#f2f9df22db2d00447540c6a59f6b219537d0e72973a14dd914c5410079161601 We've fixed the autoscaler in the affected region, and it's now starting instances to process the queues. We'll monitor the situation until the queues are back to normal. Models not booting https://replicatestatus.com/incident/302501 Tue, 19 Dec 2023 13:24:00 -0000 https://replicatestatus.com/incident/302501#85eec483edd4eba80b7beeda522eaa814488d99c8f1631552c20de586582a4f2 We've identified the cause of the issue, and are applying a fix now. Models not booting https://replicatestatus.com/incident/302501 Tue, 19 Dec 2023 13:18:00 -0000 https://replicatestatus.com/incident/302501#e9c81b170e18b312430f2618d54c27f04c8fdac1c110e17f4fc01094e5eb4699 We're investigating an issue with models not booting for one of our cloud providers. 
Slow Model Startup https://replicatestatus.com/incident/296948 Wed, 06 Dec 2023 22:14:00 -0000 https://replicatestatus.com/incident/296948#e415491375f77b3aff5a31a83bb5c3eacb0d82e1533c4b1926e4a6caa4b8755a We have cleared up the backlog of models seeing slow starts. Slow Model Startup https://replicatestatus.com/incident/296948 Wed, 06 Dec 2023 21:45:00 -0000 https://replicatestatus.com/incident/296948#816fa810c768d87cbed7d356661fdf1c4058c42f6252caf3bcbd58b4b47a59d3 Delays are being experienced in starting some models in one of our regions. We have engaged with our provider to address the issue. NVIDIA Driver Issues https://replicatestatus.com/incident/295102 Sat, 02 Dec 2023 15:15:00 -0000 https://replicatestatus.com/incident/295102#326b7200e4d3e5bb70bed87a49ad9e43d69b77836b78a7881d8db23e51740cbe We have identified a few nodes within one of our regions that exhibit issues with NVIDIA drivers not being installed. We have isolated these nodes from further workload scheduling (both inference and training) and will recycle the problematic nodes. 
Container Images pull delays https://replicatestatus.com/incident/294857 Fri, 01 Dec 2023 22:34:00 -0000 https://replicatestatus.com/incident/294857#76ea62ee47a164d3ecad607f1c0585f3112d271a95334098d287cb929bd349a3 Thank you for your patience. We have cleared up the remaining backlog of pending workloads. Inference and Trainings are now running as expected for all hardware types. Container Images pull delays https://replicatestatus.com/incident/294857 Fri, 01 Dec 2023 22:25:00 -0000 https://replicatestatus.com/incident/294857#4d9b8b01bd2d9f3af8fee781cd229e83f8ea1b092fda085bd6f09a1079458e1f Image pull delays have been resolved with our provider. We are working with the provider to identify and remediate the underlying root cause. There is a small backlog of workloads (both inference and training) for larger hardware types (e.g. 8x A40 GPU) that we are working through. Container Images pull delays https://replicatestatus.com/incident/294857 Fri, 01 Dec 2023 22:09:00 -0000 https://replicatestatus.com/incident/294857#dc5438fbe7d83c74c66ef1ddb2c0ed7bd25f59767c95bc0d9b63e2ac978105d9 We are experiencing an issue with one of our providers with container images seeing delays reaching the region. This is impacting primarily A40, and some A100 hardware targets. We are working with our provider to get to the root of the problem. A100 GPU maintenance https://replicatestatus.com/incident/287114 Tue, 14 Nov 2023 18:59:29 -0000 https://replicatestatus.com/incident/287114#0a7dc66449ada4cce5cf701b27568205235c8159f7efe881ea1c1e10c367d116 We have been informed by one of our providers that they need to perform urgent maintenance on some of our GPUs. We are in the process of shifting traffic in the hopes of minimizing impact on our customers. There may still be scheduling delays for A100 GPU traffic. 
A100 GPU maintenance https://replicatestatus.com/incident/287114 Tue, 14 Nov 2023 18:45:00 +0000 https://replicatestatus.com/incident/287114#b6bed43f9db2a7530ced1fdb642a5a265b9c1ebb980a7b65a5924e01dfaf952e Maintenance completed A100 GPU maintenance https://replicatestatus.com/incident/287114 Tue, 14 Nov 2023 16:45:00 -0000 https://replicatestatus.com/incident/287114#0a7dc66449ada4cce5cf701b27568205235c8159f7efe881ea1c1e10c367d116 We have been informed by one of our providers that they need to perform urgent maintenance on some of our GPUs. We are in the process of shifting traffic in the hopes of minimizing impact on our customers. There may still be scheduling delays for A100 GPU traffic. Problems running some A40 models https://replicatestatus.com/incident/285385 Fri, 10 Nov 2023 22:09:00 -0000 https://replicatestatus.com/incident/285385#1dcab06d7beb2c03128c36586355c3fa5bc621decb2a38a166dbed3ce0647026 We have confirmed and corrected any model versions erroneously disabled during this issue. Use of A40s for predictions and trainings is now working as expected. Problems running some A40 models https://replicatestatus.com/incident/285385 Fri, 10 Nov 2023 21:59:00 -0000 https://replicatestatus.com/incident/285385#b4f0a3196f2d1095e52bc6aa372ebc5fd628dbfdaa625536480fa72165a89a0f We have completed the rollout of the fix and are in the process of ensuring all model versions impacted by this are not in a disabled state. Problems running some A40 models https://replicatestatus.com/incident/285385 Fri, 10 Nov 2023 21:40:00 -0000 https://replicatestatus.com/incident/285385#4abb8f20eb7b2ce7d578a16435bd52e80bfe1f88b6927a65a7294cc0e83d25d8 We are rolling out a fix to address the underlying issue. Some model versions will still be in a broken state until the rollout is complete. 
Problems running some A40 models https://replicatestatus.com/incident/285385 Fri, 10 Nov 2023 18:53:00 -0000 https://replicatestatus.com/incident/285385#877e2e5f47403861c7cc34619764ee289f0c39556ac00af3fb5407b44fdb8608 We have identified what's going wrong here and are working to get the affected models fixed as soon as possible. Problems running some A40 models https://replicatestatus.com/incident/285385 Fri, 10 Nov 2023 16:59:00 -0000 https://replicatestatus.com/incident/285385#780febbd8d0903d47386e3e1be578591ba9f225e8555ff8466b4bb0f96d58e66 We're aware of some issues affecting some models running on A40 hardware that are preventing models from booting. We're investigating and will provide an update as soon as we have one. Replicate website unavailable https://replicatestatus.com/incident/284333 Wed, 08 Nov 2023 15:53:00 -0000 https://replicatestatus.com/incident/284333#d6e8588d74711bfceaf7b72f7817088e9dfb929075d1729ffa43a2daf75689f9 It looks to us like one of our providers had a brief outage and things are now coming back. We're continuing to monitor the situation. (Technical details: it looks like an upstream provider had a brief DNSSEC zone signing outage.) Replicate website unavailable https://replicatestatus.com/incident/284333 Wed, 08 Nov 2023 15:49:00 -0000 https://replicatestatus.com/incident/284333#ee0de9a4fbfd422a395eaffe3ae579a73ad8ef1845a251428ba42b1d0dd4e06b The replicate.com website is currently down. We're working to address the issue as fast as we can. 
Slow model startup in some cases https://replicatestatus.com/incident/282850 Mon, 06 Nov 2023 00:53:00 -0000 https://replicatestatus.com/incident/282850#531a2b70bf58acb9a5c3338df62279ae5b6930751cd23bdf190449ab9cfe07de The slow model startup has resolved. We will continue to work internally and with our provider to remediate the root cause. Slow model startup in some cases https://replicatestatus.com/incident/282850 Mon, 06 Nov 2023 00:18:00 -0000 https://replicatestatus.com/incident/282850#970b9708f9190047b19e07341e4199f882f7d953720e0a24b4e66fb3662e1fb8 We are seeing slower than expected model startup times in one of our regions. We have engaged with our provider and are working to isolate the root cause. This should not impact predictions or training completions. Slower predictions and webhook delivery https://replicatestatus.com/incident/282596 Sun, 05 Nov 2023 08:03:00 -0000 https://replicatestatus.com/incident/282596#1aa77a30386dc0cbf6e9026b4aee0d2495b6c8d6777edae45243752017e6262d The prediction and webhook delivery issues are resolved now. There might still be a delay in webhook delivery of older predictions. Slower predictions and webhook delivery https://replicatestatus.com/incident/282596 Sun, 05 Nov 2023 07:50:00 -0000 https://replicatestatus.com/incident/282596#11026591db56a2de72ad6f702c5749232c1fbddc40e73f2c98b63e4712a8a903 We see an improvement in prediction and webhook delivery times. We will keep monitoring the situation. 
Slower predictions and webhook delivery https://replicatestatus.com/incident/282596 Sun, 05 Nov 2023 07:02:00 -0000 https://replicatestatus.com/incident/282596#28630c5fd8a1512950674477e089ac729c6f4731484fba32e8a18a19323f71fb We are investigating some prediction and webhook delivery issues. Investigating predictions creation issues https://replicatestatus.com/incident/281535 Thu, 02 Nov 2023 18:53:00 -0000 https://replicatestatus.com/incident/281535#a4e496cfd61dc5ba6f25db183091715cab7c8dc38a7a7733d0142b0409a68961 The issue has been resolved and predictions are now functioning normally. Investigating predictions creation issues https://replicatestatus.com/incident/281535 Thu, 02 Nov 2023 17:56:00 -0000 https://replicatestatus.com/incident/281535#f215c5ccabc45a69e8c6ee7e62355ac50a5fbeb009f3beb075867a0236b904d8 We have rolled out the fix and are monitoring the situation. Investigating predictions creation issues https://replicatestatus.com/incident/281535 Thu, 02 Nov 2023 17:23:00 -0000 https://replicatestatus.com/incident/281535#6f1628a33c2e726a41e0370dfb3460c89b8668e0dde58c701c7965c440051b3f We are seeing some predictions failing to be created; we are investigating. Replicate Web Internal Service Error https://replicatestatus.com/incident/281055 Wed, 01 Nov 2023 21:43:00 -0000 https://replicatestatus.com/incident/281055#487bc5ab1a5794ed9b56d2825cebc10235b51b423c9a3633c7cc2240e2cb2b11 Rollback of the problematic change has completed and the Replicate website is now functioning normally again. Replicate Web Internal Service Error https://replicatestatus.com/incident/281055 Wed, 01 Nov 2023 21:31:00 -0000 https://replicatestatus.com/incident/281055#70e0f4d12130524df023608e7fd932a43f97601796e753077d08ea88f9f63c26 We have identified the root cause of the elevated errors with the Replicate website. A rollback is under way to restore service. API access remains unaffected. 
Replicate Web Internal Service Error https://replicatestatus.com/incident/281055 Wed, 01 Nov 2023 21:26:00 -0000 https://replicatestatus.com/incident/281055#d027b47ed628de412133967562bf0afd385f46d50f1a43af9eee5114af56dd55 Replicate Web is currently seeing an elevated error rate. SDXL Finetune errors https://replicatestatus.com/incident/280959 Wed, 01 Nov 2023 18:37:00 -0000 https://replicatestatus.com/incident/280959#037d2048053f15980a24e44fba4381a05ed40df8c5e03c8ee3bb2a7d2910de3e We have rolled out a fix and confirmed finetunes are working as expected. SDXL Finetune errors https://replicatestatus.com/incident/280959 Wed, 01 Nov 2023 16:30:00 -0000 https://replicatestatus.com/incident/280959#9218d7689b1263a8aaf36c9ef2425c9c8e926374ab2ef0f2873799e6d810bcb9 We have identified a fix that will retroactively correct issues with the fine tunes. We are in the process of implementing the fix. We will provide additional updates as the fix is implemented and rolled out. SDXL Finetune errors https://replicatestatus.com/incident/280959 Wed, 01 Nov 2023 16:07:00 -0000 https://replicatestatus.com/incident/280959#dbc52a208236348a0310036e7db8aa6740d64e3db9aa796c2c23a894b1aa7566 We are aware of an issue causing failures with some recently trained (within the last 24 hours) SDXL fine-tunes based upon the replicate stability-ai/sdxl model. We are working to address the issue. This is impacting a narrow set of fine-tunings against this specific model. Predictions and trainings degraded https://replicatestatus.com/incident/274913 Thu, 19 Oct 2023 14:30:00 -0000 https://replicatestatus.com/incident/274913#01b85850c01c5ae6f43195928e67e9906156deb703df04e3795986abd1ad667b Predictions and trainings are back to normal. 
Predictions and trainings degraded https://replicatestatus.com/incident/274913 Thu, 19 Oct 2023 12:58:00 -0000 https://replicatestatus.com/incident/274913#b710eda95c965e3c5d98304bec08d672d27084ed655110950eb1a1874da28cb3 We are seeing predictions and trainings degraded; they may take longer than usual to complete. We have identified the root cause and are rolling out the fix. replicate.com database maintenance https://replicatestatus.com/incident/270731 Tue, 10 Oct 2023 12:24:46 -0000 https://replicatestatus.com/incident/270731#8958ade1bf1f75ece257d5b8e09678677b008045246b6b21392c094c50844f5b We're performing some database maintenance on the replicate.com website. This may cause errors or slow responses for a few minutes. We expect the total impact to be no more than 5 minutes and will update this page if anything goes wrong. replicate.com database maintenance https://replicatestatus.com/incident/270731 Tue, 10 Oct 2023 12:15:00 +0000 https://replicatestatus.com/incident/270731#74f99cda2dd8d26e76ff719dfacc908a177cffc08c0d7b566d92d63c5e07e370 Maintenance completed replicate.com database maintenance https://replicatestatus.com/incident/270731 Tue, 10 Oct 2023 12:11:58 -0000 https://replicatestatus.com/incident/270731#8958ade1bf1f75ece257d5b8e09678677b008045246b6b21392c094c50844f5b We're performing some database maintenance on the replicate.com website. This may cause errors or slow responses for a few minutes. We expect the total impact to be no more than 5 minutes and will update this page if anything goes wrong. 
replicate.com database maintenance https://replicatestatus.com/incident/270731 Tue, 10 Oct 2023 12:00:00 -0000 https://replicatestatus.com/incident/270731#8958ade1bf1f75ece257d5b8e09678677b008045246b6b21392c094c50844f5b We're performing some database maintenance on the replicate.com website. This may cause errors or slow responses for a few minutes. We expect the total impact to be no more than 5 minutes and will update this page if anything goes wrong. Webhook delivery interrupted https://replicatestatus.com/incident/269948 Sun, 08 Oct 2023 20:13:00 -0000 https://replicatestatus.com/incident/269948#5f52bad648e54d57c572370dd0067521fa1492501b3214b4107ea401207f2bd7 We identified a problem affecting a small portion of customers -- slow responses to webhooks caused a backlog in processing outbound webhooks -- and have deployed a change to increase available webhook processing capacity. Webhook delivery is back to normal as of a few minutes ago. Webhook delivery interrupted https://replicatestatus.com/incident/269948 Sun, 08 Oct 2023 20:01:00 -0000 https://replicatestatus.com/incident/269948#fa05ada224ac6f2f29487994eb7c961148db41da697e444806fc7d2624989d83 We're looking into issues affecting webhook deliverability. Pushing of new versions is broken https://replicatestatus.com/incident/269120 Fri, 06 Oct 2023 11:20:00 -0000 https://replicatestatus.com/incident/269120#3a9df7804e0f85450cff28b17be5ab6b98cb7fc19c4fe053eaefb5006dad539f We've fixed the issue and you should be able to push new versions again. Pushing of new versions is broken https://replicatestatus.com/incident/269120 Fri, 06 Oct 2023 11:00:00 -0000 https://replicatestatus.com/incident/269120#5ecf3749461d0e93c5e8597d73ab30645f07239873254d140f80f00d5d2dcda3 New versions created when using `cog push` or when running a training are currently failing. We know the cause and are working on a fix which should be out shortly. 
Slow responses from replicate.com https://replicatestatus.com/incident/268601 Thu, 05 Oct 2023 07:29:00 -0000 https://replicatestatus.com/incident/268601#36ed73c846588f490fb7da5ecb0fdf6b6fd62acf139058942491ede9ed4ffc29 Database load is back to normal, performance should be back to usual levels. Web and predictions degraded https://replicatestatus.com/incident/266073 Fri, 29 Sep 2023 16:12:00 -0000 https://replicatestatus.com/incident/266073#74731af4e521053077da8f07949af1a4edbdc54dc8f3571d65676230f392e540 We have now fully resolved the issues and API and replicate.com website are fully operational. Web and predictions degraded https://replicatestatus.com/incident/266073 Fri, 29 Sep 2023 16:07:00 -0000 https://replicatestatus.com/incident/266073#b0ebe2d2cf4b68763c810cb8de79cc48ec9f94215a663cf71601a3e35f555472 Most of the impact of this incident has been mitigated and we are working on resolving the other issues. We are aware that an upstream provider outage resulted in cascading impact to the Replicate platform. This is not expected and we will be investigating in the coming days. 
Web and predictions degraded https://replicatestatus.com/incident/266073 Fri, 29 Sep 2023 15:45:00 -0000 https://replicatestatus.com/incident/266073#9663cb64b04bf407cc995a5a3d0b114738049414c75e298810b09bbcfb38e1eb We are seeing predictions and trainings fail, while some predictions and trainings are processed slower than usual. This is due to issues with one cloud provider. Web and predictions degraded https://replicatestatus.com/incident/266073 Fri, 29 Sep 2023 15:37:00 -0000 https://replicatestatus.com/incident/266073#caf42229523c462c966a115e25e8576decfb7c8386b38b0a286a6787fdb32dda Our API (and subsequently web-based predictions and trainings) is seeing higher than expected errors causing delays in starting predictions and trainings. Predictions and training degraded for one cloud provider https://replicatestatus.com/incident/265575 Thu, 28 Sep 2023 14:52:00 -0000 https://replicatestatus.com/incident/265575#890469f013d77ca06654f8af0ea93730896be63b5796278da8457cfee2b576b0 The API and website are now working as expected for predictions and trainings. 
Predictions and training degraded for one cloud provider https://replicatestatus.com/incident/265575 Thu, 28 Sep 2023 14:33:00 -0000 https://replicatestatus.com/incident/265575#10d053daf05ac80dbd9b626880056eb0ce478914333546444f249a39d2f179b7 Our API (and subsequently web-based predictions and trainings) is seeing higher than expected errors causing delays in starting predictions and trainings. Degraded API / API Errors https://replicatestatus.com/incident/265177 Wed, 27 Sep 2023 19:31:00 -0000 https://replicatestatus.com/incident/265177#3f76b7d5eb282bf9848a9cc2764d07229915286159cacdf808f8c0f9efc65a37 We have identified a problematic ingress pod and have caused it to reschedule. The API and website are now working as expected for predictions and trainings. Degraded API / API Errors https://replicatestatus.com/incident/265177 Wed, 27 Sep 2023 19:20:00 -0000 https://replicatestatus.com/incident/265177#9cdab609b453dca3d364e103741c56beac5ed3a1d238502aad21651fa8a94c24 Our API (and subsequently web-based predictions and trainings) is seeing higher than expected errors causing delays in starting predictions and trainings for some hardware types (A40 and some A100s). We are investigating the issue. 
Website downtime / API degraded https://replicatestatus.com/incident/265150 Wed, 27 Sep 2023 17:56:00 -0000 https://replicatestatus.com/incident/265150#e6f8ab509a557fe65020ca1ac985279c15c4290376e3bb41888a982a0f10df55 We have rolled back the problematic change. Website functionality has been restored and API error rate has returned to normal. Website downtime / API degraded https://replicatestatus.com/incident/265150 Wed, 27 Sep 2023 17:51:00 -0000 https://replicatestatus.com/incident/265150#52994d1f2b48d57060a27c1141c9a8e5c6b071f09e76469682323072f81ffb09 We are aware of an issue that caused the website to stop responding. We are rolling back the identified problematic code. API is seeing higher than expected error-rates as well. 
Slow start on some predictions and trainings (A40 and some A100) https://replicatestatus.com/incident/262409 Thu, 21 Sep 2023 21:06:00 -0000 https://replicatestatus.com/incident/262409#436e50f0bfc52ba2ce7ffe948e0ca868c8ea0d5cdf1bd293a11d1dd81eb00af2 We have worked through the pending predictions and trainings and now see normal start times. System unavailable https://replicatestatus.com/incident/262521 Thu, 21 Sep 2023 20:37:00 -0000 https://replicatestatus.com/incident/262521#e3a6f079c81c1ad0314b65b1aab0c1f28ed4bb218e4616163f4e4b4060fa2f22 We have recovered our caching service and see predictions and training succeeding. System unavailable https://replicatestatus.com/incident/262521 Thu, 21 Sep 2023 20:27:00 -0000 https://replicatestatus.com/incident/262521#c4ac935d6d3b926c3c11bc52c8571d1c445724b7c98ee5fefe84d047de253533 We are aware of an incident that has created instability in the API and Website. Predictions and trainings are impacted and may not run. The website and API will likely show errors until recovery is complete. 
Slow start on some predictions and trainings (A40 and some A100) https://replicatestatus.com/incident/262409 Thu, 21 Sep 2023 19:33:00 -0000 https://replicatestatus.com/incident/262409#46ab43295bc074e75bb5721522b72f18061a5caf6607b64bbb8637f09e9bcae8 There have been improvements to the start times. We are still seeing delays in starts for predictions and trainings on A40 and some A100 GPUs; we are continuing to work through the pending predictions and trainings. API and Website in general remain responsive and available outside of the affected GPU targets. Slow start on some predictions and trainings (A40 and some A100) https://replicatestatus.com/incident/262409 Thu, 21 Sep 2023 18:29:00 -0000 https://replicatestatus.com/incident/262409#2c5b321f5e0c38de56f2e43149fa4517274e79a5ca91a989549b49f8d30e18d7 We continue to see slow starts for A40 and some A100 workloads. We are continuing to work through pending predictions and trainings for these hardware types. Slow start on some predictions and trainings (A40 and some A100) https://replicatestatus.com/incident/262409 Thu, 21 Sep 2023 17:13:00 -0000 https://replicatestatus.com/incident/262409#29c6fd1bb901fc16113881c039b66d59df3c8c9136eee80922e8c7c1071e5401 We have identified the root cause and are working to clear the backlog of pending predictions and trainings. Slow start on some predictions and trainings (A40 and some A100) https://replicatestatus.com/incident/262409 Thu, 21 Sep 2023 16:14:00 -0000 https://replicatestatus.com/incident/262409#39e39815bf027d7c45d456f41da5cba446eae2b5916ad620aaf1cba37d029bdb We are aware of an issue with some GPU targets (A40) taking longer than expected to start predictions and trainings. We are investigating the issue. API and Web are otherwise functioning as normal. 
Temporary capacity issues with 8xA40 hardware type https://replicatestatus.com/incident/262002 Thu, 21 Sep 2023 00:39:00 -0000 https://replicatestatus.com/incident/262002#e5b75935869d5fcb0a67bcaf1cc73e0ed6f4f8e1200b52f3a22665b63718753c We resolved the capacity issues. Temporary capacity issues with 8xA40 hardware type https://replicatestatus.com/incident/262002 Wed, 20 Sep 2023 22:56:00 -0000 https://replicatestatus.com/incident/262002#377f3d8b544ffd06a3c25726e8ea6754b6f070524be647a2710b4155da63c6ef We have identified the problem, and capacity issues are slowly resolving themselves. We will monitor and update the status once things are fully resolved. Temporary capacity issues with 8xA40 hardware type https://replicatestatus.com/incident/262002 Wed, 20 Sep 2023 21:59:00 -0000 https://replicatestatus.com/incident/262002#e68b97afbd20b90376510fbe935f7fc523eb7f910247ba919211f76ed9fa8eeb We are having a temporary dip in capacity of the 8xA40 hardware type. Predictions and trainings on this hardware can be expected to take much longer than usual. Primary database outage https://replicatestatus.com/incident/261484 Tue, 19 Sep 2023 21:17:00 -0000 https://replicatestatus.com/incident/261484#806f39156cc337948f3ea98144fcc0cc7d59daf04cb14df2def47b25dbb610f8 Both the API and web are now back to normal. Predictions and trainings are functioning as expected. We are continuing to monitor things. 
Primary database outage https://replicatestatus.com/incident/261484 Tue, 19 Sep 2023 21:03:00 -0000 https://replicatestatus.com/incident/261484#a1344d853f08f511711d5706a778b33223587b8d01360b96e467663cf78c86d2 We identified and rolled out a potential solution and we see API and web improving. Primary database outage https://replicatestatus.com/incident/261484 Tue, 19 Sep 2023 19:45:00 -0000 https://replicatestatus.com/incident/261484#c0b9f5d8158a1d5a7bd307d8072f3bc8d628c8c3905fefef3ac442ea73741d4f More parts of web and API are degraded. We have identified unusually high load on our primary database and this is causing problems throughout replicate.com and the API. Predictions and trainings are not working correctly via replicate.com or the API. We are working on resolving it. API degraded https://replicatestatus.com/incident/260893 Mon, 18 Sep 2023 16:50:00 -0000 https://replicatestatus.com/incident/260893#f724b66d48838cf7cbbd96c1919e6e454e71cf9ab8ec56fc6086600d5f374cfa API is now behaving normally. 
API degraded https://replicatestatus.com/incident/260893 Mon, 18 Sep 2023 16:23:00 -0000 https://replicatestatus.com/incident/260893#949d114ac1bfa946892e2a05ecd2bd598c05070006405bd902ea3ac11aa5e260 Our API is currently in a degraded state, generating more errors than usual. We are investigating. Web and predictions degraded https://replicatestatus.com/incident/256732 Fri, 08 Sep 2023 12:37:00 -0000 https://replicatestatus.com/incident/256732#e10cb37ddc3fbdd40499794d84936e05307b616b7560744effc0ac61b5e5d9a4 Everything is resolved and back to normal. During the downtime predictions were completing normally in the API, but were not persisted. Web and predictions degraded https://replicatestatus.com/incident/256732 Fri, 08 Sep 2023 11:14:00 -0000 https://replicatestatus.com/incident/256732#2f8db75ad133d9b4e54dff34df257e2e9e593dcef753ab0e4593449eec82485b Web was down for most users. API continued to be functional but no predictions were persisted during the downtime. Degraded Prediction and Training Start Times https://replicatestatus.com/incident/256366 Thu, 07 Sep 2023 20:06:00 -0000 https://replicatestatus.com/incident/256366#866e5edc13c29d27a5fcde415cdcaf99feca454d0cae514aa5461a118a39b577 The issue with the upstream provider has been resolved. 
Predictions and Trainings are expected to be starting within normal timeframes. Degraded Prediction and Training Start Times https://replicatestatus.com/incident/256366 Thu, 07 Sep 2023 16:13:00 -0000 https://replicatestatus.com/incident/256366#f8e5207b728b26022369a607155b482a204c0308a4d613799051b850f441e58b Prediction and Training start times are degraded in some scenarios. We are aware of an issue with one of our providers and are working with them to correct the problem. API responses and Web responses remain unaffected. Degraded Prediction Handling https://replicatestatus.com/incident/255839 Wed, 06 Sep 2023 16:15:00 -0000 https://replicatestatus.com/incident/255839#dd683becb2e225049dd791d7e02c01e2d5a27abb5c15d9c8800c44d7c5285069 Prediction processing and prediction are working as expected now. 
Degraded Prediction Handling https://replicatestatus.com/incident/255839 Wed, 06 Sep 2023 16:04:00 -0000 https://replicatestatus.com/incident/255839#289d65b09335fbd7217114748e644febd5e6780e2b64a2df0ce9a371f46efca9 We are aware of degradation in prediction handling and creation. We are currently investigating. Issues starting predictions https://replicatestatus.com/incident/253830 Fri, 01 Sep 2023 20:45:00 -0000 https://replicatestatus.com/incident/253830#4dc8cabb2608d966324ff1c6972a78d9e19915dd4937e18091f4795101e3e511 Everything should be working normally at this time. Issues starting predictions https://replicatestatus.com/incident/253830 Fri, 01 Sep 2023 20:16:00 -0000 https://replicatestatus.com/incident/253830#6bdfa3e8068729105b6f54d96c01cb9d3d32ef4f14fe05c8f1b04eb7b6a712c1 Backlog of predictions should be cleared at this time. Engineers are rolling out a restart of an internal service to ensure all things are working as expected. Predictions and Trainings are expected to be working normally at this time. Incident will be closed as soon as we confirm the restart has completed. 
Issues starting predictions https://replicatestatus.com/incident/253830 Fri, 01 Sep 2023 19:11:00 -0000 https://replicatestatus.com/incident/253830#349c72f8ff796b0251b2f8843b4126a1bb1dc0c74b5c4111d2aa2f671298ebb6 We are still working through the backlog of problematic workloads and predictions. Errors will continue to occur in some circumstances both from web and via API. Issues starting predictions https://replicatestatus.com/incident/253830 Fri, 01 Sep 2023 17:00:00 -0000 https://replicatestatus.com/incident/253830#4669925b0cea2cf89442ec571cd19c1897378f9fe47f2696095500a527286d3a We have remediated the issue and now newly created predictions are working. Engineers are working to clear out the backlog of predictions that failed to start. 
Issues starting predictions https://replicatestatus.com/incident/253830 Fri, 01 Sep 2023 16:46:00 -0000 https://replicatestatus.com/incident/253830#73de52c92dc685b74f56f5edb4762a1b79a2e7fc373520df94be717e8c363d86 The team is aware of an issue that is causing failures starting predictions. Engineers are working on correcting the root cause. New predictions will fail to start in some cases both through web and API. General web usage is not impacted. Currently running predictions will continue to work. Issues scheduling to certain hardware https://replicatestatus.com/incident/248009 Sat, 19 Aug 2023 23:41:00 -0000 https://replicatestatus.com/incident/248009#9375da4a14323f4e7eafbbab6bd127772c15df933299d8597129cfc13157ab4a Thank you for your patience. At this time all hung workloads targeted for the T4 hardware should no longer be stuck in starting phase. Issues scheduling to certain hardware https://replicatestatus.com/incident/248009 Sat, 19 Aug 2023 21:39:00 -0000 https://replicatestatus.com/incident/248009#f74c67ad6f2fbe2c9b885a2d9e5964b6768deabeb690b80f24d7b33e69319e93 We have identified the cause of the delays scheduling workloads to the T4 hardware. We have performed a workaround and are seeing workloads successfully complete at this time. We will update further once pending workloads have cleared. 
Issues scheduling to certain hardware https://replicatestatus.com/incident/248009 Sat, 19 Aug 2023 20:20:00 -0000 https://replicatestatus.com/incident/248009#838d577a726e8a6df09dfcfa87af7cf5a3d64c91ec675fa2f6eed134b850d112 We are aware of and investigating an issue impacting the ability to schedule predictions and trainings to certain hardware. Affected workloads appear to stay in the "starting" phase for extended periods of time. This is primarily impacting the use of the T4 GPUs. Other hardware types should be unaffected. This is not directly impacting responses when interacting with the website or API. Replicate Web Down https://replicatestatus.com/incident/247568 Fri, 18 Aug 2023 15:22:00 -0000 https://replicatestatus.com/incident/247568#99f0089195d2ee481c218870748ea81daaea349dcb22526c40674cd0901a68fd Engineers have rolled back a change to the website, and the website should now be responding as expected. Replicate Web Down https://replicatestatus.com/incident/247568 Fri, 18 Aug 2023 15:16:00 -0000 https://replicatestatus.com/incident/247568#2855b1830dc902ff7209429fb4580508fc2509f4c8b26c4001ba4e575c99b456 Some users might see degraded performance of web. API remains fully functional. Replicate Web Down https://replicatestatus.com/incident/247568 Fri, 18 Aug 2023 15:02:00 -0000 https://replicatestatus.com/incident/247568#8683c5dcbbfb922ddbd1f5e60573ad38e21fb6d170919ddb7bbf820dfd35e3ed We are fully operational. Replicate Web Down https://replicatestatus.com/incident/247568 Fri, 18 Aug 2023 14:58:00 -0000 https://replicatestatus.com/incident/247568#d4f64c8e125e35df44bb10912955a44ac44b822a9a78a99343f3109de9b76fcd We have identified the root cause and have rolled out the fix. Some users might still experience degraded web performance. 
Replicate Web Down https://replicatestatus.com/incident/247568 Fri, 18 Aug 2023 14:49:00 -0000 https://replicatestatus.com/incident/247568#d11120c381b5c0c5d80edd634fee07dba80e9bd56cf944c48932ac991e267533 We are investigating an incident that has caused the Replicate web to stop responding. API requests are expected to continue to respond normally. Website and API Outage https://replicatestatus.com/incident/246194 Tue, 15 Aug 2023 19:28:00 -0000 https://replicatestatus.com/incident/246194#6c7e2080ceae2de327fd2cf9c007dbe361df28950b49a84fa5ec98ddda971962 Reverting the identified change and purging known bad cache values has resolved the error rate within the API service. API and Web should be responding as expected at this time. Website and API Outage https://replicatestatus.com/incident/246194 Tue, 15 Aug 2023 19:21:00 -0000 https://replicatestatus.com/incident/246194#065ec38b2edb2606e0c8367d576745345cbfdc73457dae147584f3e2bfb67ebc We have identified a problematic change that is causing errors within the API service. Changes have been made to mitigate the issue. The website and API are beginning to respond as expected; however, error rates are still elevated. 
Website and API Outage https://replicatestatus.com/incident/246194 Tue, 15 Aug 2023 18:53:00 -0000 https://replicatestatus.com/incident/246194#4f9283ea0b47007eb3dd1c04c1dd3cd056fc46ed48e4fc7580011830c4f41354 We are investigating an issue with the Replicate website and API. Delays starting some models https://replicatestatus.com/incident/244571 Fri, 11 Aug 2023 15:13:00 -0000 https://replicatestatus.com/incident/244571#78bdfe2e7ac616a261a631126cb524853ad1191ad1e5e02fdcb1e84619b1dcab We believe that as of a few minutes ago the last customer impact from this issue has been resolved and all queues have cleared. To help you correlate this incident with any issues you may have seen: as far as we can tell the earliest customer impact from this incident started at about 11:00 UTC today. Delays starting some models https://replicatestatus.com/incident/244571 Fri, 11 Aug 2023 13:38:00 -0000 https://replicatestatus.com/incident/244571#22032d560cb456de7456fcb5c4c6e1ee84db294605f7153b56a2488c5829ff39 We believe we have identified the problem and updated the cluster autoscaler. The system is currently working through the backlog of requests. Delays starting some models https://replicatestatus.com/incident/244571 Fri, 11 Aug 2023 13:25:00 -0000 https://replicatestatus.com/incident/244571#dfa4b0606154d05a1c94398346ccc0cfe6560fc6690671e72707ef8242089e82 We have identified a problem with our cluster autoscaler and are continuing to investigate. 
Delays starting some models https://replicatestatus.com/incident/244571 Fri, 11 Aug 2023 13:03:00 -0000 https://replicatestatus.com/incident/244571#aa9421823ad9191b0c10546794446c5b36166bec3d7b651b037e248d81161044 We're aware of scheduling delays affecting some models and are investigating the issue. Models not booting https://replicatestatus.com/incident/243677 Wed, 09 Aug 2023 22:14:00 -0000 https://replicatestatus.com/incident/243677#b1828a776bf7ae5947850a896a352cbadf798c539268c6bb764b66f6fb525e31 The fix has been rolled out for all models. Models not booting https://replicatestatus.com/incident/243677 Wed, 09 Aug 2023 20:45:00 -0000 https://replicatestatus.com/incident/243677#dcb9b190a16717be886c6cc251b2e34b48600b223c0cd7e34be9c1a605ea4634 We have identified the cause and are rolling out a fix now. It may take 30–60 minutes to fully roll out. Models not booting https://replicatestatus.com/incident/243677 Wed, 09 Aug 2023 20:22:00 -0000 https://replicatestatus.com/incident/243677#583a72fad548efbfb5136749ff5b1e581b27c51c6c466ce58486abcb97f6474e We're investigating an issue preventing models from booting. Model instances that are already running appear to be continuing to run, but new models or new instances for existing models are not able to start. 
520 error responses from API https://replicatestatus.com/incident/240960 Thu, 03 Aug 2023 13:48:00 -0000 https://replicatestatus.com/incident/240960#9e73c710aefce1e325f2b91ab4f3b11ad65d0a9e542d7b51c14e988738f4f684 We've identified the source of the errors -- a global load balancing service appears to have been misbehaving -- and made changes to how we serve api.replicate.com to mitigate the problem. As of a few minutes ago, we are no longer serving 520 error responses to customers. 520 error responses from API https://replicatestatus.com/incident/240960 Thu, 03 Aug 2023 10:45:00 -0000 https://replicatestatus.com/incident/240960#b493570570c5e9cb25db2a8bba98b79c52eb9e9e4a0a221bf16ce1d0c8580b7c We're aware of a problem affecting ~1% of requests to api.replicate.com, which are receiving HTTP 520 responses. We're investigating the problem. Replicate website unavailable https://replicatestatus.com/incident/240645 Wed, 02 Aug 2023 16:48:00 -0000 https://replicatestatus.com/incident/240645#13dbd382595db6c7bfbc5e94a4fe5d5cebabb4cc18207a3bdb6b65fbcae783d6 We're back! We pushed a bad change and have rolled it back. Sorry for the inconvenience. Replicate website unavailable https://replicatestatus.com/incident/240645 Wed, 02 Aug 2023 16:41:00 -0000 https://replicatestatus.com/incident/240645#d16135145e88713993236d08c9ec084cff5326100d7d95a51ad4d1c5f6c9845d The replicate.com website is having trouble serving requests and most users will be experiencing timeouts or errors. We're investigating. 
Prediction requests failing https://replicatestatus.com/incident/239647 Mon, 31 Jul 2023 12:15:00 -0000 https://replicatestatus.com/incident/239647#0c014f37a94eb554131445d543098d747a146d7eba50c6187480f886d21f8d95 All prediction requests are now responding normally. We're still investigating the underlying cause. Prediction requests failing https://replicatestatus.com/incident/239647 Mon, 31 Jul 2023 12:07:00 -0000 https://replicatestatus.com/incident/239647#14826f14974f41d1ed44955183ff07bfad720fdab3f9c900f1dac133f23d674b Prediction requests for some models are currently failing. API errors/timeouts https://replicatestatus.com/incident/238656 Fri, 28 Jul 2023 18:10:00 -0000 https://replicatestatus.com/incident/238656#4ff083fc28137a125dc1a5818514920fefb9c758c4ce6fef5c73344341533201 The API is fully recovered. Unfortunately we are still at least partially in the dark about what triggered these problems. We're continuing to investigate. API errors/timeouts https://replicatestatus.com/incident/238656 Fri, 28 Jul 2023 18:05:00 -0000 https://replicatestatus.com/incident/238656#2d1c381b44d15e553db637960f15aadd414978e680749ebec7a8eee2579fa25c The situation is improving, but we're aware that some customers are still experiencing this issue. We're continuing to investigate. API errors/timeouts https://replicatestatus.com/incident/238656 Fri, 28 Jul 2023 17:20:00 -0000 https://replicatestatus.com/incident/238656#07c86493a6e452ca263c489b6d26841d94ea3addbfa73b253362ff1aa8f30cc6 We're aware that some API requests are again seeing errors and timeouts. We're actively looking into what's causing this. API errors/timeouts https://replicatestatus.com/incident/238405 Fri, 28 Jul 2023 05:32:00 -0000 https://replicatestatus.com/incident/238405#2df70239e4146d80f6d3da2a19fc13d3e18cda764090f21396037ce6b2a4056d Services have recovered. We'll be following up with our provider to understand how the scope of the planned maintenance expanded to affect customer workloads. 
API errors/timeouts https://replicatestatus.com/incident/238405 Fri, 28 Jul 2023 05:25:00 -0000 https://replicatestatus.com/incident/238405#b0b349cabb4307e46a2d2f3efd7fa9c466d6c8963414d7e6e83186c0f6454372 Services are starting to recover. We continue to monitor the situation. API errors/timeouts https://replicatestatus.com/incident/238405 Fri, 28 Jul 2023 05:18:00 -0000 https://replicatestatus.com/incident/238405#c9e8db5c7a39618bd6d4f05e1c4e07232e2fa5bfcff20ad95ab3ebb82fa03e3a We're in communication with our provider and are working to restore service as soon as possible. In the meantime we're taking steps to shift traffic to other providers. API errors/timeouts https://replicatestatus.com/incident/238405 Fri, 28 Jul 2023 04:52:00 -0000 https://replicatestatus.com/incident/238405#962e7c1f7cd7debf2cfcf64688c08e0eddef671a978f6e3ca08ec19523a4e7dc We've identified the problem: planned network maintenance by a provider is having broader impact than expected. We are working to restore service. API errors/timeouts https://replicatestatus.com/incident/238405 Fri, 28 Jul 2023 04:30:00 -0000 https://replicatestatus.com/incident/238405#7b0139f522f379db7d1d578ab9a6f7788fa244cc63f300e20fdd716493bd1075 We're investigating errors and timeouts being returned by our API. API errors https://replicatestatus.com/incident/237579 Wed, 26 Jul 2023 10:51:00 -0000 https://replicatestatus.com/incident/237579#54914484d444e7826d80dcd6ca74707ebea37cb9729252b2ec1917419b0d54fd We've identified a service that was starved of compute resources and addressed that problem. Service has been restored. API errors https://replicatestatus.com/incident/237579 Wed, 26 Jul 2023 10:44:00 -0000 https://replicatestatus.com/incident/237579#ce88d04c7d6437998f1bb70d1b920687d10f051a3a45386b27fc03065ff0c181 We're investigating issues with slow responses and HTTP errors from the Replicate API. 
Prediction creation errors https://replicatestatus.com/incident/234969 Wed, 19 Jul 2023 18:35:00 -0000 https://replicatestatus.com/incident/234969#e3514c01b9e941af588301622b48838ad4e4690012462fe279726214e6c37c28 We've restored service to the queueing system and predictions are flowing again. Prediction creation errors https://replicatestatus.com/incident/234969 Wed, 19 Jul 2023 18:25:00 -0000 https://replicatestatus.com/incident/234969#6042dbee6c20302ddc03753739bfa197f2b990facb8e57679690d10293f8f422 We've identified what's gone wrong (an internal queueing system is experiencing a disk error) and we're working on restoring service. Prediction creation errors https://replicatestatus.com/incident/234969 Wed, 19 Jul 2023 18:12:00 -0000 https://replicatestatus.com/incident/234969#398b933a7dd83bc670f0923a7119b7b3c7b3c84632242599fd800fd3a6fede6c We're investigating issues with prediction creation. Delayed prediction start times https://replicatestatus.com/incident/225394 Tue, 27 Jun 2023 16:35:00 -0000 https://replicatestatus.com/incident/225394#35187f618bf04830620f9d37921ad0feb67cf4e93d59c114b3175ee8d43d6855 Predictions are flowing as expected once again. We'll continue to monitor the situation. Delayed prediction start times https://replicatestatus.com/incident/225394 Tue, 27 Jun 2023 16:00:00 -0000 https://replicatestatus.com/incident/225394#e30ff01f51af1687963c7b5ed508614ec142c418f2eeb055f009ba9af4b1767c We're aware that some users are seeing delayed prediction start times for some models. This is due to an ongoing infrastructure provider outage and we are monitoring the situation closely. Web and API failures https://replicatestatus.com/incident/223885 Fri, 23 Jun 2023 15:45:00 -0000 https://replicatestatus.com/incident/223885#33ef54665abfae947a8e54afc52a9601ad84367ff4fd525351ee5b2ba58a253f The rollback fixed things and we're back to normal. 
Web and API failures https://replicatestatus.com/incident/223885 Fri, 23 Jun 2023 15:38:00 -0000 https://replicatestatus.com/incident/223885#ab71df7dba87dfc101c04f3426dff3c287a717853cac0f798d50f7fea6a56d6b We pushed a change that's causing some endpoints on Replicate's website and API to return 500 Server Error responses. We've already triggered a rollback and will be monitoring the situation closely for recovery. Garbled/corrupted responses https://replicatestatus.com/incident/223082 Wed, 21 Jun 2023 16:10:00 -0000 https://replicatestatus.com/incident/223082#866467d65a1e87f4da122e5f56c43713a7d03953b07712bd1e23d481abbcf757 We've identified what's causing this issue and have rolled back the change. Affected predictions will have completed successfully and you can re-request their status through the API. Garbled/corrupted responses https://replicatestatus.com/incident/223082 Wed, 21 Jun 2023 14:53:00 -0000 https://replicatestatus.com/incident/223082#c3fa173709e30a4f88602a0b790dfddf3f521d15c0860bc2c2b0c2ea61ce4a3a We're investigating a problem where users are receiving garbled responses from requests to the API. 
Model autoscaling degraded https://replicatestatus.com/incident/216435 Mon, 05 Jun 2023 23:23:00 -0000 https://replicatestatus.com/incident/216435#90bc84f18a6438e3e06f192c54f0b0ca2f66680ac33400a22c9b38b35a92cf7a Autoscaling issues have been resolved for all models and everything should be operating normally. Model autoscaling degraded https://replicatestatus.com/incident/216435 Mon, 05 Jun 2023 21:26:00 -0000 https://replicatestatus.com/incident/216435#26c3ff7ac37c9d0b01964289e1d519bc6e8871f80c6a8d747d628332b218681a We're working to resolve an issue preventing models from autoscaling. Prediction webhook delivery interrupted https://replicatestatus.com/incident/211616 Wed, 24 May 2023 19:36:00 -0000 https://replicatestatus.com/incident/211616#a60c5151f930afb4127aefd5bd2952b160a2e69cffb92368b267ee05ea00859c Prediction webhooks are now working as intended. We made a change intended to fix a subtle bug in webhook handling, and unfortunately introduced a much less subtle bug: "completed" webhooks were dropped and not correctly delivered. We rolled that change back and are now working on a permanent fix for both bugs. Prediction webhook delivery interrupted https://replicatestatus.com/incident/211616 Wed, 24 May 2023 19:20:00 -0000 https://replicatestatus.com/incident/211616#64c9f6229b0c3632d97f0ce81d7ff3e710239c5fa85fa82f80818bb18c287da6 We are investigating problems with prediction webhook delivery. 500s and slow responses from replicate.com website https://replicatestatus.com/incident/210923 Tue, 23 May 2023 09:35:00 -0000 https://replicatestatus.com/incident/210923#6b4cca829e10861d910c9bb2d5c5f305119ff58582ba8bf18a25cf8a6aad60a2 The database issue was automatically resolved. We made some changes to our database schema in the wrong order, and this resulted in a brief interruption of service for the replicate.com website. Running predictions was not affected. 
500s and slow responses from replicate.com website https://replicatestatus.com/incident/210923 Tue, 23 May 2023 09:30:00 -0000 https://replicatestatus.com/incident/210923#92677053503d0f22a4acd4cf08ec18434e0accb271a5dd88e9e3136264d2bd79 The replicate.com website is serving 500s and slow responses.