Resolved
Fully resolved. No more latency impact in production. Post-mortem to come.
Monitoring
We moved off the previous cloud provider because we were seeing high queue times there. Latency now appears to be within a normal range, though on the new cloud provider it is still ~50% higher than our usual levels. We are continuing to investigate the root cause and a way to return to normal latency without downtime.
Monitoring
We are continuing to monitor. Latency remains elevated, but jobs are completing at a regular cadence. We are still investigating the latency impact.
Monitoring
503s are currently reduced after we applied a mitigation, and we are monitoring for any additional issues.
Investigating
We are investigating elevated 503 errors and pending events on the API.