Delayed analysis
Incident Report for Code Climate
Resolved
After detaching the bad node from the Auto Scaling group (ASG), the group's CPU utilization is now accurate and any auto-scaling actions should kick in as usual. We'll continue the investigation tomorrow to determine how to prevent a node that fails to come up properly from reporting as healthy in the ASG and causing a problem like this in the future.
Posted Jun 23, 2016 - 03:25 EDT
Monitoring
We brought on 2 new worker nodes and queue times have returned to normal. It appears that one of our build nodes was present in the pool but not actually taking on work. Because that idle node still counted toward the pool's overall CPU utilization, the figure was artificially low and an auto-scaling action did not kick in properly. We're working to identify the bad node for further investigation.
Posted Jun 23, 2016 - 03:01 EDT
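For illustration only, here is a rough sketch of how a single idle node can hold a pool's average CPU below a scale-out threshold. The node counts, CPU figures, threshold, and function names are all hypothetical and do not reflect our actual configuration or scaling policy.

```python
# Hypothetical numbers: not our real pool size, CPU readings, or threshold.
SCALE_OUT_THRESHOLD = 75.0  # percent average CPU that would trigger scale-out


def average_cpu(utilizations):
    """Mean CPU utilization (percent) across every node counted in the pool."""
    return sum(utilizations) / len(utilizations)


def should_scale_out(utilizations, threshold=SCALE_OUT_THRESHOLD):
    """True when the pool-wide average exceeds the scale-out threshold."""
    return average_cpu(utilizations) > threshold


busy_nodes = [90.0, 90.0, 90.0]  # nodes actually taking on work
idle_node = 5.0                  # node in the pool but not taking on work

with_bad_node = busy_nodes + [idle_node]

print(average_cpu(with_bad_node), should_scale_out(with_bad_node))  # 68.75 False
print(average_cpu(busy_nodes), should_scale_out(busy_nodes))        # 90.0 True
```

In this sketch the working nodes are saturated, yet the average only crosses the threshold once the idle node is no longer counted.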
Identified
Analysis queues are currently backed up by 8 minutes. We're bringing on new workers now.
Posted Jun 23, 2016 - 02:54 EDT