Update - On Monday, we made configuration changes to our Kafka interaction intended to improve our resiliency to Kafka replica rebalancing events. Unfortunately, while those changes did improve our behavior during rebalancing events, they had other, unanticipated negative consequences that caused an elevated rate of analysis errors under normal operating conditions. Recognizing that the changes had unintentionally made things worse overall, we reverted them and have seen error rates drop to normal levels.
For the moment, this means that error rates are within acceptable bounds, but it also means that we are still not resilient during Kafka replica rebalancing. We are continuing to investigate how to improve our resiliency to these events without negatively impacting normal reliability.
Jul 21, 18:37 EDT
Update - We've tracked the source of the analysis errors to Kafka replica rebalancing. We have a configuration change planned for Monday (when we'll see normal traffic levels again) to increase our resiliency to such events. We will continue to monitor stability over the weekend.
Jul 15, 18:03 EDT
Identified - At about 4:04 PM EDT, we were alerted to an elevated rate of analysis errors. The rate is back to normal now, and we're investigating the root cause. Users may see "U10" analysis errors on builds from around that time.
Jul 15, 16:29 EDT
This incident has been resolved.
Jul 20, 12:12 EDT
Wait time for churn calculations is within normal bounds again. We are continuing to monitor the workers.
Jul 20, 11:47 EDT
We've been alerted to slow churn calculations (the metric for the frequency of changes to a particular source code file). Users may find these metrics don't show up or aren't up to date. We're bringing on more workers to bring that time down, while investigating the source of the problem.
Jul 20, 11:42 EDT
Amazon is now reporting full availability, we've regained access to manage our instances, and our system appears healthy.
Jul 19, 17:10 EDT
AWS has confirmed their increased API error rates and also notified us of reduced network connectivity, which we believe is the root cause of our degraded service. We do have a set of nodes back online now, and analyses are occurring. Unfortunately, demand is high and the AWS API outage is preventing us from registering new nodes to keep up.
Jul 19, 14:56 EDT
Amazon Web Services is currently reporting increased API error rates in our region. We suspect this is the cause of our inability to register analysis nodes into service. It's also making it incredibly difficult to troubleshoot. We'll provide updates as we have them.
Jul 19, 14:33 EDT
We've been alerted that no analyses are taking place and our analysis worker instances are in a failed state. We're confirming that this is the case and working to correct the cause.
Jul 19, 14:22 EDT
The fixed engine has been released. Users should no longer experience this error.
Jul 13, 10:02 EDT
Yesterday, we released an update to our rubocop engine that upgraded the rubocop gem it uses. Unfortunately, the version of the rubocop-rspec plugin the engine uses was not updated, which is causing errors for any users of that plugin. We're working on a fix now and should have a new engine released soon.
Jul 13, 09:34 EDT
Analysis wait times have returned to normal.
Jul 12, 13:26 EDT
We're experiencing a delay in analyses starting because our system scaled down when the recent GitHub outage caused a significant decrease in analysis traffic. The scaled-down system wasn't prepared when normal analysis traffic resumed. We're currently scaling up and should have analysis times back to normal shortly. We're going to increase the number of instances we hold in reserve during the day to prevent these sorts of issues going forward.
Jul 12, 13:00 EDT