Update - We made configuration changes to our interaction with Kafka on Monday intended to improve our resiliency to Kafka replica rebalancing events. Unfortunately, while those changes did improve our behavior during rebalancing events, they had other, unanticipated negative consequences that caused an elevated rate of analysis errors under normal operating conditions. Recognizing that the changes had made things worse overall, we reverted them and have seen error rates drop back to normal levels.

For the moment, this means error rates are within acceptable bounds, but it also means we are still not resilient during Kafka replica rebalancing. We are continuing to investigate ways to improve our resiliency to these events without negatively impacting normal reliability (an illustrative sketch of the kind of client settings involved in that trade-off appears after this incident's updates).
Jul 21, 18:37 EDT
Update - We've tracked the source of the analysis errors to Kafka replica rebalancing. We have a configuration change planned for Monday (when we'll see normal traffic levels again) to increase our resiliency to such events. We will continue to monitor stability over the weekend.
Jul 15, 18:03 EDT
Identified - At about 4:04PM EDT, we were alerted to an elevated rate of analysis errors. The rate is back to normal now, and we're investigating the root cause. Users may see "U10" analysis errors on builds from around that time.
Jul 15, 16:29 EDT
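The sketch below is purely illustrative of the kind of Kafka client settings that trade resiliency to broker-side leadership changes (such as those that occur during replica rebalancing) against behavior under normal load. It assumes a Python kafka-python producer; the broker address, topic name, and every value shown are assumptions for illustration, not the configuration referenced in the updates above.

```python
# Illustrative only: a kafka-python producer tuned to ride out partition
# leadership changes during broker-side replica rebalancing. All names and
# values here are assumptions, not Code Climate's actual settings.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],       # assumed broker address
    acks="all",                               # wait for all in-sync replicas
    retries=5,                                # retry sends that fail during a leader election
    retry_backoff_ms=500,                     # pause between retries while leadership settles
    request_timeout_ms=30000,                 # allow slow brokers time before failing a request
    metadata_max_age_ms=60000,                # refresh partition leadership info periodically
    max_in_flight_requests_per_connection=1,  # preserve ordering when retries occur
)

producer.send("analysis-results", b"payload")  # assumed topic name
producer.flush()
```

The tension described in the updates shows up here directly: longer timeouts and more retries help a producer survive a rebalance, but they can also delay or mask failures under normal operating conditions.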
Website Operational
Platform Analysis Operational
Classic Analysis Operational
GitHub Operational
System Metrics: Error Rate, Analysis processing time
Past Incidents
Jul 23, 2016

No incidents reported today.

Jul 22, 2016

No incidents reported.

Jul 20, 2016
Resolved - This incident has been resolved.
Jul 20, 12:12 EDT
Monitoring - Wait time for churn calculations is within normal bounds again. We are continuing to monitor the workers.
Jul 20, 11:47 EDT
Identified - We've been alerted to slow churn calculations (churn is our metric for how frequently a particular source code file changes). Users may find these metrics missing or out of date. We're bringing on more workers to bring calculation times down while we investigate the source of the problem (a rough sketch of such a metric appears after this incident's updates).
Jul 20, 11:42 EDT
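As a rough illustration of the churn metric described above, the sketch below counts how many commits touched each file in a git repository. How Code Climate actually computes churn is not specified in these updates; the git invocation and counting here are assumptions for illustration only.

```python
# Minimal sketch of a file-churn metric: number of commits touching each file.
import subprocess
from collections import Counter

def file_churn(repo_path="."):
    # List the files changed by every commit, one path per line.
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(path for path in out.splitlines() if path)

if __name__ == "__main__":
    for path, changes in file_churn().most_common(10):
        print(f"{changes:4d}  {path}")
```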
Jul 19, 2016
Resolved - Amazon is now reporting full availability, we've regained access to manage our instances, and our system appears healthy.
Jul 19, 17:10 EDT
Update - AWS has confirmed their increased API error rates and has also reported reduced network connectivity, which we believe is the root cause of our degraded service. We do have a set of nodes back online now and analyses are occurring. Unfortunately, demand is high and the AWS API outage is preventing us from registering new nodes to keep up.
Jul 19, 14:56 EDT
Update - Amazon Web Services is currently reporting increased API error rates in our region. We suspect this is the cause of our inability to register analysis nodes into service. It's also making it incredibly difficult to troubleshoot. We'll provide updates as we have them.
Jul 19, 14:33 EDT
Identified - We've been alerted that no analyses are taking place and our analysis worker instances are in a failed state. We're confirming that this is the case and working to correct the cause.
Jul 19, 14:22 EDT
Jul 18, 2016
Resolved - This incident has been resolved.
Jul 18, 12:33 EDT
Monitoring - The affected platform services have recovered. All relevant metrics are now under alerting thresholds.
Jul 18, 12:11 EDT
Update - We've deployed a fix to production which is taking effect for new builds. We're expecting the system to stabilize as active builds finish.
Jul 18, 12:04 EDT
Identified - We're experiencing a delay processing analysis results. We have identified the most likely culprit and are actively working towards a fix.
Jul 18, 11:44 EDT
Jul 17, 2016

No incidents reported.

Jul 16, 2016

No incidents reported.

Jul 14, 2016

No incidents reported.

Jul 13, 2016
Resolved - The fixed engine has been released. Users should no longer experience this error.
Jul 13, 10:02 EDT
Identified - Yesterday, we released an update to our rubocop engine that upgraded the rubocop gem it uses. Unfortunately, the version of the rubocop-rspec plugin the engine uses was not updated and is causing errors for any users of that plugin. We're working on a fix now and should have a new engine released soon.
Jul 13, 09:34 EDT
Jul 12, 2016
Resolved - Analysis wait times have returned to normal.
Jul 12, 13:26 EDT
Monitoring - We're experiencing a delay in analyses starting because our system scaled down when the recent GitHub outage caused a significant decrease in analysis traffic, and the scaled-down system wasn't prepared when normal traffic resumed. We're currently scaling up and should have analysis times back to normal shortly. We're also going to increase the number of instances we hold in reserve during the day to prevent these sorts of issues going forward (a sketch of one way to schedule such a reserve appears after this incident's updates).
Jul 12, 13:00 EDT
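As a hedged sketch of holding more instances in reserve during the day, the example below schedules a higher minimum capacity for an AWS Auto Scaling group during business hours. The group name, action names, sizes, and schedule are made up for illustration; the update above does not describe the actual mechanism used.

```python
# Hedged sketch: schedule a larger daytime reserve for an assumed Auto Scaling
# group of analysis workers. All names, sizes, and times are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # assumed region

# Raise the floor each weekday morning so a traffic lull can't leave us without headroom.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="analysis-workers",   # hypothetical group name
    ScheduledActionName="daytime-reserve",
    Recurrence="0 13 * * MON-FRI",             # 13:00 UTC, roughly 9:00 EDT
    MinSize=10,
    MaxSize=40,
    DesiredCapacity=12,
)

# Drop back to the normal floor overnight.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="analysis-workers",
    ScheduledActionName="overnight-reserve",
    Recurrence="0 1 * * *",                    # 01:00 UTC, roughly 21:00 EDT
    MinSize=4,
    MaxSize=40,
    DesiredCapacity=4,
)
```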
Jul 11, 2016

No incidents reported.

Jul 10, 2016

No incidents reported.

Jul 9, 2016

No incidents reported.