Monitoring - A fix has been implemented and we are monitoring the results.
Jun 23, 12:49 EDT
Update - We're continuing to investigate the source of Wednesday's slow database inserts, but haven't yet identified a root cause or causes. We have identified some expensive queries that we're going to optimize or remove, and we're working with our database vendor to identify any other problems (a sketch of how such queries can be surfaced follows this incident's timeline). While we did see a brief period of slower-than-normal performance yesterday afternoon (2:45-2:50 PM EDT), customer impact was low, with analysis results delayed by no more than 30 seconds.
Jun 17, 11:26 EDT
Update - Insert times have returned to normal. We're still investigating the root cause.
Jun 15, 12:09 EDT
Investigating - Our service responsible for inserting analysis results into the database is reporting a delay of up to 3 minutes in handling its messages. Customers may see analysis events and results delayed as a result. We're investigating.
Jun 15, 11:52 EDT
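The query-optimization work mentioned in the Jun 23 update above is the kind of task a statement-statistics view helps with. The sketch below is minimal and hypothetical: it assumes a PostgreSQL-style database with the pg_stat_statements extension enabled, and the connection string and column names are placeholders rather than a description of our actual setup.

    # Hypothetical: list the queries consuming the most total execution time.
    # Assumes PostgreSQL with pg_stat_statements; adjust connection details
    # and column names for the database actually in use.
    import psycopg2

    conn = psycopg2.connect("dbname=analysis host=db.internal.example")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT query, calls, total_time, mean_time
            FROM pg_stat_statements
            ORDER BY total_time DESC
            LIMIT 10
        """)
        for query, calls, total_time, mean_time in cur.fetchall():
            print("%10.1f ms total, %8.1f ms mean, %6d calls: %s"
                  % (total_time, mean_time, calls, query[:80]))

Queries near the top of that list are the natural candidates to optimize or remove.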
Website Operational
Platform Analysis Operational
Classic Analysis Operational
GitHub Operational
System Metrics
Error Rate (chart not shown)
Analysis processing time (chart not shown)
Past Incidents
Jun 30, 2016

No incidents reported today.

Jun 29, 2016
Resolved - We confirmed the cause of the U10 analysis errors was the Kafka leader election. We haven't seen such an error since approximately 11:15 EDT. Leader elections are an infrequent but expected event in our system, and we have work planned to make clients more resilient to them. In response to this incident, we will increase the priority of that work.
Jun 29, 12:38 EDT
Monitoring - We were alerted to an increase in what we call "U10" analysis errors (errors with an unknown or unexpected cause) between 11:07 and 11:12 EDT. We're currently confirming a suspected root cause (leader election in our Kafka cluster), verifying the cluster is stable now, and exploring ways to make our analysis more resilient to that event in the future.
Jun 29, 11:18 EDT
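For context on the resiliency work mentioned above: during a Kafka leader election, producers briefly receive retriable "not leader" errors, and client retry settings decide whether those surface as failures like our U10 errors or are absorbed silently. The sketch below is hypothetical and assumes the confluent-kafka Python client; our actual client library and configuration may differ.

    # Hypothetical producer settings that ride out a leader election by
    # retrying sends instead of reporting an error. Topic name and broker
    # addresses are placeholders.
    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "kafka1:9092,kafka2:9092",
        "acks": "all",                 # wait for acknowledgement from the new leader
        "retries": 10,                 # retry transient "not leader" errors
        "retry.backoff.ms": 500,       # give the election time to settle
        "message.timeout.ms": 120000,  # overall bound on retrying one message
    })

    def on_delivery(err, msg):
        # Only errors that survive every retry reach this callback.
        if err is not None:
            print("delivery failed after retries:", err)

    producer.produce("analysis-results", value=b"...", on_delivery=on_delivery)
    producer.flush()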
Jun 28, 2016
Resolved - We consider this incident resolved. We're going to increase our normal worker count to prevent a recurrence and better handle this kind of increase in churn-calculation workload.
Jun 28, 12:56 EDT
Monitoring - The new workers have brought calculation time back to normal levels. We'll continue investigating the root cause.
Jun 28, 11:59 EDT
Identified - We've been alerted to slow churn calculations (churn is the metric for how frequently a particular source code file changes). Users may find these metrics missing or up to 8 minutes out of date. We're bringing on more workers to bring that time down while we investigate the source of the problem.
Jun 28, 11:48 EDT
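For readers unfamiliar with the churn metric described above, it is essentially a count of how often each source file changes over some window. A minimal, hypothetical sketch using plain git history follows; the production pipeline's actual data source and windowing are not described in this update.

    # Hypothetical churn calculation: count per-file changes in the last 30
    # days of a repository's git history and print the ten busiest files.
    import subprocess
    from collections import Counter

    log = subprocess.run(
        ["git", "log", "--since=30 days ago", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout

    churn = Counter(path for path in log.splitlines() if path)
    for path, changes in churn.most_common(10):
        print(changes, path)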
Jun 27, 2016

No incidents reported.

Jun 26, 2016

No incidents reported.

Jun 25, 2016

No incidents reported.

Jun 24, 2016

No incidents reported.

Jun 23, 2016
Resolved - We identified and corrected the 3 impacted repositories. Sorry for any inconvenience.
Jun 23, 15:04 EDT
Identified - We shipped a code change that may have broken adding private repositories for some users. The change has been rolled back and adding repositories should work for everyone again at this time. We're identifying and correcting affected repositories now.
Jun 23, 14:53 EDT
Resolved - After detaching the bad node from the ASG, the CPU utilization of the group is now accurate and auto-scaling actions should kick in as usual (a sketch of the detach step follows this incident's timeline). We'll continue investigating tomorrow to determine how to prevent a node that fails to come up properly from reporting as healthy in the ASG and causing a problem like this again.
Jun 23, 03:25 EDT
Monitoring - We brought on 2 new worker nodes and queue times have returned to normal. It appears that one of our build nodes was present in the pool but not actually taking on work. This meant the overall CPU utilization of the pool was artificially low, so an auto-scaling action did not kick in properly. We're working to identify the bad node for further investigation.
Jun 23, 03:01 EDT
Identified - Analysis queues are currently backed up by 8 minutes. We're bringing on new workers now.
Jun 23, 02:54 EDT
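For context on the remediation in this incident: detaching the zombie node removes its idle CPU from the group average that the scaling policy reads, so scale-out decisions are based on the workers actually taking on jobs. The sketch below is hypothetical, uses boto3, and substitutes placeholder instance and group names; it is not a record of the exact commands we ran.

    # Hypothetical: detach a misbehaving instance from its auto scaling group
    # so it no longer skews the group's average CPU utilization.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    autoscaling.detach_instances(
        AutoScalingGroupName="analysis-workers",
        InstanceIds=["i-0123456789abcdef0"],
        # Keep desired capacity unchanged so a healthy replacement launches.
        ShouldDecrementDesiredCapacity=False,
    )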
Jun 22, 2016

No incidents reported.

Jun 21, 2016

No incidents reported.

Jun 20, 2016

No incidents reported.

Jun 19, 2016

No incidents reported.

Jun 18, 2016

No incidents reported.

Jun 16, 2016

No incidents reported.