We'd like to provide an update with the results of our analysis of yesterday's incident that impacted the Code Review report and code review metrics.
Yesterday, July 16th around 7:30pm EDT, a user reported experiencing an error on the All People page. In response, we initiated an investigation and found that this issue had a broader impact. We have a 10-second timeout set for the queries that power the reports, and we identified that all queries related to code review metrics were timing out. We put up a status page and a Code Climate engineer on call continued investigating the root cause of the issue.
At 8:21pm EDT, we identified that the queries related to code review were not using an appropriate index. Upon further inspection, we determined that earlier that day (at 4:15pm EDT) a migration was executed that removed an unused column, which unintentionally removed a compound index that referenced that column. At 9:00pm we added back the missing index excluding the removed column which immediately restored the functionality related to code review metrics.
During the root cause analysis discussion, we identified we were not proactively notified of these timeouts because our timeout alerts were grouped together for all pages. This made it difficult to detect that a new report had recently become impacted and why our time to discover was not up to par with our standards. As an action item, we have updated our alerting to be notified of timeouts on a per-report basis which will allow us to promptly investigate any future issues and reduce customer impact.
We apologize for the inconvenience caused by this incident.