Data processing delays
Incident Report for Code Climate
Postmortem

We’d like to provide an initial update on our ongoing investigation and analysis related to database instability issues that impacted the Velocity application earlier this week.

On Monday, May 20th, we began to see sporadic occurrences of an unexpected error affecting a small portion of queries to our AWS Aurora Postgres cluster. These errors resulted in 500 responses when accessing certain Velocity pages, but not at a high enough rate to trigger our alerts. We nevertheless investigated; because the errors appeared to be related to internal Aurora functions and subsided on their own, we incorrectly considered the incident resolved.
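Sporadic, low-volume errors like these can slip under a traditional rate-threshold alert. One remediation we are evaluating is alerting on the *persistence* of a specific error class across sampling windows rather than on overall rate alone. A minimal sketch of that idea (the function name and window parameters are illustrative assumptions, not our actual monitoring code):

```python
from collections import deque

def should_alert(error_counts, window=6, min_windows_with_errors=4):
    """Alert when a specific error class appears in most of the
    recent sampling windows, even if the overall error rate stays
    below a conventional threshold.

    error_counts: per-window counts of the error class
    (e.g. one count per 5-minute window), oldest first.
    """
    recent = deque(error_counts, maxlen=window)  # keep only the last `window` samples
    windows_with_errors = sum(1 for count in recent if count > 0)
    return windows_with_errors >= min_windows_with_errors

# Sporadic but persistent errors trigger the alert...
print(should_alert([1, 0, 2, 1, 0, 1]))  # True
# ...while an isolated blip does not.
print(should_alert([0, 0, 0, 1, 0, 0]))  # False
```

The point of the persistence check is that a fault confined to a small fraction of traffic can still be a real, ongoing incident; counting affected windows surfaces it without lowering the rate threshold and flooding on-call with noise.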

On Tuesday, May 21st at 9:28am (EDT), we were alerted to a delay in the data processing subsystems responsible for updating the metrics that power our reporting interfaces. Upon examination, we observed elevated load on our secondary AWS Aurora Postgres instance.

After manual intervention to reduce database load failed, our site reliability engineers brought up a new secondary instance in an attempt to resolve the issue. This operation was unsuccessful, and after further diagnosis we identified an elevated failure rate for a query accessing a subset of rows in at least one table, suggesting that the issue we had observed on Monday was still ongoing.
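Narrowing a failure like this down to specific rows is typically done by probing ranges of primary keys and bisecting whichever range fails. A simplified sketch of that approach (the `probe` callable and the table name in the comment are illustrative assumptions standing in for the real diagnostic query, not our actual tooling):

```python
def find_bad_ids(ids, probe):
    """Bisect a list of primary keys to isolate rows whose reads fail.

    probe(ids) -> True if selecting exactly those rows succeeds.
    In production, probe might run something like
    `SELECT 1 FROM metrics WHERE id = ANY(%s)` (table name is
    hypothetical) and catch the database error; here it is
    injected so the bisection logic is self-contained.
    """
    if probe(ids):
        return []            # this whole range reads cleanly
    if len(ids) == 1:
        return list(ids)     # narrowed down to a single bad row
    mid = len(ids) // 2
    return find_bad_ids(ids[:mid], probe) + find_bad_ids(ids[mid:], probe)

# Example: suppose rows 3 and 7 are unreadable.
bad = {3, 7}
probe = lambda ids: not (bad & set(ids))
print(find_bad_ids(list(range(10)), probe))  # [3, 7]
```

Bisection keeps the number of probe queries logarithmic in the key range per bad row, which matters when each failing query is itself slow or disruptive to a loaded cluster.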

During this period, we disabled the background processes responsible for data updates and notified customers via in-app messaging and our status page that their data might be temporarily out of date. Certain requests to our web interface also returned 500 errors, leaving some portion of site functionality unavailable.

We worked with AWS technical support to understand what was causing those rows to be inaccessible. Through diagnostics, AWS was eventually able to narrow the problem to inconsistent Aurora internal metadata for a small portion of rows across a subset of our tables. Working with us, AWS then ran a repair operation to correct the internal metadata. After this operation, and after a period of catching up on data updates, we resumed full service of the Velocity application.

We are working with the Aurora team at AWS to conduct a root cause analysis of the errors accessing the affected rows. We are also conducting an internal root cause analysis to identify opportunities to improve our incident response. We will post the results of both analyses to this status page.

Posted May 24, 2019 - 15:30 EDT

Resolved
This incident has been resolved.
Posted May 22, 2019 - 12:57 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 22, 2019 - 00:38 EDT
Update
We've turned back on all components of our data processing pipeline. We're now processing the backlog of work. Reports may still be missing the most recent data.
Posted May 21, 2019 - 23:41 EDT
Update
We've taken some manual actions to repair the database problem. We're still not processing new data while we perform some additional tests and await some information from our database vendor.
Posted May 21, 2019 - 21:39 EDT
Update
Another update: We're continuing to work through a database issue that is preventing our system from processing new data. There is a partial outage that affects some users. Once again, we apologize for the inconvenience and will update this feed as new information becomes available.
Posted May 21, 2019 - 18:22 EDT
Identified
An update: We have identified an issue writing certain reporting data. We've stopped processing new data for now and we're taking steps to remedy the issue. We apologize for the inconvenience, and we'll share an update as soon as we have more information.
Posted May 21, 2019 - 14:30 EDT
Investigating
We're investigating the cause of database unhappiness, which is causing delays in keeping our reports up to date. Stay tuned.
Posted May 21, 2019 - 09:28 EDT
This incident affected: Code Climate Velocity.