Inaccurate Analysis Results
Incident Report for Code Climate
Postmortem

On Monday, May 1st, our analysis engine began generating incorrect results for approximately 10% of repositories. This percentage increased to around 40% on May 4th. This bug presented as classes being added or removed when they were not, and files inexplicably jumping to or from an A grade.

Furthermore, some of the corrective measures we took had adverse effects. Some analyses failed, which resulted in error messages within GitHub pull requests. And finally, during cleanup of bad analyses, a few repos were left with zero analyses in our system, an unexpected condition that results in a confusing user experience.

We resolved the root cause of this issue on May 11th, and have since identified the affected repositories and removed any potentially inaccurate analysis data.

This is one of the most significant operational failures we've had in the history of Code Climate. Generating incorrect analysis results at any time is unacceptable to us, and in this case we fell short of our own expectations as well as our customers'. We are very sorry that this occurred, and the entire development team is working hard to prevent similar issues in the future.

We are currently in the process of notifying customers about which repositories may have been impacted, issuing credits to customers with potentially significant impact, and extending all impacted trials.

The following is a detailed breakdown of what happened, including what we know now, what we knew as we worked through the incident, and what we’re doing to ensure this incident does not occur again.

Background

We are in the process of migrating our service that stores Git repositories to a machine with more disk space. For the past three weeks, new repositories have been created on the new server, and we have been working towards migrating all existing repositories to it. For safety, the relocation process did not immediately delete repos from the old server once they had been copied to the new one. Instead, after the move it updated our application data to begin accessing the repo from the new server, leaving the files on the old server to be cleaned up separately.
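
As a rough illustration of the bookkeeping involved (a sketch only; the server names, Repo struct, and copy_repo_files helper are all hypothetical, not our actual migration code):

    # Hypothetical sketch of the per-repository relocation step described above.
    OLD_GIT_SERVER = "git-old.internal".freeze
    NEW_GIT_SERVER = "git-new.internal".freeze

    Repo = Struct.new(:slug, :git_server_url)

    # Stand-in for whatever actually copies the repository files between hosts.
    def copy_repo_files(repo, from:, to:)
      puts "copying #{repo.slug} from #{from} to #{to}"
    end

    def relocate!(repo)
      copy_repo_files(repo, from: OLD_GIT_SERVER, to: NEW_GIT_SERVER)
      # Point application data at the new server; reads now go there.
      repo.git_server_url = NEW_GIT_SERVER
      # Deliberately do NOT delete the files on the old server here;
      # cleanup of the old copies happens later as a separate step.
    end

    relocate!(Repo.new("acme/widgets", OLD_GIT_SERVER))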

What Went Wrong

On April 30th, we kicked off a step in the Git repository migration process to move the remaining repositories to the new server. The process ran overnight, and on the morning of May 1st we hit an ironic problem: our old Git server's disk reached capacity while we were working to relieve it. We are still unclear about the cause of the unexpected spike in disk usage, as we believed at the time that we had enough headroom to complete the migration without any deletions. Regardless, the then-full disk required that we begin deleting the files for repos that had already been migrated to the new server, and we did so. This emergency operation freed enough space to get us out of the immediate danger zone. On the night of May 3rd, in an effort to give ourselves more disk space headroom, we migrated approximately 30% more of our repos to the new Git server and deleted their copies on the old server.

On May 4th we began getting reports of incorrect analyses. Upon investigation, we found that for some repositories, source files were being returned to our analysis application as blank strings. We noticed that while the Git server had the correct file contents on disk, the Git server's cache was returning blank strings. Suspecting an issue with our cache, we implemented a strategy to evict blank strings. Unfortunately, while this did improve things marginally, it did not resolve the issue.
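
To make "evict blank strings" concrete, here is a minimal sketch of the kind of read-through guard we mean, using a plain Hash as a stand-in for the cache and a hypothetical read_blob_from_disk helper (none of this is our production code):

    # Hypothetical stand-in for reading the blob straight from the repository on disk.
    def read_blob_from_disk(repo_path, oid)
      "actual contents of #{oid}\n"
    end

    # Read-through lookup that treats a blank cached value as pollution:
    # evict it and fall back to disk instead of returning "".
    # `cache` is anything Hash-like here; the real store is a cache service.
    def blob_with_blank_eviction(cache, repo_path, oid)
      key = "blob:#{repo_path}:#{oid}"
      value = cache[key]

      if value.nil? || value.empty?
        cache.delete(key)                            # evict the polluted entry
        value = read_blob_from_disk(repo_path, oid)
        cache[key] = value unless value.empty?       # never re-cache a blank
      end

      value
    end

    cache = { "blob:repos/acme.git:abc123" => "" }   # a polluted entry
    blob_with_blank_eviction(cache, "repos/acme.git", "abc123")
    # => "actual contents of abc123\n", and the blank entry is gone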

We also have a cache in our analysis layer, above the Git server level. This cache stores marshaled Ruby objects from a Git client library we use. After receiving continued reports of bad analyses, on May 6th we inaccurately concluded that the empty strings returned by our Git server had propagated up to our analysis cache. We began implementing strategies to flush the cache of bad data, and then on May 8th, unsatisfied with the incremental approach we were taking, we took a more aggressive step and rotated the cache key. This did result in a dramatic improvement to our analyses, and we didn't receive reports of invalid analyses after this change. However, we don't believe it completely resolved the issue.
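
"Rotating the cache key" here means bumping a version component that is embedded in every key, so that all previously written entries become unreachable at once. A minimal sketch, with hypothetical names:

    # Every key embeds a version string, so bumping ANALYSIS_CACHE_VERSION
    # orphans all previously written entries in one step.
    ANALYSIS_CACHE_VERSION = "v2" # was "v1" before the rotation

    def analysis_cache_key(repo_id, commit_sha, path)
      ["analysis", ANALYSIS_CACHE_VERSION, repo_id, commit_sha, path].join(":")
    end

    analysis_cache_key(1234, "abc123", "app/models/user.rb")
    # => "analysis:v2:1234:abc123:app/models/user.rb"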

At this point, we still had not identified a root cause. We felt confident our caches had been polluted, but we did not yet understand the source of the pollution. Therefore, we implemented additional debugging measures to collect more information. We simultaneously began a process to repair affected analyses to address the possibility that the cache pollution originated from an ephemeral (although not yet understood) issue, most likely the brief period when one of the Git servers was out of disk space.

Also on May 8th, we deployed instrumentation and logging to track when our cached Git blob data did not match the actual contents on disk. We found no further mismatches on new analyses, supporting the theory that the issue was ephemeral and no longer present.
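
The check itself is conceptually simple: for each blob an analysis touches, compare the cached contents against what is actually on disk and log any mismatch. A sketch with hypothetical helper names (in production, these lookups hit our cache service and the Git repository, respectively):

    require "logger"

    LOGGER = Logger.new($stdout)

    # Hypothetical stand-ins for the real lookups.
    def cached_blob_contents(repo_path, oid)
      ""                  # simulate a polluted cache entry
    end

    def on_disk_blob_contents(repo_path, oid)
      "actual contents\n"
    end

    # Log whenever the cached blob does not match what is on disk.
    def verify_blob(repo_path, oid)
      cached  = cached_blob_contents(repo_path, oid)
      on_disk = on_disk_blob_contents(repo_path, oid)
      return if cached == on_disk

      LOGGER.warn(
        "git blob cache mismatch repo=#{repo_path} oid=#{oid} " \
        "cached_bytes=#{cached.bytesize} disk_bytes=#{on_disk.bytesize}"
      )
    end

    verify_blob("repos/acme.git", "abc123")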

Around this time we began a process of re-running old analyses that had failed, and were able to reproduce the issue. This was a critical learning, because it refuted the theory that the issue was ephemeral. With this information, we took a closer look at the objects in the analysis-level cache. We discovered that these marshaled Ruby objects did not in fact hold a reference to the contents of files as we originally believed. Problematically, the object held a reference to the Git service URL to use for remote procedure calls.

When a repository was migrated, this cache entry was left untouched. Its outdated reference led to cat-file calls being issued to the old server instead of the new one. Ironically, we've now determined that since the cached Ruby objects did not include the Git blob data, the cache never provided a performance benefit of any kind. Significantly contributing to the difficulty of debugging this issue, the library we use to read Git repositories in our Git service returns an empty string, rather than raising an exception, if the repository directory does not exist on disk. Armed with a root cause -- caches containing invalid service URLs -- we discovered one other call site that could exhibit similarly problematic behavior.
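
To make the failure mode concrete, here is a schematic sketch of what was effectively happening (the class, constants, and behavior are simulated stand-ins, not the real client library): the marshaled object captured the service URL it was created with, so after a migration it kept calling the old server, and the missing repository came back as an empty string rather than an error.

    # Schematic of the failure mode; all names are illustrative only.
    OLD_SERVER = "git-old.internal"
    NEW_SERVER = "git-new.internal"
    SERVERS_WITH_REPO = [NEW_SERVER]  # after migration and cleanup, only the new server has the files

    class GitClient
      def initialize(service_url, repo_path)
        @service_url = service_url    # captured at creation time and marshaled with the object
        @repo_path   = repo_path
      end

      def cat_file(oid)
        if SERVERS_WITH_REPO.include?(@service_url)
          "contents of #{oid}\n"
        else
          ""  # like the real library: a missing repo directory yields "" rather than an exception
        end
      end
    end

    # An object cached (via Marshal) before the migration still points at the old server:
    stale = Marshal.load(Marshal.dump(GitClient.new(OLD_SERVER, "repos/acme.git")))
    stale.cat_file("abc123")   # => "" : silently blank, which is what poisoned analyses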

The Fix

We changed the cache key for the problematic objects to include the URL of the service each one expected. While this very targeted change is not our long-term solution, it resolved the issue by effectively rotating the cache keys of any polluted objects. We feel confident that this addressed the issue. However, out of an abundance of caution, after confirming an approach that wouldn't have negative performance implications, we completely flushed the cache.
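
In practical terms (hypothetical key names again), the change folds the expected service URL into the cache key, so an entry written against the old server simply never matches once a repository has moved:

    # Before: the key identified only the repo and commit, so an object cached
    # against the old server was still returned after the migration.
    def client_cache_key_before(repo_id, commit_sha)
      "git-client:#{repo_id}:#{commit_sha}"
    end

    # After: the expected service URL is part of the key, so migrated repos
    # miss the stale entry and rebuild against the correct server.
    def client_cache_key_after(repo_id, commit_sha, service_url)
      "git-client:#{service_url}:#{repo_id}:#{commit_sha}"
    end

    client_cache_key_after(42, "abc123", "git-new.internal")
    # => "git-client:git-new.internal:42:abc123"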

What We’re Doing About This

This issue surfaced a number of opportunities for improving our infrastructure, some of which we’ve already implemented and some of which we will be implementing or reviewing over the next couple weeks. These include:

  1. Failing fast on unexpected operating conditions. If you ask our Git service for data and the relevant repository does not exist on disk, we now raise an exception instead of returning an empty string (see the sketch after this list). We’ll be auditing our service for other areas where a “fail fast” approach should be taken.

  2. Removing the non-functional cache that made this issue so severe. As a principle, our preference is to cache either primitives or simple objects that we control. Further, attempting to cache similar kinds of data across multiple layers introduced unnecessary complexity, and we will review our infrastructure for occurrences of this.

  3. Advancing current and future data and service migrations aggressively (or explicitly aborting them) once started. The calendar time since we began the migration to the new Git server, as well as an unrelated in-progress migration to a new Git service written in Go, contributed to the long diagnosis time. We will complete both of these projects as soon as possible.

  4. Ensuring that we do not cache failed response values. It is no longer possible for us to store empty strings in our Git cache, for example (also shown in the sketch after this list), and we are auditing our code for other places where this may be possible.

  5. Supporting graceful deployments of backend services so that we don’t fail analyses when deploying them. While deploying fixes for this issue, we caused additional collateral damage in the form of failed analyses, which would have been avoidable with better deployment processes.

  6. Developing better end-to-end introspection capabilities so that we are able to see a detailed log or call graph of all service invocations performed during the processing of an analysis when necessary.

  7. Taking measures to ensure we record input and output from server maintenance tasks and shell sessions in order to be able to use that information for debugging and corrective actions should that become necessary.

  8. Segregating responsibilities to minimize the amount of logic and service access performed in our analysis engine, which would increase our agility in rapidly deploying fixes to core components when required. This work has already begun.
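
As referenced in items 1 and 4 above, here is a combined sketch of those two changes (RepoMissing, read_blob, and the Hash-like cache are hypothetical names, not our actual service code): the Git service raises when the repository directory is missing instead of returning a blank string, and the cache layer refuses to store blank values even if one slips through.

    class RepoMissing < StandardError; end

    # Item 1: fail fast if the repository directory is not on disk,
    # instead of silently returning "".
    def blob_contents(repo_path, oid)
      raise RepoMissing, "no repository at #{repo_path}" unless Dir.exist?(repo_path)

      read_blob(repo_path, oid)
    end

    # Item 4: never persist a blank/failed response value.
    def write_blob_to_cache(cache, key, value)
      return value if value.nil? || value.empty?

      cache[key] = value
      value
    end

    # Stand-in for the real blob read.
    def read_blob(repo_path, oid)
      "contents of #{oid} in #{repo_path}\n"
    end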

Summary

We understand that people depend on Code Climate for timely, accurate static analysis of their source code. In that regard, being "online" but producing incorrect analysis results is worse than a full outage, because our value to our users is deeply connected to their ability to trust the results that we produce.

The entire Code Climate team is very sorry for the impact this has had on our customers. We are investing heavily in ensuring that an analysis issue of this magnitude does not occur again. We appreciate your continued support as we work harder to serve you better!

Posted May 13, 2015 - 14:08 EDT

Resolved
We have finished removing all bad analyses and are regenerating correct data. Repos might be missing analyses while we do that, but clicking the refresh button on the repo or branch comparison will enqueue an analysis for that specific view. A full post-mortem will be posted later today.
Posted May 13, 2015 - 13:15 EDT
Update
The code path mentioned yesterday that we feared could be generating invalid analyses has been addressed -- we deleted the potentially impacted analyses and bumped our caches. Our data analysis suggests that this pathway created few to no actual problems in the past three days; however, in an effort to be comprehensive, we've removed those analyses anyway.

The code path could have potentially returned an empty string for the listing of files in a repository, so if you were impacted by this problem you would have seen reports (Feed/emails/integration notifications) of all files in the repository being added or removed (yes, pretty dramatic).

Since this code path was addressed and bad analyses have been removed, we are currently not aware of any remaining issues with analyses, with the exception of some security analyses run between 5/1/15 and 5/4/15 that we're still actively investigating.

We're very sorry again for the trouble here. We are continuing to treat this as our topmost priority by doing all that is possible to prevent new invalid analyses from being run, as well as fixing those that have already run. We will continue to post updates as new data is available.
Posted May 12, 2015 - 14:06 EDT
Update
We have now patched the additional pathway mentioned earlier. In order to apply this patch, we will unfortunately need to skip the next analysis/commit for each repository. Doing so will further reduce the chance of any poison cache issues affecting our system going forward.
Posted May 11, 2015 - 19:59 EDT
Update
We have identified one additional pathway in which the poison cache could still be in play.

We have also identified that -- in addition to our code quality analysis -- some Rails security scans that were run between 5/1/15 and 5/4/15 may also have been affected by the poison cache issue.

We are now actively investigating both issues above. This unfortunately means that we are not yet 100% confident that the analyses we are currently creating are valid. We are simultaneously investigating ways to ensure these analyses fail harder until we can address this issue. We’re very sorry for the trouble, and will continue to provide regular updates on this issue.
Posted May 11, 2015 - 17:48 EDT
Update
For any repository that was previously left with zero analyses in our system (see the previous update below), a new analysis has now been successfully regenerated.

For all other repositories affected by this issue, we are still in the process of regenerating their invalid/deleted analyses. We will continue to post updates as this process progresses.
Posted May 11, 2015 - 14:45 EDT
Update
We have now deleted all older analyses that contained invalid data, and we have also started regenerating each analysis.

In the meantime, some repositories may be left with zero analyses in our system, which will result in the following error message being displayed: "There's an issue with this repo. Please contact us." We are working on prioritizing our analysis regeneration to run against these repositories first.
Posted May 11, 2015 - 11:51 EDT
Update
We determined the source of the poisoned cache, and it turns out it was not the deployment of our git service as we had previously theorized. While we were already on track to prevent it from recurring, we can now be very confident that the measures we have put in place will prevent this from happening again.

We are still generating good new analyses going forward. However, some comparisons involving older analyses may still be incorrect. We will work first to delete these older analyses to purge the system of invalid data, before going back and regenerating them.
Posted May 10, 2015 - 16:52 EDT
Update
We have deployed a more aggressive cache expiration method that will help speed up the recovery process. Repos will miss a comparison because of this change, but it will help us get everything back to normal sooner. Any new analyses will be correct with this change, but affected or failed analyses from earlier in the week may not yet be fixed. We have fixed many of these older analyses and our script is working through the rest.
Posted May 08, 2015 - 17:11 EDT
Monitoring
Late Sunday night, we deployed one of our internal services responsible for serving up git data. We believe now that this deploy polluted the cache in this service, which began reporting that some source files were empty when, in fact, they were not.

On Monday, after customer reports about invalid analyses alerted us to the issue, we deployed a fix to this service to ensure that we never return invalid empty strings. We started monitoring the fix to determine if it was sufficient.

We continued to receive intermittent reports about bad analyses. Because the issue was historical, it was hard to determine whether newly reported problems were left over from the original occurrence or were new cases of it. We also noticed the strange behavior that re-running analyses would partially fix the issue over time.

After a lot of digging, we could see that the bad cache values had propagated to other areas of the application. As of last night, we are running a script to re-run potentially bad analyses and delete bad cache values. This will take a bit of time, but because caches evict older values over time, the cache will eventually correct itself, possibly even before our corrective script finishes.

We’re really sorry about this. The accuracy of our alerts and analysis is extremely important to us. We never like to see bad data go out and wish we’d been able to catch this sooner. In addition to the corrective action we took, we’ll be doing further root cause analysis on this issue over the next few days. We will update this space with a fuller post-mortem when everything is resolved.
Posted May 07, 2015 - 15:20 EDT
Identified
We've identified an issue with multiple caches containing empty values for file contents. We're in the process of repairing the impacted snapshots.
Posted May 06, 2015 - 23:26 EDT
Update
We're pausing our analysis while we investigate this issue further.
Posted May 06, 2015 - 17:32 EDT
Investigating
We are investigating reports of our git server returning incorrect data, causing comparisons to show that files were added and/or removed when they were not.
Posted May 06, 2015 - 12:30 EDT