On Monday, May 1st, our analysis engine began generating incorrect results for approximately 10% of repositories. By May 4th, that figure had risen to around 40%. The bug presented as classes being reported as added or removed when they were not, and files inexplicably jumping to or from an A grade.
Furthermore, some of the corrective measures we took had adverse effects. Some analyses failed, which resulted in error messages within GitHub pull requests. Finally, during cleanup of bad analyses, a few repositories were left with zero analyses in our system, an unexpected condition that produces a confusing user experience.
We resolved the root cause of this issue on May 11th and have since identified the affected repositories and removed any potentially inaccurate analysis data.
This is one of the most significant operational failures in the history of Code Climate. Generating incorrect analysis results at any time is unacceptable to us, and in this case we fell short of both our own expectations and our customers'. We are very sorry that this occurred, and the entire development team is working hard to prevent similar issues in the future.
We are currently in the process of notifying customers about which repositories may have been impacted, issuing credits to customers who may have been significantly affected, and extending all impacted trials.
The following is a detailed breakdown of what happened, including what we know now, what we knew as we worked through the incident, and what we’re doing to ensure this incident does not occur again.
We are in the process of migrating the service that stores our Git repositories to a machine with more disk space. For the past three weeks, new repositories have been created on the new server, and we have been working toward migrating all existing repositories to it. For safety, the relocation process did not immediately delete repos from the old server once they had been copied to the new one. Instead, after the move it updated our application data to begin accessing the repo from the new server, leaving the files on the old server to be cleaned up separately.
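As a rough sketch of that relocation step (the Repo struct, field names, and copy helper below are illustrative assumptions, not our actual code), the move boils down to copying the files and then repointing the application's record for the repo while leaving the old files in place:

```ruby
require "fileutils"

# Hypothetical sketch of the relocation step; the Repo struct and the
# copy mechanism are illustrative, not our actual implementation.
Repo = Struct.new(:slug, :git_server_url)

def relocate_repo(repo, old_root:, new_root:, new_server_url:)
  src = File.join(old_root, repo.slug)
  dst = File.join(new_root, repo.slug)

  # Copy the repository to the new server's storage.
  FileUtils.mkdir_p(File.dirname(dst))
  FileUtils.cp_r(src, dst)

  # Repoint the application at the new server. The files under old_root
  # are intentionally left behind, to be cleaned up separately.
  repo.git_server_url = new_server_url
  repo
end
```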
On April 30th, we kicked off a step in the Git repository migration process to move the remaining repositories to the new server. The process ran overnight, and on the morning of May 1st we hit an ironic problem: our old Git server's disk reached capacity while we were working to relieve it. We are still unclear about the cause of the unexpected spike in disk usage, as we believed at the time that we had enough headroom to complete the migration without any deletions. Regardless, the then-full disk forced us to begin deleting the files for repos that had already been migrated to the new server, and we did so. This emergency operation freed enough space to get us out of the immediate danger zone. On the night of May 3rd, in an effort to gain more disk space headroom, we migrated approximately 30% more of our repos to the new Git server and deleted them from the old one.
On May 4th we began receiving reports of incorrect analyses. Upon investigation, we found that for some repositories, source files were being returned to our analysis application as blank strings. We noticed that while the Git server itself had the correct file contents, the Git server's cache was returning blank strings. Suspecting an issue with our cache, we implemented a strategy to evict blank strings. Unfortunately, while this improved things marginally, it did not resolve the issue.
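An eviction strategy of that kind might look like the following sketch, which treats a cached blank string as a miss, deletes it, and falls through to re-reading the file from the Git server. The Redis-backed store and key scheme here are assumptions, not our actual cache:

```ruby
require "redis"

# Conceptual sketch only: a blank cached value is treated as polluted,
# evicted, and replaced by a fresh read from the Git server.
CACHE = Redis.new

def cached_file_contents(repo_id, path)
  key = "git-cache:#{repo_id}:#{path}"
  value = CACHE.get(key)

  if value.nil? || value.empty?
    CACHE.del(key) if value            # evict the polluted empty entry
    value = yield                      # re-read from the Git server
    CACHE.set(key, value) unless value.empty?
  end

  value
end

# Usage (the block reads the file from the Git service):
#   cached_file_contents(42, "lib/foo.rb") { git_server.cat_file("org/repo", oid) }
```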
We also have a cache in our analysis layer, above the Git server level. This cache stores marshaled Ruby objects from a Git client library we use. After receiving continued reports of bad analyses, on May 6th we incorrectly concluded that the empty strings returned by our Git server had propagated up to our analysis cache. We began implementing strategies to flush the cache of bad data, and then on May 8th, unsatisfied with the incremental approach we were taking, we took a more aggressive step and rotated the cache key. This did result in a dramatic improvement in our analyses, and we received no further reports of invalid analyses after this change. However, we don't believe it completely resolved the issue.
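Rotating a cache key in this sense means changing a version or namespace component that is part of every key, so that all previously written entries simply stop matching. A minimal sketch, with the constant name and key format as assumptions:

```ruby
# Minimal sketch of cache key rotation: bumping the version component makes
# every previously written entry unreachable without deleting anything.
# The constant name and key format are assumptions, not our actual code.
ANALYSIS_CACHE_VERSION = "v2" # was "v1" prior to the rotation

def analysis_cache_key(repo_id, blob_oid)
  "analysis:#{ANALYSIS_CACHE_VERSION}:#{repo_id}:#{blob_oid}"
end
```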
At this point, we still had not identified a root cause. We felt confident our caches had been polluted, but we did not yet understand the source of the pollution, so we implemented additional debugging measures to collect more information. Simultaneously, we began repairing affected analyses, to address the possibility that the cache pollution originated from an ephemeral (though not yet understood) issue, most likely the brief period when one of the Git servers was out of disk space.
Also on May 8th, we deployed instrumentation and logging to track when our cached Git blob data did not match the actual contents on disk. We found no further mismatches on new analyses, supporting the theory that the issue was ephemeral and no longer present.
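A check along these lines compares a cached blob against what the repository on disk actually contains and logs any difference; the use of Rugged and the exact log fields here are assumptions for the sake of the sketch:

```ruby
require "rugged"
require "logger"

# Illustrative mismatch detector: log whenever a cached blob differs from
# the contents actually stored in the repository on disk.
LOGGER = Logger.new($stdout)

def verify_cached_blob(repo_path, blob_oid, cached_contents)
  actual = Rugged::Repository.new(repo_path).lookup(blob_oid).content

  return if actual == cached_contents

  LOGGER.warn(
    "cached blob mismatch repo=#{repo_path} oid=#{blob_oid} " \
    "cached_bytes=#{cached_contents.bytesize} actual_bytes=#{actual.bytesize}"
  )
end
```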
Around this time, we began re-running old analyses that had failed and were able to reproduce the issue. This was a critical finding, because it refuted the theory that the issue was ephemeral. With this information, we took a closer look at the objects in the analysis-level cache. We discovered that these marshaled Ruby objects did not, in fact, hold a reference to the contents of files as we had originally believed. Problematically, each object did hold a reference to the Git service URL to use for remote procedure calls.
When a repository was migrated, this cache key was untouched, and the outdated reference led to cat-file calls being issued to the old server instead of the new one. Ironically, we've now determined that since the cached Ruby objects did not include the Git blob data, the cache never provided any performance benefit. Significantly contributing to the difficulty of debugging this issue, the library we use to read Git repositories in our Git service returns empty strings, rather than raising an exception, if the repository directory does not exist on disk. Armed with a root cause (caches containing stale service URLs), we discovered one other call site that could exhibit the same problematic behavior.
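To illustrate the failure mode (all class and method names below are invented for this sketch, and a local path stands in for the Git service URL): the cached object captures the service reference at creation time, keeps using it after the repository has been migrated away, and the underlying read quietly returns an empty string once the directory is gone.

```ruby
require "open3"

# Hypothetical reconstruction of the failure mode; these classes are
# invented for illustration and are not our actual client library.
class GitClient
  def initialize(repos_root)
    @repos_root = repos_root # stands in for the Git service URL
  end

  # Mirrors the library behavior that hid the bug: a missing repository
  # directory yields "" rather than an exception.
  def cat_file(repo_slug, oid)
    dir = File.join(@repos_root, repo_slug)
    return "" unless File.directory?(dir)

    out, _status = Open3.capture2("git", "-C", dir, "cat-file", "blob", oid)
    out
  end
end

class CachedBlobRef
  def initialize(client, repo_slug, oid)
    @client = client # the service reference is captured at creation time...
    @repo_slug = repo_slug
    @oid = oid
  end

  def contents
    # ...and reused after the repo has been migrated away, so reads hit the
    # old location and silently come back empty.
    @client.cat_file(@repo_slug, @oid)
  end
end

# Marshaling preserves the stale reference along with everything else:
#   ref = CachedBlobRef.new(GitClient.new("/mnt/old-git"), "org/repo", oid)
#   Marshal.load(Marshal.dump(ref)).contents  # => "" once the repo has moved
```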
We changed the cache key for the problematic objects to include the URL of the service they expected. While this very targeted change is not our long-term solution, it resolved the issue by effectively rotating the cache keys of any polluted objects. We feel confident that this addressed the issue. However, out of an abundance of caution, after confirming an approach that wouldn't have negative performance implications, we completely flushed the cache.
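In sketch form, folding the expected service URL into the key means entries written against the old server's URL simply stop matching and age out; the key format and names below are illustrative:

```ruby
# Sketch of the targeted fix: entries cached under the old server's URL no
# longer match, which effectively rotates the keys of any polluted objects.
# The key format is illustrative, not our actual scheme.
def blob_cache_key(git_service_url, repo_id, blob_oid)
  "analysis:blob:#{git_service_url}:#{repo_id}:#{blob_oid}"
end
```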
This issue surfaced a number of opportunities for improving our infrastructure, some of which we've already implemented and some of which we will be implementing or reviewing over the next couple of weeks. These include:
Failing fast on unexpected operating conditions. If you ask our Git service for data and the relevant repository does not exist on disk, we now raise an exception instead of returning an empty string (see the sketch after this list). We’ll be auditing our service for other areas where a “fail fast” approach should be taken.
Removing the non-functional cache that made this issue so severe. As a principle, our preference is to cache either primitives or simple objects that we control. Further, attempting to cache similar kinds of data across multiple layers introduced unnecessary complexity, and we will review our infrastructure for occurrences of this.
Advancing current and future data and service migrations aggressively (or explicitly aborting them) once started. The calendar time since we began the migration to the new Git server, as well as an unrelated in-progress migration to a new Git service written in Go, contributed to the long diagnosis time. We will complete both of these projects as soon as possible.
Ensuring that we do not cache failed response values. It is no longer possible for us to store empty strings in our Git cache, for example, and we are auditing our code for other places where this may be possible.
Supporting graceful deployments of backend services so that we don’t fail analyses when deploying them. While deploying fixes for this issue, we caused additional collateral damage in the form of failed analyses, which would have been avoidable with better deployment processes.
Developing better end-to-end introspection capabilities, so that when necessary we can see a detailed log or call graph of all service invocations performed during the processing of an analysis.
Recording the input and output of server maintenance tasks and shell sessions, so that this information is available for debugging and corrective action should it become necessary.
Segregating responsibilities to minimize the amount of logic and service access performed in our analysis engine, which will increase our agility in rapidly deploying fixes to core components when required. This work has already begun.
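As referenced in the first item above, a conceptual version of the “fail fast” change looks like the following: a missing repository directory now raises instead of producing an empty string that could later be cached. The error class and method names are assumptions for this sketch, not our actual service code.

```ruby
require "open3"

# Conceptual version of the "fail fast" change; the error class and method
# names are assumptions, not our actual service code.
class RepositoryNotOnDisk < StandardError; end

def read_blob(repos_root, repo_slug, oid)
  dir = File.join(repos_root, repo_slug)

  unless File.directory?(dir)
    # Previously this case fell through and yielded "", which could then be
    # cached; now it fails loudly at the source.
    raise RepositoryNotOnDisk, "#{repo_slug} not found under #{repos_root}"
  end

  out, _status = Open3.capture2("git", "-C", dir, "cat-file", "blob", oid)
  out
end
```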
We understand that people depend on Code Climate for timely, accurate static analysis of their source code. In that regard, being "online" but producing incorrect analysis results is worse than a full outage, because our value to our users is deeply connected to their ability to trust the results that we produce.
The entire Code Climate team is very sorry for the impact this has had on our customers. We are investing heavily in ensuring that an analysis issue of this magnitude does not occur again. We appreciate your continued support as we work harder to serve you better!