Elevated analysis errors
Incident Report for Code Climate
Postmortem

From June 16th to July 27th, Code Climate experienced an increase in the rate of analysis errors identified by the code U10, from 0.05% to 0.17%. While we have a number of different analysis error codes, each tied to a specific cause and result, the cause of a U10 error is “unknown” at the time it occurs: it is unexpected and indicates a problem we need to investigate.

The analysis errors were effectively random across running builds. If your analysis failed with a U10 error code during this period (typically indicated by an email and a message on the affected build page), you were likely impacted by this incident, and we apologize for the disruption.

We found varied and complicated reasons for the increase, and our visibility into what had been occurring was poor. These factors made the investigation difficult and time-consuming: it took approximately three weeks to bring the error rate back to an acceptable level.

We identified and corrected multiple independent root causes as described below. Our U10 error rate since July 27th is the lowest it's ever been at 0.001%. We've also corrected our lack of visibility to prevent an increase in error rates from going undetected like this again.

Technical Details

For those interested, the following is a detailed description of the problems we identified and the ways in which we fixed them. Beware, distributed systems and domain-specific jargon ahead!

There were two sources of U10 errors unrelated to networking issues or Kafka leader elections, which had been our suspected root cause for much of the incident. These sources were contributing to the error rate, increasing in frequency, and making it hard to gauge how our efforts to resolve the incident were going:

  1. Occasionally, our producers would encounter an invalid memory condition in the system's gzip library, which raised an exception when they tried to compress a message before sending it. We implemented a simple retry around this operation (see the sketch after this list), and the exception hasn't recurred.
  2. When an engine runs for too long or generates too much output, we kill it and raise a useful E-class error. The operation to kill the engine has a 2-minute timeout of its own. Sometimes Docker would fail to kill the container within those 2 minutes, and the resulting timeout exception would be raised. Our solution was to ignore that exception so that the ongoing (and better) error handling could complete.
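
As an illustration of these two workarounds, here is a minimal Python sketch; the function names are hypothetical and this is not our production code:

    import gzip
    import subprocess

    def compress_with_retry(payload, attempts=3):
        # Retry the compression a few times; the failure we saw was rare
        # and transient, so a simple retry was enough to make it go away.
        last_error = None
        for _ in range(attempts):
            try:
                return gzip.compress(payload)
            except Exception as error:  # stands in for the library error we observed
                last_error = error
        raise last_error

    def kill_engine_container(container_id, timeout_seconds=120):
        # Ask Docker to kill the container, but tolerate the kill operation
        # itself timing out so the surrounding (and better) E-class error
        # handling can still run to completion.
        try:
            subprocess.run(["docker", "kill", container_id], timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            pass  # Docker didn't finish in time; let the existing error handling proceed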

By July 21st, we had addressed the two sources above, reverted an attempted fix from July 8th (which we believe made things worse), and identified and replaced a bad Kafka instance that was triggering a high rate of leader elections. Those actions were enough to bring our U10 error rate down to an acceptable level; however, we were still seeing another kind of U10 pop up sporadically: a document counts mismatch error.

When we run an analysis, we emit issue documents to Kafka. A downstream service then reads them out of Kafka and pushes them to our database. When the analysis completes, the process that did the emitting sends a message indicating how many issues it emitted. When another downstream service goes to finalize the analysis, it checks how many issues are actually in the database. If the numbers don't match, the result is a document counts mismatch error. This check ensures that if any networking problems cause issues to be lost or duplicated, we won't mistakenly report them as fixed or new.
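
The check itself is conceptually simple. Here is a minimal Python sketch of the idea, using hypothetical names and a plain dictionary as a stand-in for the real database query:

    class DocumentCountsMismatch(Exception):
        """Raised when the emitted and persisted issue counts disagree."""

    def finalize_analysis(analysis_id, reported_count, stored_counts):
        # `stored_counts` maps analysis id -> number of issue documents
        # that actually reached the database.
        stored = stored_counts.get(analysis_id, 0)
        if stored != reported_count:
            # Surfaced as a U10 rather than silently reporting issues as
            # fixed or new based on incomplete data.
            raise DocumentCountsMismatch(
                f"analysis {analysis_id}: producer reported {reported_count} "
                f"issues, database contains {stored}"
            )

    # Example: the producer said 3 issues were emitted, but only 2 arrived.
    try:
        finalize_analysis("abc123", 3, {"abc123": 2})
    except DocumentCountsMismatch as error:
        print(error)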

These were the errors we'd been attributing primarily to leader elections, but we were now finding that they could happen to some degree at any time. These were the errors we needed to explain and address before we would consider the incident resolved.

Kafka guarantees at-least-once delivery: it will retry a message unless it's 100% sure the message was delivered. This is in contrast to at-most-once delivery, where it would not retry a message unless it's 100% sure the message was not delivered. With at-least-once delivery you can expect nothing to be lost but possible duplicates; with at-most-once delivery you can expect no duplicates but messages may be lost. To build the appearance of exactly-once delivery, the standard approach is to use at-least-once delivery and then remove any duplicates when you read the data back.
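
For example, a consumer can remove duplicates by skipping any record whose key it has already stored. A minimal Python sketch, using a plain dictionary as the datastore:

    def consume_exactly_once(messages, store):
        # `messages` is any iterable of (key, value) pairs delivered with
        # at-least-once semantics, so the same pair may appear more than
        # once. Skipping keys we've already stored gives everything
        # downstream of `store` the appearance of exactly-once delivery.
        for key, value in messages:
            if key in store:
                continue  # duplicate redelivery; drop it on read
            store[key] = value

    # Example: ("a", 1) is delivered twice but stored only once.
    store = {}
    consume_exactly_once([("a", 1), ("b", 2), ("a", 1)], store)
    assert store == {"a": 1, "b": 2}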

With this understanding, we aimed to correct the mismatch-over case (when we thought we emitted X issues but ended up seeing more than X issues in the database). We did this by pre-computing the database primary key and including it with the emitted issue. This way, if our consumer sees the same issue twice, the second will fail with a duplicate key exception and not end up in the database.
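
Conceptually, the pre-computed key just needs to be derived deterministically from attributes known at emit time, so that a redelivered copy of the same issue computes the same key and collides on insert. A minimal Python sketch, assuming a hypothetical analysis id and issue fingerprint as the inputs:

    import uuid

    def issue_primary_key(analysis_id, issue_fingerprint):
        # A name-based UUID is stable: the same inputs always produce the
        # same key, so the second delivery of an issue hits a duplicate
        # key instead of creating a second row.
        return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{analysis_id}/{issue_fingerprint}"))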

This was one of the necessary fixes, but on its own it actually made things worse overall. We deployed it once, then reverted it until we were able to understand the following:

Our consumer, the one that reads issues out of Kafka and puts them in the database, does so in batches for performance reasons. What we didn't realize was that if any record in a batch violated a unique key, the whole batch was dropped. When this was originally implemented, we considered a unique key violation a rare and unexpected occurrence, not a normal part of our architecture's design.

This explains why we would see mismatch-under cases even though Kafka was guaranteeing at-least-once delivery and nothing was being dropped in transit. If a retried message violated a unique key, we'd lose the whole batch and find fewer issues than expected in the database. If a retried message didn't violate a unique key, it would be accepted and we'd find more than expected. By introducing the pre-generated ids, we actually increased the unique key violations, which in turn increased the dropped batches and the mismatch-under cases.

Eventually, we implemented the following overall fix:

  • Assume Kafka can provide at-least-once delivery
  • Pre-generate primary keys so duplicate messages aren't accepted by the database
  • When our database-writing consumer encounters a duplicate-key exception, don't abort the whole batch; instead, retry the records one at a time so the ones that aren't duplicates are still inserted (see the sketch below)
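
As a minimal sketch of that last point, here is what a batch insert with a per-record fallback might look like, using Python and SQLite purely for illustration (our production consumer differs in the details):

    import sqlite3

    def insert_issue_batch(conn, rows):
        # `rows` is a list of (id, payload) tuples with pre-generated ids.
        sql = "INSERT INTO issues (id, payload) VALUES (?, ?)"
        try:
            with conn:  # one transaction for the whole batch
                conn.executemany(sql, rows)
        except sqlite3.IntegrityError:
            # At least one row is a duplicate. Instead of dropping the
            # batch, retry row by row so every non-duplicate still lands.
            for row in rows:
                try:
                    with conn:
                        conn.execute(sql, row)
                except sqlite3.IntegrityError:
                    pass  # a redelivered message we've already stored

    # Example: the duplicate insert of id "1" is skipped, but "3" still lands.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE issues (id TEXT PRIMARY KEY, payload TEXT)")
    insert_issue_batch(conn, [("1", "a"), ("2", "b")])
    insert_issue_batch(conn, [("1", "a"), ("3", "c")])
    assert [r[0] for r in conn.execute("SELECT id FROM issues ORDER BY id")] == ["1", "2", "3"]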

Since deploying this at 11AM EDT on July 27th, we've seen 0 U10 errors related to document count mismatches.

Going Forward

We've now set up fairly sophisticated monitoring and alerts around U10 errors and will keep a close eye on how that rate changes in order to quickly identify causes and make further corrections.

We're going to develop a plan to put our Kafka cluster into a configuration where nodes are rotated regularly. This will ensure that anything related to node rotations and leader elections is robust and exercised frequently. We've also reached out to a consultancy that specializes in Kafka and will work with them to evaluate and improve our architecture as it relates to Kafka.

We're extremely sorry for any inconvenience this incident has caused and are working hard to ensure things like this don't happen again.

Posted Aug 17, 2016 - 11:25 EDT

Resolved
We implemented several more remediations against these errors last week, and our rate of U10s has been extremely low since Thursday. We consider this incident resolved, and will be following up with a more in-depth post-mortem, in addition to putting further work into preventing similar issues from affecting us for such a long span of time.
Posted Aug 01, 2016 - 13:42 EDT
Monitoring
We've implemented a number of fixes over the course of this week to address the intermittent spikes of U10 analysis errors. The rate has been very good for the past day, but we'll continue to monitor things for a bit longer before considering this resolved.
Posted Jul 27, 2016 - 17:15 EDT
Update
We made configuration changes to our interaction with Kafka on Monday intended to address our resiliency to Kafka replica rebalancing events. Unfortunately, while those changes did improve our behavior during rebalancing events, they had other, unanticipated negative consequences that caused an elevated rate of analysis errors under normal operating conditions. Recognizing that the changes had unintentionally made things worse overall, we reverted them, and have seen error rates drop to normal levels.

For the moment, this means that error rates are within acceptable bounds, but it also means that we are still not resilient during Kafka replica rebalancing. We are continuing to investigate how to improve our resiliency to these events without negatively impacting normal reliability.
Posted Jul 21, 2016 - 18:37 EDT
Update
We've tracked the source of the analysis errors to Kafka replica rebalancing. We have a configuration change planned for Monday (when we'll see normal traffic levels again) to increase our resiliency to such events. We will continue to monitor stability over the weekend.
Posted Jul 15, 2016 - 18:03 EDT
Identified
At about 4:04PM EDT, we were alerted to an elevated rate of analysis errors. The rate is back to normal now, and we're investigating the root cause. Users may see "U10" analysis errors on builds from around that time.
Posted Jul 15, 2016 - 16:29 EDT