From June 16th to July 27th, Code Climate experienced a slight increase, from 0.05% to 0.17%, in analysis errors identified by the code U10. While we have a number of different analysis error codes, each tied to a specific cause and result, the cause of a U10 error is “unknown” at the time it occurs: it's unexpected and indicates a problem we need to investigate.
These errors occurred effectively at random across running builds. If your analysis errored with a U10 code during this period (typically indicated by an email and a message on the affected build page), you were likely impacted, and we apologize for the disruption.
We found varied and complicated reasons for the increase, and our visibility into what had been occurring was poor. These factors made the investigation difficult and time-consuming; it took approximately three weeks to bring the error rate back to an acceptable level.
We identified and corrected multiple independent root causes as described below. Our U10 error rate since July 27th is the lowest it's ever been at 0.001%. We've also corrected our lack of visibility to prevent an increase in error rates from going undetected like this again.
For those interested, the following is a detailed description of the problems we identified and the ways in which we fixed them. Beware, distributed systems and domain-specific jargon ahead!
There were two sources of U10 errors unrelated to networking issues or Kafka leader elections, which had been our suspected root causes for much of the incident. These sources were contributing to the error rate, increasing in frequency, and making it hard to gauge how our efforts to resolve the incident were going:
By July 21st, we had addressed those two sources, reverted an attempted fix from July 8th (which we believe made things worse), and identified and replaced a bad Kafka instance that was triggering a high rate of leader elections. Those actions were enough to bring our U10 error rate down to an acceptable level. However, we were still seeing another kind of U10 pop up sporadically: a document counts mismatch error.
When we run an analysis, we emit issue
documents to Kafka. A downstream service then reads them out of Kafka and pushes them to our database. When analysis completes, the process that did the emitting sends a message indicating how many issues it emitted. When another downstream service goes to finalize the analysis, it checks how many issues are actually in the database. If the numbers don't match, the result is a document counts mismatch error. This ensures that if a networking problem causes issues to be lost or duplicated, we won't mistakenly report them as fixed or new.
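To make that concrete, here's a minimal sketch of the finalize-time check; the names (`finalize_analysis`, `DocumentCountMismatch`) are illustrative and not our actual code:

```python
class DocumentCountMismatch(Exception):
    """Surfaced to the user as a U10 analysis error."""

def finalize_analysis(analysis_id: str, emitted_count: int, stored_count: int) -> None:
    # emitted_count comes from the message sent when the analysis finished
    # emitting; stored_count is what the database actually holds.
    if emitted_count != stored_count:
        raise DocumentCountMismatch(
            f"analysis {analysis_id}: emitted {emitted_count} issues, "
            f"found {stored_count} in the database"
        )
    # ...otherwise the analysis can safely be marked complete...

finalize_analysis("abc123", emitted_count=42, stored_count=42)  # counts agree, no error
```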
These were the errors we'd been attributing primarily to leader elections, but we were now finding that they could happen to some degree at any time. These were the errors we needed to explain and address before we could consider the incident resolved.
Kafka guarantees at-least-once delivery. That means it will retry a message unless it's 100% sure it was delivered. This is in contrast to at-most-once delivery, where it would not retry a message unless it's 100% sure it was not delivered. In one case you can expect nothing lost but possible duplicates; in the other you can expect no duplicates, but things may be lost. To build the appearance of exactly-once delivery, the approach is to use at-least-once delivery and then remove any duplicates when you read the data back.
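The read-side half of that approach can be sketched in a few lines; this illustrates the pattern rather than our actual consumer:

```python
def deduplicate(messages):
    """Drop any message whose id has already been seen in this stream."""
    seen = set()
    for msg in messages:
        if msg["id"] in seen:
            continue  # a redelivery from at-least-once retries; ignore it
        seen.add(msg["id"])
        yield msg

incoming = [
    {"id": 1, "check": "complexity"},
    {"id": 1, "check": "complexity"},  # retried by the producer
    {"id": 2, "check": "duplication"},
]
assert [m["id"] for m in deduplicate(incoming)] == [1, 2]
```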
With this understanding, we aimed to correct the mismatch-over case (when we thought we emitted X issues but ended up seeing more than X issues in the database). We did this by pre-computing the database primary key and including it with the emitted issue. This way, if our consumer sees the same issue twice, the second will fail with a duplicate key exception and not end up in the database.
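Here's a simplified illustration of that idea, using an in-memory SQLite table and a hash-based key for brevity; our actual datastore and key scheme aren't shown here:

```python
import hashlib
import sqlite3

def issue_key(analysis_id: str, fingerprint: str) -> str:
    # Deterministic: the same issue always yields the same primary key, so a
    # redelivered copy collides instead of creating a second row.
    return hashlib.sha1(f"{analysis_id}:{fingerprint}".encode()).hexdigest()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE issues (id TEXT PRIMARY KEY, body TEXT)")

key = issue_key("analysis-1", "complexity:app/models/user.rb:7")
db.execute("INSERT INTO issues VALUES (?, ?)", (key, "Method too complex"))
try:
    db.execute("INSERT INTO issues VALUES (?, ?)", (key, "Method too complex"))
except sqlite3.IntegrityError:
    pass  # the duplicate delivery is rejected rather than double-counted
assert db.execute("SELECT COUNT(*) FROM issues").fetchone()[0] == 1
```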
This was one of the necessary fixes, but on its own it actually made things worse overall. We implemented it once, then reverted it until we were able to understand the following:
Our consumer, the one that reads issues out of Kafka and puts them in the database, does so in batches for performance reasons. What we didn't realize was that if any record in the batch violated a unique key, the whole batch was dropped. When this was originally implemented, we considered that a rare and unexpected occurrence, not a normal part of our architecture's design.
This explains why we would see mismatch-under cases, even though Kafka is guaranteeing at-least-once delivery and nothing is being dropped. If a retried message violated a unique key, we'd lose the whole batch and find a less-than-expected number of issues in the database. If a retried message didn't violate a unique key, it would be accepted and we'd find more-than-expected. By introducing the pre-generated ids, we actually increased the unique key violations and also increased the dropped batches and mismatch-under cases.
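Here's a small reproduction of that failure mode, again with SQLite standing in for our actual database; the per-record loop at the end is just one plausible way to keep the non-duplicates, not necessarily the exact fix described below:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE issues (id TEXT PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO issues VALUES ('a', 'stored by an earlier delivery')")
db.commit()

batch = [
    ("b", "new issue"),
    ("a", "stored by an earlier delivery"),  # retried duplicate
    ("c", "new issue"),
]

# Batch insert: the one duplicate aborts the transaction, so 'b' and 'c'
# are lost with it (the mismatch-under case).
try:
    with db:
        db.executemany("INSERT INTO issues VALUES (?, ?)", batch)
except sqlite3.IntegrityError:
    pass
assert db.execute("SELECT COUNT(*) FROM issues").fetchone()[0] == 1

# Handling records individually keeps the non-duplicates.
for row in batch:
    try:
        with db:
            db.execute("INSERT INTO issues VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        continue  # only the duplicate is skipped
assert db.execute("SELECT COUNT(*) FROM issues").fetchone()[0] == 3
```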
Eventually, we implemented the following overall fix:
Since deploying this at 11AM EDT on July 27th, we've seen 0 U10 errors related to document count mismatches.
We've now set up fairly sophisticated monitoring and alerts around U10 errors and will keep a close eye on how that rate changes in order to quickly identify causes and make further corrections.
We're going to develop a plan to put our Kafka cluster into a configuration where nodes are rotated regularly. This will ensure that anything related to node rotations and leader elections is robust and exercised frequently. We've also reached out to a consultancy that specializes in Kafka and will utilize them to evaluate and improve our architecture as it relates to Kafka.
We're extremely sorry for any inconvenience this incident has caused and are working hard to ensure things like this don't happen again.