build cycles, development cycles, and the nag server
Tuesday, 29 April 2003
One thing I've noticed about our continuous integration process and its meta-process is that both are cyclical. Here are three examples:
..FF...FFFFFFFFFF...., or when it rains, it pours
One of the first things you notice when applying a continuous integration process to parallel development is that integration problems compound themselves. (This isn't a side effect of continuous integration; CI simply makes it plain.)
On most days we'll have (indeed, we expect) the occasional build failure--someone neglected to check in a file here, or forgot that another component was using that method whose signature just changed, etc. On good days, the team is on its collective toes and responds quickly to the failure notice. The problems are resolved in a build or two and we're right back on track. On bad days, an integration problem will persist for several builds, and as-yet undiscovered and unreported problems will begin to queue up behind the first one as folks continue to make commits. The longer the queue of integration problems gets, the longer it gets. As a result, bad days have a nasty habit of turning into bad weeks.
Anecdotally, the tipping point seems to be around three consecutive build failures. When we hit the third consecutive failure, the odds of hitting a fourth, a fifth, or even a tenth failure seem to increase dramatically. I think there are at least two reasons for this: (1) Unless the developers notice and fix the problem on their own--independent of the CI build failure notifications--failures will almost always come in pairs. By the time the nag email is delivered, let alone diagnosed and fixed, a second build will already be in progress. If the problem is going to be fixed "immediately", then it will be fixed for the third build, but generally not before it. (2) Depending upon the specific component that encounters problems, a failed build takes 20 to 40 minutes. Three consecutive failures means it's been an hour, maybe two, since our last complete build. On a busy day we can have quite a few commits in an hour, so the number of unintegrated changes grows quickly. All of these changes need to be addressed before we reach a successful build.
two steps forward, one step back
We've found that our CI process doesn't progress monotonically. We're constantly trying to strike a balance between the functionality provided by the CI builds, the time it takes to complete a build cycle, and the likelihood of extraneous build failures. Often we find that we've stretched a bit too far and need to pull back.
The time it takes to complete a build is often the driving factor. For a time we generated and published JavaDoc documentation and various source code and test coverage metrics following every successful build, but we found this added too much time to the build cycle. (As discussed above, the longer changes wait to be integrated, the greater our exposure to risk.) Instead, we rely upon cron-driven or manual processes to generate these artifacts.
At times, and to my great frustration, we've had to remove aspects of the build that were simply too brittle for continuous use. Our Latka-based functional test suite, which tests a number of our web applications, was removed from the CI builds largely because of test rot and the instabilities introduced by being too dependent upon external services that change outside of the build process (database servers are one example, although not the only one). The CI process still deploys our web-based applications, but leaves most of the functional testing to manual invocation.
the nag server
From time to time we find that the discipline of continuous integration begins to slip: successful builds become less frequent and broken builds are fixed more slowly. The team gets used to seeing frequent build failures, begins ignoring nag messages, becomes complacent, and tends to look for local workarounds rather than addressing the global integration issues. (This is an instance of the fix broken windows pattern.) Despite what Fowler will tell you, in my experience there are some developers who are more than happy to give up the benefits of a continuous integration process, or who fail to recognize what those benefits are in the first place. (Perhaps not coincidentally, many of these developers haven't been working here as long as the CI process has.) In part these cycles are an extension of the cycles above--cruft and insufficiently considered work-arounds build up until the code base is fundamentally brittle. In part they are related to the evolution of the code base--major, cross-component refactorings and shotgun maintenance sometimes lead to periods of build instability. Whatever the cause, we've found that sometimes the team needs a little kick to get back on track.
Morgan recently suggested one pleasingly simple kick of this sort. While discussing a low point in our CI cycle, Morgan half-jokingly suggested that what we needed was a "nag server"--a giant, prominently located monitor that displays the status of the current build. We jumped on the idea immediately, grabbed the biggest monitor we could find and an underused server, and set them up in the hallway in front of the primary development cube farm, with a web browser continuously refreshing a simpler, bigger, and bolder version of our build status page. Morgan later added an automated count of the number of consecutive build failures or successes (replacing a flip chart we had been updating by hand for a while).
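For the curious, the consecutive-build counter doesn't need to be anything fancy. Here's a rough sketch--not Morgan's actual code--that assumes each completed build appends a line beginning with SUCCESS or FAILURE to a plain-text history file (the file name and line format are invented for the example):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/**
 * Reports the current run of consecutive build successes or failures.
 * Sketch only: assumes each completed build appends a line starting with
 * "SUCCESS" or "FAILURE" to a plain-text history file.
 */
public class BuildStreak {
    public static void main(String[] args) throws IOException {
        String historyFile = (args.length > 0) ? args[0] : "build-history.log";

        String lastStatus = null; // status of the most recent run of builds
        int streak = 0;           // length of that run

        BufferedReader in = new BufferedReader(new FileReader(historyFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String status = null;
                if (line.startsWith("SUCCESS")) {
                    status = "SUCCESS";
                } else if (line.startsWith("FAILURE")) {
                    status = "FAILURE";
                }
                if (status == null) {
                    continue; // skip lines that aren't build results
                }
                if (status.equals(lastStatus)) {
                    streak++;            // the current run continues
                } else {
                    lastStatus = status; // a new run starts here
                    streak = 1;
                }
            }
        } finally {
            in.close();
        }

        if (lastStatus == null) {
            System.out.println("no builds recorded yet");
        } else {
            System.out.println(streak + " consecutive " + lastStatus + " build(s)");
        }
    }
}

Run something like that at the end of each build, drop the output into the status page, and a meta refresh tag in the page handles the "continuously refreshing" part.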
The nag server worked rather well for a while--it improved the build success:failure ratio and increased the visibility of the continuous integration process, both within the development team and among the "customer" team. We often found developers gathered around the nag server discussing the current integration problem or checking to see whether their changes made it into the current build. Unfortunately, the nag server eventually became less effective as a motivational tool. Additional measures seemed necessary. More on that in a later post.