The second half of this Hudson-adoption case study sees the team working through some challenges and setbacks. Do they meet their goals? Find out how this virtualization journey ends.
In part 1 of this article, I described the trials and tribulations of our Hudson build environment at my workplace. This environment started out as a simple system that could build and test our code in a few minutes. Over the years, the build time increased until we had to wait far too long for feedback from the system, and I wanted to solve this problem by trying a pool of virtualized build servers.
We have been using server virtualization around the office for about three years now. We've even had some virtualized servers in our production environment. This technology is great and works as advertised.
We decided to buy a single eight-core machine and split it into eight virtual build slaves. On paper, this seemed like a perfect solution to our problem, so it was surprising that we just couldn't get the money approved for it. Eight-core servers (two CPUs with four cores each) are standard and not that expensive right now (about $3,000), especially considering the cost of having highly paid engineers wait for a build. However, this upgrade always seemed to be put on the back burner until the issue happened again.
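To make the cost argument concrete, here is a rough break-even sketch. Only the $3,000 server price comes from our situation; the number of engineers waiting, the hourly rate, and the minutes saved per build are hypothetical placeholders, so substitute your own figures.

```python
import math


def breakeven_builds(hardware_cost, engineers_waiting, hourly_rate, minutes_saved_per_build):
    """Return how many builds it takes for saved waiting time to pay for the hardware.

    All inputs except the hardware price are illustrative assumptions,
    not figures from the case study.
    """
    # Money saved on each build: people waiting * cost per hour * hours saved.
    saved_per_build = engineers_waiting * hourly_rate * minutes_saved_per_build / 60
    if saved_per_build <= 0:
        raise ValueError("inputs must yield a positive saving per build")
    # Round up: a partial build does not pay anything back.
    return math.ceil(hardware_cost / saved_per_build)


if __name__ == "__main__":
    # Hypothetical team: 4 engineers at $75/hour, each build 10 minutes faster.
    print(breakeven_builds(3000, 4, 75, 10))
```

With those made-up inputs, each build saves $50 of engineer time, so the server pays for itself after 60 builds. A team that builds several times a day gets there in a few weeks.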
Here We Go Again
At that point, our main compile build was generating 738 MB of data. This build ran in isolation on the master server, as moving that much data across the wire back to the master from a slave would have added to the build time, which was already at fifteen minutes.
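For a rough feel of what moving 738 MB back to the master could cost, here is a back-of-the-envelope calculation. The link speeds are assumptions (I am not quoting our actual network), 1 MB is treated as 10^6 bytes, and the result is a theoretical best case that ignores protocol overhead and disk I/O.

```python
def transfer_seconds(size_mb, link_mbit_per_s, efficiency=1.0):
    """Estimate the seconds needed to move size_mb megabytes over a network link.

    link_mbit_per_s and efficiency are assumed values; efficiency below 1.0
    models protocol overhead and contention on a shared network.
    """
    if link_mbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    megabits = size_mb * 8  # 8 bits per byte
    return megabits / (link_mbit_per_s * efficiency)


if __name__ == "__main__":
    build_output_mb = 738  # size of the main compile build's output
    for speed in (100, 1000):  # assumed Fast Ethernet and gigabit links
        secs = transfer_seconds(build_output_mb, speed)
        print(f"{speed} Mbit/s: about {secs:.0f} s")
```

Under these assumptions the copy takes roughly a minute on a 100 Mbit/s link and several seconds on gigabit, before any real-world overhead is counted, and that would be paid on every build.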
On August 2, the master started to crash. Lisa Crispin, our tester, sent an email to the team at 8 p.m. that said, "Hudson just start freaking out." Our main Linux guy responded, "The server is seriously ill," and included the following log information:
I read the emails and knew we had just lost the disks. The thing was RAID 5 hardware, but it was no use. In the morning, our Unix guru tried to restart the box, but it did not work—the controller (Dell PERC 4) just started to reinitialize the drives. We had officially lost our entire configuration.
We had an old Dell PE 850 powered off in the rack, and I decided to rebuild on that while the rest of the team was sharpening the pitchforks. It took about a day just to get the compile build working again. This was a slower machine, so the build time went up to seventeen minutes, but at least the team put the pitchforks away.
Time to Implement Something New
It took a long time to rebuild everything and, at the same time, we had some major software architecture changes that made it hard to determine whether a build was failing because of a new Hudson configuration issue or because of our code changes.
The good news was that this failure prompted management to approve not only our original request but also a new Hudson master to replace the failed box. After some debate and a lot of planning, we decided to make everything virtual, even the master, in order to guard against another hardware failure that we knew would happen eventually. If the system crashed again, any virtual machines (VMs) on the crashed box could migrate to the working box. If we did this correctly, we would no longer have any downtime due to hardware failures.
The Dawn of a New Generation
Before I could commit 100 percent to the virtualization path, I needed the performance data to back up the decision. Recall from part 1 that our precrash Hudson server could do the compile