Software quality folks seem to enjoy reading about bug hunts, so here is the tale of one particularly interesting bug that I hunted down lately. I'll go into some fairly low-level details, so my non-technical readers may want to skip down to the lessons listed at the bottom. I can't reveal which system I was working on, so I've changed some of the details.
One of my clients uses a script called "webupdate" to download new versions of the operating system software provided by an outsourced development team. The script copies the software from the vendor's web site and compiles it. One time when I was going through the process, something didn't seem quite right. The run took about as long as it usually does, but there seemed to be quite a bit less output than usual. I hadn't saved the output from an earlier successful run, so I couldn't know for sure. But I was pretty sure that this error near the beginning of the output hadn't been there before:
checking web site for new packages...
: web file not found
This was followed by the voluminous output of a successful build, and then a final message indicating that all was well. But none of the changes that were supposed to be in this release of the system were there. This was a showstopper, despite the misleading indication of success. I decided to investigate, to see whether this was a problem with our environment or whether I needed to report a bug to the vendor. Note that we were running the previous version of the system to fetch the new one, and we hadn't had this kind of trouble with any earlier version.
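As a minimal illustration of how a script can print an error and still report success, here is a sketch. It is not the real webupdate code: the function names fetch_packages, build_system, update_unchecked and update_checked are all hypothetical, and the cause of the actual bug may be quite different.

```shell
#!/bin/sh
# Hypothetical stand-ins; none of these names come from the real webupdate.

# Simulates a download step that fails and says so.
fetch_packages() {
    echo ": web file not found"
    return 1
}

# Simulates a build step that succeeds with whatever sources are on disk.
build_system() {
    echo "building..."
}

# The fetch result is never checked, so the build runs on stale sources
# and the final message still claims success.
update_unchecked() {
    fetch_packages
    build_system
    echo "update complete"
}

# Stops as soon as the fetch fails, so the failure cannot be missed.
update_checked() {
    fetch_packages || { echo "fetch failed, aborting" >&2; return 1; }
    build_system
    echo "update complete"
}
```

Calling update_unchecked prints the error, then the build output, then "update complete", which matches the shape of the output described above. Calling update_checked stops after the error and returns a non-zero status.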
Luckily, webupdate is a script rather than a compiled program, so I'm able to easily examine the implementation. By looking at the output right before and right after the mysterious ": web file not found" error, I determine that the error most likely came from a "getdir /" command. Okay, great. What does getdir do? I can't find any documentation for such a command, and I can't find a program by that name. Oh! Half a screen up in the webupdate script is a function named "getdir".
Okay, I'm getting closer to the source of the problem, but I don't know how close yet. There is nothing in the function directly that prints out the text "web file not found." But there are a few calls to external programs. The second one is preceded by an "echo downloading $name..." message, which I didn't get in the output, so I explore the first, which is a call to a program called "webls".
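A common shortcut for this kind of search, not necessarily the route taken here, is to look for the literal message text in every script in a directory. The sketch below assumes a grep that supports recursive search (-r), which GNU and BSD grep both do. The helper name find_message and the directory in the usage comment are made up for illustration.

```shell
#!/bin/sh
# find_message DIR TEXT
# Lists the files under DIR that contain the literal string TEXT.
# -r searches recursively, -l prints only file names,
# -F treats TEXT as a fixed string rather than a regular expression.
find_message() {
    grep -rlF -- "$2" "$1"
}

# Example (the directory is an assumption, not the client's real layout):
#   find_message /usr/local/bin "web file not found"
```

This only helps when the message appears verbatim in the source. A message assembled from variables, like one that starts with an empty variable and a colon, has to be searched for by its fixed part.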
I find the webls program, which is also a script. Aha! There's the telltale code that produces the error: "echo $sub: web file not found". Hmmm, the $sub variable must be empty, since the error we got starts with a lone colon. Maybe that's the problem. I trace this variable through the program and find a pair of regular expressions that munge the parameter passed into the script. The parameter in this case is "/", indicating the root of the web server, and the regular expressions erase this character. Looking at the logic of the script, that seems to be okay, so this is a dead end. I turn my attention to a call to another external program: "webget .../download/$sub 2>/dev/null".
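To see why an empty $sub yields a message that begins with a lone colon, here is a small sketch. The actual regular expressions in webls aren't shown in this article, so the two sed expressions below are an assumption that merely reproduces the described effect: a parameter of "/" is reduced to nothing. The function names are hypothetical too.

```shell
#!/bin/sh
# Assumed stand-in for the path munging in webls: strip one leading
# and one trailing slash, so "/" becomes the empty string.
strip_slashes() {
    printf '%s\n' "$1" | sed -e 's|^/||' -e 's|/$||'
}

# Prints the same style of error message that webls emits.
report_missing() {
    sub=$(strip_slashes "$1")
    echo "$sub: web file not found"
}

report_missing "/"        # prints ": web file not found"
report_missing "/docs/"   # prints "docs: web file not found"
```

With the root path the variable is empty, so nothing precedes the colon. That explains the odd look of the message, but it doesn't explain why the lookup failed in the first place.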
It turns out that this one is a compiled program. I do have access to the source