
Better Software, Featured Article

March 2007


Beat The Odds
by Joel Spolsky

You know that old saying that the best way to schedule software development is to come up with your best guess and multiply by three? And you know how that always seems to work?

Well, I think I've figured out why that happens. As it turns out, it's not just because the world is funny—ha-ha!—and things always go wrong. I’m pretty sure there’s a mathematical reason that schedules don't add up right unless you "multiply by three," and once you understand it, you can create much more accurate software schedules that actually are grounded in real evidence. I call it evidence-based scheduling, or EBS for short.


A Little Intuition
Let’s take a super simple example just to get started. Suppose you have to do a really short software project consisting of two tasks. Each task, you estimate, should take eight hours (one work day).

As I said, there are two tasks.

Eight hours each.

OK, now, how long will the whole project take?

No rush . . . I'll still be here. It's not a trick question. You do the math.

(Jeopardy music)

And then you try to do it, and lo and behold, it takes three days.

HA-HA! It was a trick question!

What went wrong?

I'll explain. Software estimates actually contain some uncertainty. It is a little bit too simple to say that a given task is going to take "eight hours." It is far more realistic to say things like, "There is a 50 percent probability that we can get this done in eight hours." That is a very different statement!

"OK, so I might go over," you say. "But I might also go under, and on a long project all the tasks that came in late will be balanced out by the tasks that came in early!"

Aha!

That, precisely, is where you are wrong.

Think about what kinds of unpredictable things happen with a software development task:
• You discover that the library function you're going to use won't work.
• You realize that you also have to do some character encoding you hadn’t planned on.
• When you've built it, it's too complicated for the end-users.
• A new version of Visual Studio comes out, which you spend the day downloading and installing.
• And, on very rare occasions:
• You discover a shortcut.
Briefly, there are a lot of ways to slip and not very many ways to come in early.

When you think about that eight-hour task, sure, it could come in a lot sooner (say, four hours), or a lot later (say, twelve hours). But could you do it in zero hours? No.

Could you get it done in minus twelve hours? Completely impossible! You'd have to go backward in time!

Can you imagine an eight-hour task taking sixteen hours? Sure, it happens all the time. Could it take forty hours? Yep, things go wrong; sometimes eight-hour tasks take a whole week. Not often, but it can happen.

Rather than using fixed times ("eight hours") for estimates, it makes more sense to think about probabilities. What is the probability that this task will take four hours? Eight hours? Sixteen hours?

For each task, then, there's a probability distribution curve—a mathematical curve that tells you the chances of completing it within any given amount of time. For example, figure 1 shows a task that has been estimated at eight hours.

The precise numbers in this chart are made up, but I think they're realistic. Again, when you think about the probability distribution curve for a software task, there are a lot more ways to come in late than there are to come in early; and when it does come in late, it can come in far later than it ever comes in early.

Now, watch closely, this is the cool part! If you were stuck, like Bill Murray in Groundhog Day, doing this task again and again and again, and each time it took a different amount of time, but it basically fit this curve (50 percent of the time it took less than eight hours, etc.), what would be the average time to complete this task?

Well, you can do this in Excel, and I’ll spare you the gory details, but with this curve, depending on the random numbers you happen to generate, this task takes an average of between fifteen and sixteen hours.

That's weird, you say. There's a 50 percent chance that it’ll be done in eight hours. Why does it take an average of more than fifteen hours?

Even though it's equally likely to be early or late, as I said before, when it's late, it can be much later than it can be early when it's early. So you wind up with an average time of more than fifteen hours.

Let me repeat that in case you've been dozing off, which, believe me, you deserve to be. Just because the "most likely" case is eight hours, and just because there's a 50 percent chance of taking less than eight hours and a 50 percent chance of taking more, the expected amount of time this feature will take is actually more than fifteen hours!
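You can check this claim numerically with a short Python sketch. The lognormal distribution and its parameters below are my stand-ins for the figure 1 curve, not the article's exact numbers; what matters is the shape: a median of eight hours and a long late tail.

```python
import math
import random

random.seed(1)

# Stand-in for the figure 1 curve (an assumption): a lognormal
# distribution with its median at eight hours. It can never go below
# zero hours, but it has a long tail of late outcomes.
mu, sigma = math.log(8), 1.15

samples = sorted(random.lognormvariate(mu, sigma) for _ in range(100_000))

median = samples[len(samples) // 2]
mean = sum(samples) / len(samples)

print(f"median: {median:.1f} hours")  # about 8: half the runs finish by then
print(f"mean:   {mean:.1f} hours")    # far more than 8, because of the tail
```

Half the simulated runs finish within roughly eight hours, yet the average comes out around fifteen hours, exactly the gap the text describes.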

Now, let's say you have two eight-hour features. Even though it seems like 8+8=16, the probability distribution of when you're going to finish actually looks like the one in figure 2, which shows that there's only a 35 percent chance of coming in within sixteen hours. So, while I hate to say that eight plus eight is not equal to sixteen, with schedules it's more likely that 8+8=20. Or, to be more specific, if you have two features that each have a 50 percent chance of coming in within eight hours, and you need to do them both, you have a 50 percent chance of being able to do that in twenty hours.

And the more small tasks you're planning to do, the more likely you will be delayed in this way.

And that is the reason why you can't just sum up a bunch of schedule items and get a correct estimate.
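The two-task case can be sketched the same way. Again, the lognormal curve is my assumption standing in for the real per-task distribution; the point is that the chance of finishing both within sixteen hours falls well below 50 percent, and the 50 percent point lands well past sixteen hours.

```python
import math
import random

random.seed(2)

# The same assumed lognormal curve (median eight hours) for each task.
mu, sigma = math.log(8), 1.15

totals = sorted(
    random.lognormvariate(mu, sigma) + random.lognormvariate(mu, sigma)
    for _ in range(100_000)
)

on_time = sum(1 for t in totals if t <= 16) / len(totals)
p50 = totals[len(totals) // 2]

print(f"chance both tasks finish within 16 hours: {on_time:.0%}")
print(f"50 percent of futures finish within {p50:.1f} hours")
```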

How Did I Calculate the Chart in Figure 2?
Using a statistical method called Monte Carlo simulation, you can actually add up a bunch of tasks. The basic idea behind the Monte Carlo method is to calculate your schedule again and again—say, 1,000 times. Each time through the simulation, you use random numbers instead of the actual numbers. And not just any random numbers: You actually generate random numbers that are distributed with the same probability as the underlying task.

To take a very simple example, imagine a task that will take four hours a quarter of the time, eight hours half of the time, and sixteen hours the remaining quarter of the time.

So now, you need a random number generator that actually produces the distribution of numbers. It needs to produce "eight hours" randomly about twice as often as it produces four or sixteen.
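Such a generator is a one-liner in Python: `random.choices` accepts relative weights, so "eight" can be made twice as likely as "four" or "sixteen".

```python
import random

random.seed(3)

# The three-point task from the text: "eight hours" should come up about
# twice as often as "four" or "sixteen".
def random_task_time() -> int:
    return random.choices([4, 8, 16], weights=[1, 2, 1])[0]

draws = [random_task_time() for _ in range(40_000)]
for hours in (4, 8, 16):
    print(f"{hours:2d} hours: {draws.count(hours) / len(draws):.1%}")
```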

If your schedule consists of hundreds of individual tasks, each task has its own probability distribution. On each iteration, you generate an appropriate random time for the task and then you add these up.

At the end of the iteration, you get the total amount of time all these tasks took, according to the current iteration.

When you run the simulation 1,000 times, you get 1,000 slightly different ship times. Each ship time will occur with 1/1,000 probability.

Now when your boss asks you, "Will we ship by April 4?" you can answer, "Maybe! Out of 1,000 simulations, in 342 of them the ship date was on or before April 4. That means we will ship by April 4 with 34.2 percent probability."

"That’s not very good," says your boss. "What date will we ship by with 95 percent likelihood?"

And you can answer, "Well, when I sorted the simulations in order of ship date and picked simulation 950 on the list—such that 95 percent of the ship dates were earlier and 5 percent were later—that simulation has us shipping on October 14."
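Both of the boss's questions are simple lookups on the sorted list of simulated outcomes. A sketch, using a made-up set of simulated totals (in hours) purely for illustration:

```python
import random

random.seed(4)

# Pretend the simulation already ran: 1,000 total-hours outcomes,
# sorted. (This distribution is invented just to demonstrate the lookups.)
ship_hours = sorted(random.lognormvariate(6.0, 0.4) for _ in range(1000))

def chance_of_shipping_within(budget: float) -> float:
    """Fraction of simulated futures that fit within the hour budget."""
    return sum(1 for h in ship_hours if h <= budget) / len(ship_hours)

def hours_at_confidence(p: float) -> float:
    """Smallest hour budget that a fraction p of the futures meet."""
    index = min(int(p * len(ship_hours)), len(ship_hours) - 1)
    return ship_hours[index]

print(f"{chance_of_shipping_within(400):.1%} chance of shipping within 400 hours")
print(f"95 percent confident we ship within {hours_at_confidence(0.95):.0f} hours")
```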

In fact, you can generate a curve that shows the probability of shipping by any given date.

And that's really useful.

How Do You Generate the Probability Curve for a Single Task?
One thing you can do is ask the developer who is estimating the task. But that is a lot of extra work for the developer, and it is unlikely to produce anything reliably correct.

Instead, I suggest that you actually look at that estimator's history. For example, suppose Michael, who has a long history of estimating, has estimated this task at eight hours.

When we look back through our detailed records, we see that the last five times he thought something would take eight hours, it actually took six hours, then eight hours, then eight hours again, and then ten hours, and once it took fourteen hours.

Ta-da! Now you know Michael's probability curve for eight-hour tasks. When you run the Monte Carlo simulation, whenever Michael thinks something will take eight hours, you should actually draw from that history: six, eight, eight, ten, or fourteen hours, each equally likely.

This is why I call this method evidence-based scheduling. What you're doing is looking at evidence of what actually happened in the past when this developer estimated a feature at eight hours, and you're using it to run a detailed simulation with real probabilities and make conclusions about the probability distribution of ship dates that are entirely rational.

The neat thing about this method is just how well it degrades. If you have one developer who has a ridiculous habit of underestimating every task by a factor of five, evidence-based scheduling detects this and still produces the correct result. For eight-hour features, this developer's probability distribution is a single spike out at forty hours.

Every time through the simulation, that developer's estimate of eight hours will be interpreted as taking forty hours—one work week—because that's what has happened in the past, and you'll get correct output.

More commonly, you might have a terrible estimator who is often way off the mark. For eight-hour features, his probability distribution is scattered all over the place.

It almost seems random. The net result of this weak estimator is actually just a bunch of uncertainty in the schedule, but the Monte Carlo simulation will preserve this uncertainty and give you a final probability curve that reflects all of that uncertainty. It's frustrating not to know exactly when you’re going to ship, but the truth is that this method is still extremely useful because instead of just giving you a date, it gives you a probability curve so you can clearly see how reliable that date is.
• If you have a lot of good estimators, you'll see a very tight range of possible ship dates. In fact, if all your estimators are historically perfect, you’ll see a single ship date with 100 percent probability, as you might expect.
• The more poor estimators you have, the wider the spread of ship dates. The 40 percent likelihood ship date may be months earlier than the 60 percent likelihood ship date.
Evidence-based scheduling degrades gracefully with the quality of your team's estimating abilities.

Using All Available Data
This method works best when you have a long historical record for how good your developers are at estimating things. You're going to need to gather timesheet data on every task that the developer did, with both the estimate and the actual elapsed times.

Depending on the granularity of your tasks, after a couple of months of collecting data, you'll probably have about ten data points for each developer. Each data point consists of an estimate and an actual time. Divide them to get velocities. For example, a task that was estimated at ten hours but took twenty hours was done at velocity 0.5. A task that was estimated at forty hours but took thirty hours was done at velocity 1.333. Once you have five or six velocities for a given developer, you can start doing Monte Carlo simulations for tasks of any length.
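In Python, turning timesheet pairs into velocities is a one-liner. The first two pairs below are the article's own examples; the rest are hypothetical filler.

```python
# Each data point pairs an estimate with the actual elapsed time.
# The first two pairs are the article's own examples; the rest are
# hypothetical filler.
history = [(10, 20), (40, 30), (8, 8), (12, 15), (6, 5)]

velocities = [estimate / actual for estimate, actual in history]
print(velocities)  # starts with 0.5 and 1.333..., as in the text
```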

On each iteration through the Monte Carlo simulation, you need to generate a random velocity that has the same distribution as that developer's historical velocities.

For example, suppose Michelle has finished ten tasks, each with a recorded estimate and actual elapsed time, giving you ten velocities.

Next, plot all the velocities, in increasing order, on a little chart. The x axis is just numbered 0 through n-1 for n tasks. The y axis is the velocity. You should get a monotonically increasing chart (see figure 3).

For each round of the Monte Carlo simulation, we'll need to take the output from a uniformly distributed random-number generator, which generates each integer 0 ≤ x ≤ n-1 with equal probability, and convert those to velocities that are distributed with the same probability as Michelle's historical record. Given this chart, just look up the random number on the x axis, see where it intersects the curve, and find the corresponding velocity on the y axis.

That gives you your randomly generated velocity. To get a simulated time for a future task, all you have to do is divide that task's estimate by the velocity.

For example, if the evenly distributed random number were two, the velocity we use is 0.81. So if Michelle is estimating a task at ten hours, for this iteration, pretend it will take 10 ÷ 0.81 ≈ 12.3 hours.
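A sketch of one such draw in Python. The velocity list is hypothetical, except that index 2 holds 0.81 as in the text. Note that because a velocity is an estimate divided by an actual time, a velocity below 1.0 makes the simulated time come out longer than the estimate: you divide the estimate by the drawn velocity.

```python
import random

random.seed(6)

# Michelle's sorted historical velocities. Hypothetical values, chosen
# so that index 2 holds 0.81 as in the text.
velocities = [0.60, 0.72, 0.81, 0.90, 0.95, 1.00, 1.00, 1.05, 1.10, 1.33]

def simulate_task(estimate_hours: float) -> float:
    # A uniformly random index into the sorted list is exactly the chart
    # lookup: each historical velocity is drawn with equal probability.
    v = velocities[random.randrange(len(velocities))]
    # Velocity is estimate / actual, so the simulated actual time is the
    # estimate divided by the drawn velocity.
    return estimate_hours / v

print(f"one simulated time for a 10-hour estimate: {simulate_task(10):.1f} hours")
```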

Overall, then, to use the Monte Carlo method for evidence-based scheduling, repeat the following 1,000 times: for each remaining task, draw a random velocity from its estimator's history, divide the task's estimate by that velocity, and sum up all the simulated times to get one possible ship date.
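A minimal runnable version of that algorithm in Python. The task list and velocity histories here are hypothetical, invented purely to show the shape of the loop:

```python
import random

random.seed(7)

# Remaining tasks as (estimator, estimated hours). Hypothetical data.
tasks = [("michelle", 10), ("michelle", 4), ("michael", 8), ("michael", 16)]

# Each estimator's historical velocities (estimate / actual). Hypothetical.
history = {
    "michelle": [0.60, 0.81, 0.90, 1.00, 1.10],
    "michael": [0.75, 0.80, 1.00, 1.00, 1.40],
}

ROUNDS = 1000
futures = []
for _ in range(ROUNDS):
    total = 0.0
    for estimator, estimate in tasks:
        # Draw a random velocity from this estimator's history and divide
        # the estimate by it to get one simulated actual time.
        total += estimate / random.choice(history[estimator])
    futures.append(total)

futures.sort()
# futures[i] is met by roughly (i + 1) out of ROUNDS simulated futures.
print(f"50 percent confidence: {futures[ROUNDS // 2]:.0f} hours")
print(f"95 percent confidence: {futures[int(0.95 * ROUNDS)]:.0f} hours")
```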

Now you have seen 1,000 possible futures, each with equal probability. Sort these futures by ship date to make a probability curve showing the chances that you will ship by any particular date, and you're done!

Implementation Tips
Over time, developers are likely to become better and better at estimating their own tasks, especially if you give them historical feedback. Our implementation of evidence-based scheduling shows developers a chart that plots their recent estimates against the actual time recorded to implement each task, so outliers are easy to find. Monitoring this chart over time helps developers learn to estimate more accurately.

In looking at developers' historical velocities, you might want to use only the most recent ones. That way, as developers get better, those older estimates—made years ago when they were young and naïve—will no longer affect their new estimates.

We've heard too many stories of managers who try to base performance reviews on bug counts in the bug-tracking system. Whenever this happens, developers immediately learn not to put bugs into the bug tracker. Every bug report becomes an argument ("That's not a bug! It's a feature!"), and eventually the bug tracker withers and dies.

Similarly, the minute you implement evidence-based scheduling, you run the risk of management trying to use the data that you collect to rate developers for performance reviews, bonuses, etc. If this data becomes a crutch for doing performance reviews, developers will start gaming the system. Suddenly, instead of getting useful estimate data that can be used to plan your projects, you'll start getting fictional data carefully crafted to make developers look good on their performance reviews. Generally, any attempt to use project management tools to collect performance review data is guaranteed to lead to manipulation of the project management system in a way that assures everyone looks good, while rendering the project management system completely useless.

Summary
Even when task estimates are done well, and they actually reflect the most likely amount of time that the task will take, you can't simply sum up estimates mathematically to get the project ship date, because when tasks go over, they go over by far more than they come in under when they come in early. Despite all this, using evidence-based scheduling and the Monte Carlo method, you can simulate your future schedule and generate a precise probability curve of possible ship dates that is far more informative and accurate than what you get from naïve methods.