After my first experiment with FsCheck and property-based testing, I was annoyed.
The Haskell programming language has been around for a while, but I had never used it. What seized my attention was a tool called QuickCheck and the paradigm it introduced. Because I had been working in the .NET space, I opted to investigate FsCheck, an F# port of QuickCheck. My annoyance came from demonstrating a property of list collections.
When we reverse a list, then reverse it again, we expect to get the initial list back, right? This is a property that should hold regardless of how long the list is or what it contains. Imagine my surprise when my second test with FsCheck failed. The first test, with a list of integers, passed, but when I changed the list to hold floats, it failed.
FsCheck generated random values to populate the list, then checked to see if the property being tested held. It did so many, many times, looking for one combination of values that fails. In the case of integers, it isn’t just the 1, 2, 3… we first think about, but also the largest and smallest values that can be stored in a fixed-width integer, zero, and negative numbers. We must also consider empty lists. FsCheck did all this. So why did it fail for floating-point numbers?
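To make that loop concrete, here is a minimal hand-rolled sketch in Python. It is not FsCheck and does not use FsCheck's API; the names `random_int_list` and `check`, the edge-case list, and the run count are all my own assumptions for illustration.

```python
import random

def reverse_twice_is_identity(xs):
    """Property: reversing a list twice yields the original list."""
    return list(reversed(list(reversed(xs)))) == xs

def random_int_list(rng, max_len=20):
    """Hypothetical generator: a list of ints, biased toward boundary values."""
    edge_cases = [0, 1, -1, 2**31 - 1, -2**31]
    length = rng.randint(0, max_len)  # a length of 0 produces the empty list
    return [
        rng.choice(edge_cases) if rng.random() < 0.3 else rng.randint(-1000, 1000)
        for _ in range(length)
    ]

def check(prop, gen, runs=500, seed=0):
    """Return the first counterexample found, or None if every run passes."""
    rng = random.Random(seed)
    for _ in range(runs):
        xs = gen(rng)
        if not prop(xs):
            return xs
    return None

int_result = check(reverse_twice_is_identity, random_int_list)
print(int_result)
```

For integers no counterexample turns up, so `check` returns `None`. A real tool layers much smarter generators on top of the same basic idea.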
The explanation comes from the definition of floating-point values. They take a pattern of bits and interpret it as a number with a fractional part. However, not every bit pattern can be meaningfully interpreted this way. We term these troublesome patterns NaN: Not a Number. To my surprise, two NaN values can be bit-for-bit identical and still not compare equal; under IEEE 754, a NaN is not even equal to itself. Let that sink in. This seems justifiable, but it is counterintuitive and easily overlooked.
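The same surprise can be reproduced in a few lines of Python, offered here as a sketch rather than a transcript of my FsCheck session. One caveat is specific to Python: its built-in list equality treats two references to the same object as equal without asking `==`, so I compare element by element to mirror what I saw. The helper `elementwise_equal` is my own name.

```python
import struct

nan = float("nan")

# The two byte strings come from the same value, so they are identical.
same_bits = struct.pack(">d", nan) == struct.pack(">d", nan)

# Yet the value does not compare equal to itself.
nan_equals_itself = nan == nan

def elementwise_equal(xs, ys):
    """Compare two lists strictly with ==, one element at a time."""
    return len(xs) == len(ys) and all(x == y for x, y in zip(xs, ys))

xs = [1.0, nan]
round_trip = list(reversed(list(reversed(xs))))

# The reversal itself is correct; only the comparison objects to the NaN.
survives_round_trip = elementwise_equal(round_trip, xs)

# Python's list == short-circuits on object identity, which hides the issue.
list_eq_shortcut = round_trip == xs

print(same_bits, nan_equals_itself, survives_round_trip, list_eq_shortcut)
```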
We like to hope that we will consider all such pathological situations when devising our tests, but it’s all too easy to overlook something like NaN. That’s the benefit of random test generators like FsCheck. We might feel comfortable after testing a few dozen test cases. FsCheck generates hundreds. With more stuff getting tossed at the wall, there is a greater likelihood that something interesting sticks.
After FsCheck finds a failure, it takes a second step: It tries to reduce the amount of data needed to evoke the problem. I won’t dwell on this second step, because my point is that seeing errors like the one I just described with NaN helps us think more deeply about what a codebase is doing, particularly after the failing test has been reduced to its smallest form.
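For readers curious about that second step, a deliberately naive sketch of shrinking follows. FsCheck's real shrinker is more sophisticated than this; the greedy `shrink` function and the sample failing list are my own assumptions, meant only to show the idea of discarding data while the failure persists.

```python
import math

def elementwise_equal(xs, ys):
    """Compare two lists strictly with ==, one element at a time."""
    return len(xs) == len(ys) and all(x == y for x, y in zip(xs, ys))

def reverse_twice_is_identity(xs):
    return elementwise_equal(list(reversed(list(reversed(xs)))), xs)

def shrink(xs, prop):
    """Greedily drop one element at a time while the property still fails."""
    current = list(xs)
    progress = True
    while progress:
        progress = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if not prop(candidate):  # still a counterexample, so keep it
                current = candidate
                progress = True
                break
    return current

failing = [3.5, -1.0, float("nan"), 0.0, 42.0]  # hypothetical failing input
minimal = shrink(failing, reverse_twice_is_identity)
print(minimal)
```

Every element except the NaN can be removed without making the test pass, so the shrunken counterexample is a one-element list holding just the NaN. That is the form that makes the real cause hard to miss.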
In the case of a list of floating-point values, we might dismiss the NaN business as bogus. We might even be right to do so. But the point of running tests is to make us think about our codebase. Maybe we should add something on the edges of our system to keep NaN values out of it, or we should assure ourselves that NaN cannot arise from any transformation within it.
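A guard at the edge of the system might look something like this sketch. The function name `reject_nan` and the choice to raise `ValueError` are assumptions on my part; a given system may prefer to filter, log, or substitute a default instead.

```python
import math

def reject_nan(values):
    """Hypothetical boundary check: refuse any input that contains NaN."""
    for index, value in enumerate(values):
        if math.isnan(value):
            raise ValueError(f"NaN found at index {index}")
    return list(values)

clean = reject_nan([1.0, 2.5, -3.0])
print(clean)
```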
Our code has bugs because our thinking about it has blind spots. And even the world’s greatest automatic test-case generator will have blind spots, too. Happily, the machine and the human have different blind spots.
Tools like FsCheck or QuickCheck leverage a different paradigm of testing that uses the machine’s strengths to complement our weaknesses. Machines can algorithmically formulate massive sets of test data that would be expensive and boring to assemble manually. But they need property-based tests to consume all that data.
When I was a new programmer, I did not appreciate this. I was working on a statistical application that leaned heavily on Bayesian statistics and probability. There are some things you can always rely on with conditional probabilities. For one thing, probabilities always fall in the range of zero to one. For another, when we sum the conditional probabilities of every possible outcome under the same condition, we should get one. These properties are something we can check automatically, no matter what we toss at the code.
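I no longer have that application, so the following is a reconstruction of the idea rather than its code. It assumes a finite set of hypotheses and a hypothetical `posterior` function applying Bayes' rule, then checks both properties against randomly generated priors and likelihoods.

```python
import math
import random

def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses (hypothetical function under test)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def random_distribution(rng, n):
    """Generate n strictly positive weights normalized to sum to one."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def check_posterior_properties(runs=500, seed=0):
    rng = random.Random(seed)
    for _ in range(runs):
        n = rng.randint(1, 10)
        prior = random_distribution(rng, n)
        likelihood = [rng.uniform(0.01, 1.0) for _ in range(n)]
        post = posterior(prior, likelihood)
        # Property 1: every probability lies between zero and one.
        if not all(0.0 <= p <= 1.0 for p in post):
            return False
        # Property 2: the posterior sums to one, within floating-point tolerance.
        if not math.isclose(sum(post), 1.0, rel_tol=1e-9):
            return False
    return True

print(check_posterior_properties())
```

Note the tolerance in the second check: with floating-point arithmetic, "sums to one" has to mean "sums to very nearly one," which is itself an assumption worth making explicit in the test.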
Likewise, I recently had a conversation with some friends building a billing system. I asked some questions: Will all the line items be positive? Can we know that a subtotal of many line items will exceed a subtotal of fewer items? Do the totals stay the same regardless of the order in which the items are summed? I asked what other properties remain invariant that we can assert against. Every such invariant can serve as a basis of property-based tests. You can bet I will be throwing some negative prices at this system.
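I have not seen my friends' code, so this sketch invents a stand-in: a hypothetical `subtotal` function over line items stored as integer cents. Both the function and the integer-cents representation are assumptions. The two properties correspond to the order and growth questions above.

```python
import random

def subtotal(line_items_cents):
    """Hypothetical function under test: sum of line items in integer cents."""
    return sum(line_items_cents)

def order_independent(items, rng):
    """Property: shuffling the line items never changes the subtotal."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    return subtotal(shuffled) == subtotal(items)

def grows_with_more_items(items, extra):
    """Property: adding a line item makes the subtotal strictly larger."""
    return subtotal(items + [extra]) > subtotal(items)

rng = random.Random(0)
positive_items_ok = True
for _ in range(500):
    items = [rng.randint(1, 100_000) for _ in range(rng.randint(0, 20))]
    extra = rng.randint(1, 100_000)
    if not (order_independent(items, rng) and grows_with_more_items(items, extra)):
        positive_items_ok = False
        break

# A negative price (a credit or refund, say) breaks the growth property.
negative_price_breaks_growth = not grows_with_more_items([1000, 250], -500)

print(positive_items_ok, negative_price_breaks_growth)
```

The growth property holds only while every line item is positive, which is exactly why the first question matters. Whether negative line items are legitimate is a decision for the people who own the billing rules, and the failing test simply forces that conversation.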
A broad spectrum of property tests can teach us a lot about our codebase. Testing never establishes that a codebase is error-free, but testing does narrow where errors can hide. To learn the most, we must understand what the tests do and what they show. This can mean focusing on understanding our tests more than on multiplying tests that add only marginally to our understanding. Just as FsCheck reduces large failing tests to simpler forms, we ought to periodically review and simplify our test suites.
Does each test tell us something more than just the circumstances reproducing bug reports of years past? Can such tests be reorganized to clarify why we care about them? Our objective should be to curate tests to maximize the amount of understanding gleaned from each test’s success or failure.
I was surprised by NaN when my property-based test on floats failed. It was not because of a malfunction of the list reverse; it failed because of an assumption inherent within the test about how floating-point numbers are compared. This reminds us to seek out data that contradicts our assumptions. The more surprises we can incorporate into our testing, the more robust our systems will become. This entails marrying multiple diverse paradigms and capabilities to leverage the strengths of each.
Being surprised by NaN might seem to be bad news. It tripped us up with a failed test only indirectly related to the code we’re testing. But that failure reminds us of something about the software we may have forgotten. Such reminders transform surprise into good news.