State coverage
November 19, 2007
A widespread conception is that programs really have two sides: their source code and their runtime.
With most languages, the distinction is made quite clear by the role of the compiler: source code cannot be executed without being first compiled. But in some languages, LISP for example, both sides tend to merge. The trend in computer languages nowadays is to become more and more dynamic, implying that the distinction between source code and runtime execution becomes less and less clear. One often sees programs written in dynamic languages generating and compiling parts of themselves during runtime.
That affects the way software should be tested.
Historically, software testing has focused on testing the source code. Back at the dawn of software development, when developer time was way cheaper than computer time, most testing was done by reviewing the sources by hand. Source code review is still a widespread way of testing code, but it is now accepted as only one technique among many. Some programmers test their programs by running them a number of times with various inputs. More serious software teams write automated regression tests, have some form of periodic test build or practice some variant of test-driven development. Some have dedicated test teams. Some even use code coverage.
Code coverage is a technique that shows how much of a program's source code is really executed when the program runs. A coverage report typically looks like a copy of the program's source code with annotations (number of executions, boolean states, etc.) beside every line of code and every logical branch. Code coverage is a powerful tool when trying to write regression tests that cover as much as possible of a program's source. Still, code coverage, as the name states, mostly focuses on the source code.
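To make the idea concrete, here is a toy line counter in Python. It is only a sketch of the principle, not how any real coverage tool is implemented: it relies on the standard sys.settrace hook, and the names run_with_line_counts and classify are made up for the illustration.

```python
import sys
from collections import Counter


def run_with_line_counts(func, *args):
    """Call func(*args) and count how often each of its lines executes.

    Lines are keyed by their offset from the function's `def` line.
    """
    hits = Counter()
    code = func.__code__

    def tracer(frame, event, arg):
        # Only record "line" events that belong to the function under study.
        if event == "line" and frame.f_code is code:
            hits[frame.f_lineno - code.co_firstlineno] += 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, hits


def classify(n):               # offset 0
    if n < 0:                  # offset 1
        return "negative"      # offset 2
    return "non-negative"      # offset 3


result, hits = run_with_line_counts(classify, 5)
never_executed = [offset for offset in (1, 2, 3) if offset not in hits]
```

With a non-negative argument, the line at offset 2 never runs and ends up in never_executed. That is exactly the kind of hole a coverage report points at: a branch no test has exercised yet.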
So source code can be tested and verified in an almost provable way. But source code, except in the hypothetical case of a perfect program, almost never covers all possible situations that can occur during runtime. That leaves us with a huge and ever-growing gap of untested program behavior: things that happen at runtime, such as code that re-compiles itself, internal caches that self-alter, I/O errors, interrupts, hamsters gnawing at network cables, solar storms and other butterflies cruising on the other side of the earth.
What software testing should focus on is not so much the source code, but rather the states of a program. A running program can get into a huge number of different states. Some are very likely to happen, while most are extremely unlikely.
Some states are data induced: a given input leads to a given sequence of states in the program which leads to a given output. That is quite easy to test: just write a regression test that feeds your program some data and checks the program's reaction. Kind of.
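A data-induced test can be as small as the following sketch. The function normalize_whitespace is a hypothetical stand-in for "your program", and the input/output pairs are invented for the example; the point is only the shape: feed data in, check what comes out.

```python
import unittest


def normalize_whitespace(text):
    """Hypothetical function under test: collapse runs of whitespace."""
    return " ".join(text.split())


class NormalizeWhitespaceRegressionTest(unittest.TestCase):
    # Each pair is (input fed to the program, output we expect back).
    CASES = [
        ("hello world", "hello world"),
        ("  hello   world ", "hello world"),
        ("tabs\tand\nnewlines", "tabs and newlines"),
        ("", ""),
    ]

    def test_known_inputs(self):
        for given, expected in self.CASES:
            with self.subTest(given=given):
                self.assertEqual(normalize_whitespace(given), expected)
```

Saved in a file, this runs with python -m unittest. Every bug report that comes with a reproducing input can become one more line in CASES.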
But what about all the other states?
Let's take an example: let us assume you want to test what happens when your program tries to open a file while the operating system has run out of filehandles. That is tricky, but doable. In fact, your test will look very much like a classic exploit out of a black hat's toolbox. Now multiply this by the number of places in your code at which files are opened. It gets harder. And still, we are assuming in the first place that you actually got the idea of testing that particular, rather unlikely situation. And honestly, how often have you seen a developer team writing a test for filehandle exhaustion? I haven't. Yet this specific situation has been used to gain local root privileges in a number of well-documented exploits.
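For what it is worth, here is roughly what such a test could look like in Python on a Unix-like system. It is a sketch under assumptions: it only exhausts the per-process limit (via the standard resource module), not the system-wide one, and read_first_line and call_with_exhausted_filehandles are hypothetical names invented for the example.

```python
import errno
import os
import resource  # Unix-only module


def read_first_line(path, default=""):
    """Hypothetical code under test: must degrade gracefully, not crash."""
    try:
        with open(path) as handle:
            return handle.readline().rstrip("\n")
    except OSError:
        return default


def call_with_exhausted_filehandles(func, *args, limit=64):
    """Run func(*args) while this process cannot open any more files."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Lower the soft limit so the descriptor table fills up quickly.
    finite = [v for v in (soft, hard) if v != resource.RLIM_INFINITY]
    resource.setrlimit(resource.RLIMIT_NOFILE, (min([limit] + finite), hard))
    hoarded = []
    try:
        while True:
            try:
                hoarded.append(os.open(os.devnull, os.O_RDONLY))
            except OSError as error:
                if error.errno != errno.EMFILE:
                    raise
                break  # the descriptor table is now full
        return func(*args)
    finally:
        # Give the descriptors back and restore the original limit.
        for fd in hoarded:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


starved = call_with_exhausted_filehandles(read_first_line, os.devnull, "fallback")
```

Here starved should come back as "fallback", because open() fails with EMFILE inside the helper. Note how much the helper resembles a resource exhaustion attack: hoard everything, then poke the victim.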
Which brings us to the following point: some states of a program can be really hard to test, but the hardest part is to know which states are relevant to test. In practice, a program can take such a large number of states that it is simply impossible to enumerate them all. This implies that you will never be able to reach 100% state coverage. Which in turn means that you will never be able to prove a program completely bug free.
Full state coverage is a utopia.
Unlike code coverage, state coverage can't even be measured, since the number of states is possibly infinite.
But you can still strive to increase state coverage. And you should.
Starting by improving code coverage sounds right, since code coverage covers a relevant subset of all possible program states. For the other states, it is up to you and the other developers on your team to judge which states should be tested. It quickly becomes a matter of knowledge and experience: you need to have wrestled with some particular unlikely states in order to think of testing them in the future. Which brings us back to the long discussion of developer craftsmanship...
Notice too that although dynamic languages make our lives harder by increasing the number of states that should be tested, they also provide us with efficient tools to test them. Imagine testing a program's reaction to network failure. With a non-dynamic language you will end up running your program in a scripted virtual machine. With a dynamic language you will just redefine the network API at runtime. Much easier.
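In Python, for instance, "redefining the network API at runtime" could look like the sketch below. It uses unittest.mock.patch from the standard library to swap out urllib.request.urlopen for the duration of a with block. The function fetch_or_default, the helper simulate_network_failure and the URL are hypothetical, picked only for the illustration.

```python
import urllib.error
import urllib.request
from unittest import mock


def fetch_or_default(url, default=b""):
    """Hypothetical code under test: fall back when the network is down."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read()
    except urllib.error.URLError:
        return default


def simulate_network_failure():
    """Replace urllib.request.urlopen with a stand-in that always fails."""
    return mock.patch(
        "urllib.request.urlopen",
        side_effect=urllib.error.URLError("simulated network failure"),
    )


# Inside the block, every call to urlopen raises URLError; no packet is sent.
with simulate_network_failure() as fake_urlopen:
    body = fetch_or_default("http://example.invalid/", default=b"offline")
```

Once the with block exits, the real urlopen is back in place, so the rest of the test suite is unaffected. No virtual machine, no unplugged cables.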