
Probabilistic Analysis of Software Testing for a Rare Condition 

Lee White 
December 3, 1997 
Handout and Step-by-Step Analysis: 
 

Problem Statement 

Suppose we have a system that monitors a stream of inputs with the objective of detecting a particular condition "C".  You can imagine that this system is implemented mainly as software and that the specification and implementation of condition "C" are very complex.  For instance, the system could be a computer-based system in a chemical plant that must detect the rare occurrence of a dangerous combination of physical conditions (involving temperature, pressure and viscosity in producing a particular chemical). 

To gain confidence that the implementation is correct with respect to its specification, we want to perform some lab testing.  In particular, we are interested in false positives, i.e., instances where the system incorrectly reports that the inputs satisfy condition "C".  We have three fundamental 
questions: 
 

    1)  How do we quantify (in a mathematical way) the level of confidence that we could have in our lab test results and hence subsequent field test results? 

    2)  How many lab test cases are required to achieve this level of confidence? 

    3)  What assumptions would be needed to make the lab test inputs representative of normal operation?

With respect to 1), we want to determine the number of lab tests that would be needed to conclude that "For 1000 inputs, the system will detect at most one false positive" with some mathematical confidence in this conclusion.  Consider an analogy between this problem and the way that opinion polls are used to draw conclusions about the outcome of an election.  For instance, the conclusion might be "If the system will generate more than one false positive per 1000 inputs during its normal operation, then 19 times out of 20, we will encounter at least one false positive by applying a set of n lab tests to the system."  Is this an appropriate way to quantify our confidence? 

With respect to 2), how do we compute n in the above statement?  Ideally, we would like to have a formula with a variable R for the "acceptable rate for false positives", e.g., R = 0.1%  (i.e., 1 in 1000 false positives). 
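
As a sketch of where such an n might come from (this reconstruction is ours, not part of the handout): if each lab test independently produces a false positive with probability at least R, then the chance of at least one false positive in n tests is 1 - (1-R)^n, and we can solve for the smallest n making this at least 95%.

    import math

    def tests_needed(R, ALPHA):
        # Require P(at least one false positive in n tests) >= ALPHA when
        # each test independently fails with probability R:
        #     1 - (1 - R)**n >= ALPHA
        return math.ceil(math.log(1.0 - ALPHA) / math.log(1.0 - R))

    print(tests_needed(R=0.001, ALPHA=0.95))    # -> 2995 lab tests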

With respect to 3), for the analogy with opinion polls to hold up, we will have to assume that the lab test cases are statistically independent and also representative of the kinds of inputs that would be applied to the system during normal operation.  What issues need to be taken into account to ensure this holds, and what does this imply about the statement of this problem? 

First Model: 

1)  Description of 4 Cases: 
 

  • False Positive Condition:  prob. p1,  very small 
  • Misclassification of C Condition:  prob. p2,  very small 
  • C Detected Correctly:  prob. p3,  small 
  • Not C Detected Correctly:  prob. p4,  large. 
2)  Math detail:  How to represent the probability of this entire model? 

This will be a multinomial distribution in these four probabilities. 

3)  However, because these four events are mutually disjoint and exhaustive, the marginal distribution of the count of false positives is simply a Binomial distribution with parameter p1.  As the problem is symmetric in all the variables, the same analysis will work for all cases, most importantly for p2.  
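
As an illustrative check (our own sketch, with assumed values for p1 through p4), the marginal count of false positives under multinomial sampling can be compared numerically against the corresponding Binomial distribution.

    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(0)
    p = [0.001, 0.001, 0.01, 0.988]       # assumed p1..p4; they sum to 1
    n = 1000
    # Marginal count of case 1 (false positives) over many multinomial draws
    counts = rng.multinomial(n, p, size=100_000)[:, 0]
    print(counts.mean(), n * p[0])                       # means agree
    print((counts == 1).mean(), binom.pmf(1, n, p[0]))   # P(exactly one) agrees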

4)  Now either we know probability p1 a priori, or we don't; next consider what is involved in estimating p1, and the number of tests required. 

Estimate p1 Parameter for First Model: 
 
5)  Before we answer the question involving the number of lab tests, we need to answer another question in the problem statement:  What are the assumptions needed to make lab tests representative of normal operation? 

  a)  Each test should be statistically independent of other tests. 

  b)  Statistical properties of the system should not vary with time. 

  c)  We need to know the OPERATIONAL INPUT DISTRIBUTION of the system. 

6)  The most important and subtle of these assumptions is c):  it requires a complete description of the probabilities of all the inputs to a system.  An example is given of a telephone switching system. 

7)  For our analysis, we will assume the first model describes the operational input distribution, unless we are given more information. 

8)  The Binomial distribution is rather tedious to use for computation.  We will try to approximate the Binomial distribution with either the Normal or the Poisson distribution.   If n*p1*(1-p1) > 50, we can use the Normal approximation, which is easy to compute. 
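
A quick numerical comparison of the two approximations (a sketch; the values of n and p1 are assumed, chosen so that n*p1*(1-p1) exceeds 50):

    import math
    from scipy.stats import binom, norm, poisson

    n, p1 = 130_000, 0.001                # assumed; n*p1*(1-p1) is about 130
    mu = n * p1
    sigma = math.sqrt(n * p1 * (1 - p1))
    for k in (110, 130, 150):
        print(k,
              binom.cdf(k, n, p1),              # exact Binomial
              norm.cdf(k + 0.5, mu, sigma),     # Normal, continuity-corrected
              poisson.cdf(k, mu))               # Poisson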

9)  In deciding how many tests are required, we need to get an estimate p1' which is close to p1: 
 

    |p1 - p1'| < DELTA 
10)  But no test sample size can guarantee that p1' is this close; the best we can do is to require that the probability ALPHA of satisfying this bound be very large; e.g., ALPHA = 95% is standard. 

11)  How small should DELTA be?  It makes sense that if p1 is 0.001, then DELTA should be 0.00025, or more generally, 25% of the estimated value of p1. 

12)  Once the problem has been set up in this way, with values of DELTA and ALPHA selected, there are a number of methods for estimating n; we will use and illustrate one such method; the details of doing this estimation are not as important as setting up the problem. 

13)  The number of tests n is now given for p1 = 0.01, 0.001, and 0.0001.  It is shown that the factor n*p1*(1-p1) is 130 for our assumptions, well above the criterion of 50; so the use of the Normal approximation is justified. 
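
One standard sample-size method based on the Normal approximation is sketched below; this is our reconstruction, and the seminar's own settings evidently differ somewhat, since it reports a factor of 130.

    import math
    from scipy.stats import norm

    def sample_size(p1, ALPHA=0.95, rel=0.25):
        DELTA = rel * p1                  # DELTA as 25% of p1, per point 11
        z = norm.ppf((1 + ALPHA) / 2)     # two-sided Normal critical value
        return math.ceil(z**2 * p1 * (1 - p1) / DELTA**2)

    for p1 in (0.01, 0.001, 0.0001):
        n = sample_size(p1)
        print(p1, n, n * p1 * (1 - p1))   # check against the criterion of 50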

Analysis of First Model: 

14)  Now that we can assume that p1 is known accurately, what is the effect of running tests on the system, and how can we describe the outcome of these tests?  Again, how many tests are required for certain levels of false positive rates? 

15)  Again we use the Poisson and Normal approximations to the Binomial, but this time we will find that the number of tests is smaller and the Normal approximation is not so accurate; the Poisson approximation, however, is. 
  
16)  Tables are given for rates of fewer than: 
 

    1.5 and 2.0  false positives/1000 tests;  
the tables show how to interpret the data by analogy with polling data; the number of tests required is large, but not as large as for estimating very small probability parameters. 
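
A sketch of how entries in such a table might be computed with the Poisson approximation (our reconstruction): find the smallest n such that, if the true rate were R false positives per test, we would see at most one in n tests no more than 1 time in 20.

    from scipy.stats import poisson

    def tests_for_rate(R, k=1, ALPHA=0.95):
        # Smallest n such that, at true rate R per test, observing at most
        # k false positives has probability <= 1 - ALPHA.
        n = k + 1
        while poisson.cdf(k, n * R) > 1 - ALPHA:
            n += 1
        return n

    for R in (1.5e-3, 2.0e-3):            # fewer than 1.5 and 2.0 per 1000
        print(R, tests_for_rate(R))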

Second Model: 

1)  Given condition C: 
 

    pc:  Probability that a random test involves condition C  (small) 

    pfp: Probability of reporting C when C has not occurred (related to false positives, and very small) 

    pmc: Probability of reporting not C when C has actually occurred (related to misclassifying C condition, and very small) 

2)  For n tests, the expected number of false positives is: 
 
    n*pfp*(1-pc) 
and the expected number of misclassifications of condition C is: 
 
    n*pmc*pc 
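
For concreteness, a small arithmetic sketch with assumed illustrative values:

    n, pc, pfp, pmc = 10_000, 0.01, 1e-4, 1e-4   # assumed values
    print(n * pfp * (1 - pc))     # expected false positives         -> 0.99
    print(n * pmc * pc)           # expected misclassifications of C -> 0.01
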
3)  This is a more complex model, but satisfies Jeff Joyce's intuition that the solution should depend upon pc (corresponds to his R parameter). 

   
Analysis of Second Model: 
 

4)  We encounter a new problem with this model if we try to estimate probability parameters to find the probability of false positives (or probability of misclassification of the C condition): 
 

    How to estimate pc? 
We can submit random tests, but because of misclassifications of C and false positives, we cannot determine pc directly from the observed positives. 

5)  Assume that we can control the C condition during the lab tests; then here is a method to determine the desired probabilities: 

  a)   Find pfp ACCURATELY with the C condition off; count positives; how many tests? 

  b)   Find pmc ACCURATELY with the C condition on; count negatives. 

  c)  Next submit random tests and count positives; the expected number of positives is given by the following formula:   
 

    n[pc*(1-pmc) + pfp*(1-pc)] 
  d)   Then solve for pc. 

  e)   Finally evaluate pfp*(1-pc) for false positive rate. 
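
A sketch of steps c) through e) (the variable names and values are ours): given pfp and pmc from the controlled runs and P observed positives in n random tests, invert the formula above for pc.

    def estimate_pc(P, n, pfp, pmc):
        # Invert  P/n = pc*(1-pmc) + pfp*(1-pc)  for pc (step d).
        return (P / n - pfp) / (1 - pmc - pfp)

    pfp, pmc = 1e-4, 2e-4         # assumed results of steps a) and b)
    P, n = 1020, 100_000          # assumed positives counted in step c)
    pc = estimate_pc(P, n, pfp, pmc)
    print(pc)                     # step d): estimate of pc
    print(pfp * (1 - pc))         # step e): false positive rate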

    
6)  Discuss Sampling Inspection Problem (if time), and show relevance to the Second Model: 

Two Cases:  Defective item or not (analogous to condition C) 

Two Cases:  Inspected item or not (analogous to reporting condition C) 

(Note:  Second Model is slightly more complex, since there are four cases analogous to inspection, not two.) 

  • Consider a conveyor belt containing items, both satisfactory and defective; 
  • Inspector randomly selects items to inspect; 
  • Process goes on until a defective item is found by inspection; then it stops; 
  • Ask two questions about the process (see the sketch below): 
      • How many items have passed by on the conveyor during inspection? 
      • How many undiscovered defective items have passed by uninspected? 
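
A small Monte Carlo sketch (an illustration with assumed parameters d and q) that answers both questions empirically:

    import random

    rng = random.Random(0)

    def run_once(d=0.01, q=0.1):
        # d: P(an item is defective); q: P(an item is inspected) -- assumed.
        passed = missed = 0
        while True:
            passed += 1
            if rng.random() < d:              # item is defective...
                if rng.random() < q:          # ...and gets inspected:
                    return passed, missed     # inspection finds it; stop
                missed += 1                   # ...else it passes uninspected

    runs = [run_once() for _ in range(10_000)]
    print(sum(p for p, m in runs) / len(runs))   # items passed; about 1/(d*q)
    print(sum(m for p, m in runs) / len(runs))   # missed defectives; about (1-q)/q
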
Conclusions