D. Advanced Topics Copyright 2014 Norsys Software Corp.

2. Testing nets with cases

The purpose of this test is to grade a belief network using a set of real cases to see how well the predictions or diagnoses of the net match the actual cases. It is not for decision networks.

The test allows you to spot weaknesses in your net. With it you can find the nodes whose predictions are weakly correlated with reality. You may want to reexamine their conditional probability tables, supply additional data for learning, or not make use of their predictions until their performance improves.

The basic idea of this test is that we divide the net's nodes into two classes: observed and unobserved. The observed nodes will be "observed" in the sense that they will have their values read from the case file. These observed values are then used to predict the values of the unobserved nodes by using Bayesian belief updating. This process is repeated for each case in the case file. For each such case, we compare the predicted values for the unobserved nodes with those that were actually observed in the case file. We record all successes and failures. These statistics are gathered up and presented in a final report that describes, for each unobserved node, how well it performed, that is, how often the predictions were accurate.

The procedure for performing the test is as follows:

  1. Select the nodes you do not wish the network to know the value of during its inference. These are the unobserved nodes. For example, if the network is for medical diagnosis, you might select the disease node and nodes representing other unobservable internal states. Often you are interested in how the net behaves in a realistic prediction setting, so you choose as unobservables what in a real-world context would be unobservable. But you needn't be restricted by this rule. Any subset of nodes can be amongst the unobserved nodes. There must always be at least one unobserved node, of course, since otherwise there is nothing to predict and the test would be pointless.
  2. Choose Network->"Test Using Cases". You will be asked which case file to use, and after you choose one, Netica will start processing. The Messages window will come to the front and display the fraction of cases processed so far. Hold down CTRL+SHIFT+LEFT BUTTON at the same time if you want to stop processing cases and print the results obtained so far.
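The procedure above can be sketched in a few lines of Python. This is only an illustration of the observed/unobserved split, not Netica's implementation: the names grade_net and toy_predict, the "Spark" node, and the case dictionaries are hypothetical, and the toy predictor stands in for real Bayesian belief updating.

```python
from collections import Counter

def grade_net(cases, unobserved, predict):
    """Tally right and wrong predictions for each unobserved node.

    cases      -- list of dicts mapping node name -> actual state
    unobserved -- set of node names hidden from the predictor
    predict    -- hypothetical callable: (findings, node) -> {state: belief}
    """
    tally = {node: Counter() for node in unobserved}
    for case in cases:
        # Only the observed nodes are passed in as findings.
        findings = {n: s for n, s in case.items() if n not in unobserved}
        for node in unobserved:
            if node not in case:      # the case supplies no value: skip it
                continue
            beliefs = predict(findings, node)
            guess = max(beliefs, key=beliefs.get)   # state with highest belief
            tally[node]["right" if guess == case[node] else "wrong"] += 1
    return tally

def toy_predict(findings, node):
    # Stand-in for belief updating (hypothetical rule and node names).
    if findings.get("Spark") == "weak":
        return {"good": 0.2, "bad": 0.8}
    return {"good": 0.9, "bad": 0.1}

cases = [
    {"Spark": "weak", "SpkQual": "bad"},
    {"Spark": "ok", "SpkQual": "good"},
    {"Spark": "ok", "SpkQual": "bad"},
]
result = grade_net(cases, {"SpkQual"}, toy_predict)
print(dict(result["SpkQual"]))
```

Passing the predictor in as a callable keeps the grading loop independent of how beliefs are actually computed.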

When Netica is done, it will print a report for each of the unobserved nodes (an example report and its analysis appear at the bottom of this page). The report includes the following:

confusion matrix
error rate
calibration table
logarithmic loss score
quadratic (Brier) score
spherical payoff score
surprise indexes
test sensitivity

For binary nodes it also reports:

test specificity
predictive value
predictive value negative

Contact Norsys for an easy way to produce receiver operating characteristic (ROC) plots.

The following section explains in greater detail each section of the report. This is done in the context of a sample report, for ease of illustration.

Sample Report

Here is an example report for the node named "SpkQual" (with node title "Spark quality") in the tutorial net CarDiagnosis.dne.

For SpkQual:       Spark quality

    good     bad  very_b    Actual
  ------  ------  ------    ------
     253       0       0    good
      22     176       4    bad
      13      19     430    very_bad

Error rate = 6.325%

Scoring Rule Results:
  Logarithmic loss = 0.2144
  Quadratic loss   = 0.1099
  Spherical payoff = 0.9409

  good      0-0.5:   0    | 0.5-1:    0    | 1-2:     0    | 2-5:     0    |
            5-80:    49   | 80-95:    87.5 | 95-98:   95.7 | 
  bad       0-1:     0    | 1-2:      1.52 | 2-5:     2.4  | 5-10:    5.17 |
            10-50:   20   | 50-85:    82.6 | 85-95:   90   | 95-100:  100  | 
  very_bad  0-0.1:   0    | 0.1-0.5:  0    | 0.5-5:   6.94 | 5-10:    9.33 |
            10-20:   16.2 | 20-95:    83.3 | 95-98:   98.9 | 98-99:   100  |
            99-100:  100  | 
  Total     0-0.1:   0    | 0.1-0.5:  0    | 0.5-1:   0    | 1-2:     0.431|
            2-5:     2.5  | 5-10:     6.28 | 10-15:   10.9 | 15-20:   13.3 |
            20-50:   30.1 | 50-80:    81.5 | 80-90:   86   | 90-95:   93.7 |
            95-98:   97.6 | 98-99:    100  | 99-100:  100  | 

Times Surprised (percentage):
               .................Predicted Probability...................
  State        < 1%             < 10%             > 90%            > 99%
  -----        ----             -----             -----            -----
  good         0.00 (0/312)     0.00 (0/614)      6.86 (14/204)    0.00 (0/0)
  bad          0.00 (0/225)     1.98 (13/657)     0.00 (0/69)      0.00 (0/0)
  very_bad     0.00 (0/216)     3.32 (12/361)     0.25 (1/399)     0.00 (0/31)
  Total        0.00 (0/753)     1.53 (25/1632)    2.23 (15/672)    0.00 (0/31)

Analysis of Sample Report

Confusion Matrix. The possible states of Spark Quality are good, bad and very_bad. For each case processed, Netica generated beliefs for each of these states. The most likely state (i.e. the one with the highest belief) was chosen as its prediction for the value of Spark Quality. This was then compared with the true value of Spark Quality for that case, provided the case file could supply it. The confusion matrix supplies the total number of cases in each of the 9 situations: (Predicted=good, Actual=good), (Predicted=bad, Actual=good), etc. If the network is performing well then the entries along the main diagonal will be large compared to those off it.

Error Rate. In our sample report, the error rate is 6.325%. This means that in 6.325% of the cases for which the case file supplied a Spark Quality value, the network predicted the wrong value, where the prediction was taken as the state with highest belief (same as for the confusion matrix).
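As a check on the arithmetic, the error rate can be recomputed from the confusion matrix of the sample report. The short Python sketch below is illustrative only; it assumes, as in the report, that rows are actual states and columns are predicted states.

```python
# Rows are actual states, columns are predicted states (good, bad, very_bad),
# copied from the sample report.
confusion = [
    [253,   0,   0],   # actual good
    [ 22, 176,   4],   # actual bad
    [ 13,  19, 430],   # actual very_bad
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))  # main diagonal
error_rate = 100.0 * (total - correct) / total

print(f"{total - correct} wrong out of {total}: {error_rate:.3f}%")
# prints: 58 wrong out of 917: 6.325%
```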

Scoring Rule Results. These scores don't just take the most likely state as a prediction, but rather consider the actual belief levels of the states in determining how well they agree with the value in the case file. These results are calculated in the standard way for scoring rules. For more information see any reference on scoring rules, such as Morgan&Henrion90 or Pearl78.

Logarithmic and Quadratic loss, and Spherical Payoff. The logarithmic loss values were calculated using the natural log, and are between 0 and infinity inclusive, with zero indicating the best performance. Quadratic loss (also known as the Brier score) is between 0 and 2, with 0 being best, and spherical payoff is between 0 and 1, with 1 being best.

Their respective equations are:

Logarithmic loss = MOAC [- log (Pc)]
Quadratic loss   = MOAC [1 - 2 * Pc + sum[j=1 to n] (Pj ^ 2)]
Spherical payoff = MOAC [Pc / sqrt (sum[j=1 to n] (Pj ^ 2))]

where Pc is the probability predicted for the correct state, Pj is the probability predicted for state j, n is the number of states, and MOAC stands for the mean (average) over all cases (i.e. all cases for which the case file provides a value for the node in question).
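The three equations translate directly into code. The following Python sketch is a plain transcription of them under the definitions just given; the function name scoring_rules and the input layout (one belief list and one correct-state index per case) are assumptions made for the example, not part of Netica.

```python
import math

def scoring_rules(predictions):
    """Return (logarithmic loss, quadratic loss, spherical payoff).

    predictions -- list of (beliefs, c) pairs, one per case, where beliefs is
                   the list of Pj values and c is the index of the correct state.
    """
    n_cases = len(predictions)
    log_loss = quad_loss = spherical = 0.0
    for beliefs, c in predictions:
        p_c = beliefs[c]
        sum_sq = sum(p * p for p in beliefs)          # sum over j of Pj^2
        log_loss += -math.log(p_c) if p_c > 0 else math.inf
        quad_loss += 1 - 2 * p_c + sum_sq
        spherical += p_c / math.sqrt(sum_sq)
    # MOAC: mean over all cases
    return log_loss / n_cases, quad_loss / n_cases, spherical / n_cases

log_l, quad_l, sph = scoring_rules([
    ([0.7, 0.2, 0.1], 0),
    ([0.1, 0.8, 0.1], 1),
])
print(f"{log_l:.4f} {quad_l:.4f} {sph:.4f}")
```

A case in which the correct state was given a belief of zero makes the logarithmic loss infinite, which is why that case is handled separately in the loop.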

Calibration. This indicates whether the confidence expressed by the network is appropriate (i.e. "well calibrated"). For instance, if the network were forecasting the weather, you might want to know: Of all the times it said 30% chance of rain, what percentage of times did it rain? If there were lots of cases, the answer should be close to 30%. For each state of the node there are a number of items separated by vertical bars (|). Each item consists of a probability percentage range R, followed by a colon (:) and then a single percentage X. It means that of all the times the belief for that state was within the range R, the true value was that state X percent of the time.

For instance

    rain   0-10:  8.5 |

means that of all the times the belief for rain was between 0 and 10%, it rained 8.5% of those times. The probability ranges are uneven, and differ from state to state and from run to run, because they are chosen so that the X percentages are reasonably accurate. The bin sizes have to adapt, or too few cases might fall in a given bin. The more cases you process, the finer the probability ranges will be. Calibration results are often drawn as a graph (known as a "calibration curve") where ideal calibration is a straight diagonal line. For more information, see a text that discusses probability "calibration", for example Morgan&Henrion90, p.110, referenced above.
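A calibration row can be sketched as follows in Python. As a simplification, this example uses bin edges supplied by the caller, whereas Netica chooses the ranges adaptively as described above; the function name calibration and the rain data are made up for illustration.

```python
def calibration(beliefs, outcomes, edges):
    """For each belief range, return the percentage of cases in which the
    state actually occurred.

    beliefs  -- belief (0..1) assigned to one state, one entry per case
    outcomes -- True where that state was the actual value
    edges    -- ascending bin edges as probabilities, e.g. [0.0, 0.1, 0.5, 1.0]
    """
    table = []
    for lo, hi in zip(edges, edges[1:]):
        last = hi == edges[-1]            # the top bin includes its upper edge
        hits = [o for b, o in zip(beliefs, outcomes)
                if lo <= b < hi or (last and b == hi)]
        if hits:                          # leave out ranges with no cases
            label = f"{100 * lo:g}-{100 * hi:g}"
            table.append((label, 100.0 * sum(hits) / len(hits)))
    return table

rain_beliefs = [0.05, 0.08, 0.02, 0.30, 0.40, 0.90, 0.95, 0.85]
it_rained = [False, False, True, False, True, True, True, True]
table = calibration(rain_beliefs, it_rained, [0.0, 0.1, 0.5, 1.0])
for label, pct in table:
    print(f"rain   {label}:  {pct:.3g} |")
```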

Times Surprised Table. This indicates how often the network was quite confident in its beliefs, but was wrong. There are columns for being 90% confident and 99% confident (i.e. beliefs are greater than 90% or 99% respectively), and also for being 90% and 99% confident that the value of the node will _not_ be a certain state (i.e. beliefs are less than 10% or 1% respectively). The ratios indicate the number of times it was wrong out of the number of times it made such a confident prediction, and a percentage is also printed. If the network is performing well these percentages will be low, but keep in mind that it is very reasonable to be wrong with a particular 10% or 90% prediction 10% of the time, and to be wrong with a particular 1% or 99% prediction 1% of the time. If the network rarely makes strong predictions (i.e. beliefs are rarely close to 0 or 1), then most of these ratios will be 0/0.
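One entry of this table can be computed as in the Python sketch below. It follows the description above (a surprise is a confident prediction that turned out wrong); the function name times_surprised and the sample data are hypothetical.

```python
def times_surprised(beliefs, outcomes, threshold, confident_yes):
    """Return (surprises, confident_predictions) for one state.

    beliefs       -- belief (0..1) for the state, one entry per case
    outcomes      -- True where the state was the actual value
    confident_yes -- True:  count beliefs above threshold (e.g. 0.90) and
                            treat "state did not occur" as a surprise
                     False: count beliefs below threshold (e.g. 0.10) and
                            treat "state occurred" as a surprise
    """
    surprises = made = 0
    for belief, occurred in zip(beliefs, outcomes):
        confident = belief > threshold if confident_yes else belief < threshold
        if confident:
            made += 1
            if occurred != confident_yes:
                surprises += 1
    return surprises, made

beliefs = [0.95, 0.97, 0.92, 0.05, 0.03, 0.50]
outcomes = [True, True, False, False, True, True]
print(times_surprised(beliefs, outcomes, 0.90, True))    # (1, 3)
print(times_surprised(beliefs, outcomes, 0.10, False))   # (1, 2)
print(times_surprised(beliefs, outcomes, 0.99, True))    # (0, 0)
```

The last call shows how a 0/0 entry arises when no belief is strong enough to count as a confident prediction.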

Quality of Test (binary nodes) or Test Sensitivity (nodes with more than 2 states). These reports are useful when the output of the network is going to be used to decide an action, with one action corresponding to each state of the node. As a medical example, the node may be "Disease-A" and have the two states "Present" and "Absent". If, after updating for a case, the network reports "Present", then a particular treatment will be started, but if it reports "Absent" then the treatment won't be started. The question is, at what probability for "Present" should we say that the network is reporting Present?

The confusion matrix and error rate discussed above were determined using the most likely state (i.e. the one with highest belief after updating). For a binary variable, this means choosing the first state only if its belief is higher than 50%. But if each state has a different cost of misclassification, you may not want the cutoff probability to be 50%. In the medical example, it may be disastrous to not treat a patient who has the disease, but not that serious if he is treated unnecessarily. So you would like the network to report "Present" if the probability of the disease is above some small number, like 2%. It is a matter of trading off the rate of false positives against the rate of false negatives.
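To see where a number like 2% could come from, suppose for illustration that the two misclassification costs were known and that correct decisions cost nothing. The standard expected-cost argument then gives the cutoff directly, as in this Python sketch. The cost figures are invented for the example and the function name cutoff_from_costs is hypothetical.

```python
def cutoff_from_costs(cost_false_positive, cost_false_negative):
    """Belief above which reporting the positive state has lower expected cost.

    With belief p in the positive state, reporting positive costs
    (1 - p) * cost_false_positive on average, and reporting negative costs
    p * cost_false_negative. The two are equal at
    p = cost_false_positive / (cost_false_positive + cost_false_negative).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Assumed costs: missing the disease is 49 times worse than treating needlessly.
cutoff = cutoff_from_costs(1.0, 49.0)
print(f"report 'Present' when belief exceeds {cutoff:.0%}")
```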

Ideally you would just convert the network to a decision network, by adding a decision node for the action to be taken and a utility node for the cost of misclassification. However, at the time the network is constructed and being graded as to its usefulness, the utilities may not be known.

The "Quality of Test" section has performance results for a series of cutoff threshold probabilities (which run vertically in the first column). For each case, the beliefs given by the network are converted to a "prediction". The prediction is "first state" if the belief for the first state is higher than the cutoff probability, and "second state" if it's lower. You may want to change the order of the states, so that the first state is the "positive" one, to better match conventional meanings. The meanings of the columns are:

Sensitivity: Of the cases whose actual value was the first state, the fraction predicted correctly.
Specificity: Of the cases whose actual value was the second state, the fraction predicted correctly.
Predictive Value: Of the cases the network predicted as first state, the fraction predicted correctly.
Predictive Value Negative: Of the cases the network predicted as second state, the fraction predicted correctly.
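The four columns can be computed for any one cutoff as in the following Python sketch, which applies the definitions above to a list of beliefs for the first state. The function name quality_of_test and the sample data are assumptions made for illustration; a belief exactly equal to the cutoff is counted here as "second state".

```python
def quality_of_test(beliefs, actual_first, cutoff):
    """beliefs      -- belief (0..1) in the first state, one entry per case
    actual_first -- True where the actual value was the first state
    Returns the four measures (None where no case qualifies)."""
    tp = fp = tn = fn = 0
    for belief, is_first in zip(beliefs, actual_first):
        predicted_first = belief > cutoff
        if predicted_first and is_first:
            tp += 1
        elif predicted_first:
            fp += 1
        elif is_first:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "predictive_value": ratio(tp, tp + fp),
        "predictive_value_negative": ratio(tn, tn + fn),
    }

beliefs = [0.9, 0.6, 0.3, 0.04, 0.01, 0.2, 0.7, 0.05]
actual_first = [True, True, True, True, False, False, False, False]
for cutoff in (0.02, 0.5):
    print(cutoff, quality_of_test(beliefs, actual_first, cutoff))
```

Lowering the cutoff from 50% to 2% in this made-up data raises sensitivity from 0.5 to 1.0 while specificity falls from 0.75 to 0.25, which is the trade-off between false negatives and false positives described earlier.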

Often this data is summarized with a graph called the ROC (receiver operating characteristic) curve. To use Excel (available from Microsoft) to create the ROC curve from this data, select the whole table (except headings) and while holding down the key, type tcz. Then open the Excel file called "Graph_ROC.xls" (available from the Norsys file downloads directory), paste into the indicated cell, and the graph will be drawn.

If the node has more than 2 states, you will instead get a "Test Sensitivity" section. The first number of each "column" is the cutoff threshold probability. The second number of each column is the number of cases whose actual value was the state given at the left-hand side of the row, and which the network correctly predicted to be in that state (i.e. its belief was greater than the cutoff probability), divided by the total number of cases whose actual value was that state.

It may seem awkward that the cutoff probability changes in strangely sized jumps. The reason is that Netica only reports on values for which it was able to gather enough data. So running the test using a greater number of cases generally results in finer divisions of the cutoff column.

NOTES on Net Testing

  • If you have any findings entered before choosing Network -> Test With Cases they will be taken into account during all belief updating (unless the case file has a column for that node). Netica will warn you in this event, so that you don't obtain wrong results by inadvertently leaving some findings in the network. A situation in which you would want to leave a finding in the network is if the network is designed for a broader class of cases than the case file. For example, if you have a network designed to handle people of both genders (and it has a 'gender' node), but the case file contains females only, you should enter a finding of 'female' for the 'gender' node before grading the network.
  • If the findings for the observed nodes of a case in the case file are impossible according to the network, then an inconsistent-findings error message will be displayed, that case will be ignored, and processing will continue. If the network makes predictions for the unobserved nodes that are inconsistent with the case file, then no error message will be generated; the network will simply be graded more poorly (and have a logarithmic loss of INFINITY).
  • Depending on your application, any of the measures calculated could be the most valuable to you. However, if you want a single number to grade a network, and aren't sure which one to pick, we suggest the logarithmic loss.
  • This function will properly support a 'NumCases' column in the case file, if one is present.
  • As well as grading a network, this feature can also be used to determine the value of combinations of findings nodes in a real-world environment. By selecting extra nodes in the first step, you can make some possible findings from the case file unavailable to the network. Then you can see how much the results of the network are degraded by not having access to those findings. In the medical example mentioned earlier, you might additionally select the nodes 'Blood Test' and 'Smear Test', and then compare the new confusion matrix generated with the old one, to find if the number of false negatives and false positives of serious diseases changed significantly.
  • This feature is also available to programmers using the Netica API; contact Norsys for more information.

More complete documentation will be available on Net Testing at a later date. If you have any questions, suggestions, or requests, or if you find any problems with the software as it exists, please email Norsys.
