|D. Advanced Topics||Copyright © 2015 Norsys Software Corp.|
2. Testing nets with cases
The purpose of this test is to grade a belief network using a set of real cases to see how well the predictions or diagnosis of the net match the actual cases. It is not for decision networks.
The test allows you to spot weaknesses in your net. With it you can find the nodes whose predictions are weakly correlated with reality. You may want to reexamine their conditional probability tables, supply additional data for learning, or not make use of their predictions until their performance improves.
The basic idea of this test is that we divide the net's nodes into two classes: observed and unobserved. The observed nodes are "observed" in the sense that their values are read from the case file. These observed values are then used to predict the values of the unobserved nodes by Bayesian belief updating. This process is repeated for each case in the case file. For each case, we compare the predicted values of the unobserved nodes with the values actually recorded for them in the case file, and record every success and failure. These statistics are gathered and presented in a final report that describes, for each unobserved node, how well it performed, that is, how often its predictions were accurate.
The procedure for performing the test is as follows:
When Netica is done, it will print a report for each of the unobserved nodes (an example report and its analysis appear at the bottom of this page). The report includes the following:
For binary nodes it also reports:
Contact Norsys for an easy way to produce receiver operating characteristic (ROC) plots.
The following section explains in greater detail each section of the report. This is done in the context of a sample report, for ease of illustration.
Here is an example report for the node named "SpkQual" (with node title "Spark quality"), circled above, in the tutorial net CarDiagnosis.dne.
For SpkQual: Spark quality
-----------

Confusion:
  .......Predicted......
  good    bad     very_b    Actual
  ------  ------  ------    ------
  253     0       0         good
  22      176     4         bad
  13      19      430       very_bad

Error rate = 6.325%

Scoring Rule Results:
  Logarithmic loss = 0.2144
  Quadratic loss   = 0.1099
  Spherical payoff = 0.9409

Calibration:
  good      0-0.5: 0 | 0.5-1: 0 | 1-2: 0 | 2-5: 0 | 5-80: 49 | 80-95: 87.5 | 95-98: 95.7 |
  bad       0-1: 0 | 1-2: 1.52 | 2-5: 2.4 | 5-10: 5.17 | 10-50: 20 | 50-85: 82.6 | 85-95: 90 | 95-100: 100 |
  very_bad  0-0.1: 0 | 0.1-0.5: 0 | 0.5-5: 6.94 | 5-10: 9.33 | 10-20: 16.2 | 20-95: 83.3 | 95-98: 98.9 | 98-99: 100 | 99-100: 100 |
  Total     0-0.1: 0 | 0.1-0.5: 0 | 0.5-1: 0 | 1-2: 0.431 | 2-5: 2.5 | 5-10: 6.28 | 10-15: 10.9 | 15-20: 13.3 | 20-50: 30.1 | 50-80: 81.5 | 80-90: 86 | 90-95: 93.7 | 95-98: 97.6 | 98-99: 100 | 99-100: 100 |

Times Surprised (percentage):
            .................Predicted Probability...................
  State     < 1%            < 10%             > 90%            > 99%
  -----     ----            -----             -----            -----
  good      0.00 (0/312)    0.00 (0/614)      6.86 (14/204)    0.00 (0/0)
  bad       0.00 (0/225)    1.98 (13/657)     0.00 (0/69)      0.00 (0/0)
  very_bad  0.00 (0/216)    3.32 (12/361)     0.25 (1/399)     0.00 (0/31)
  Total     0.00 (0/753)    1.53 (25/1632)    2.23 (15/672)    0.00 (0/31)
Analysis of Sample Report
Confusion Matrix. The possible states of Spark Quality are good, bad and very_bad. For each case processed, Netica generated beliefs for each of these states. The most likely state (i.e. the one with the highest belief) was chosen as its prediction for the value of Spark Quality. This was then compared with the true value of Spark Quality for that case, provided the case file could supply it. The confusion matrix gives the total number of cases in each of the 9 situations: (Predicted=good, Actual=good), (Predicted=bad, Actual=good), etc. If the network is performing well, the entries along the main diagonal will be large compared to those off it.
Error Rate. In our sample report, the error rate is 6.325%. This means that in 6.325% of the cases for which the case file supplied a Spark Quality value, the network predicted the wrong value, where the prediction was taken as the state with highest belief (same as for the confusion matrix).
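As a quick check, the error rate can be recomputed from the confusion matrix of the sample report. The following is a minimal Python sketch, not Netica code; the function names are made up for this illustration, and ties between equally likely states are broken in favor of the first state, which is an assumption.

```python
# Confusion counts copied from the sample report for SpkQual.
# Rows are actual states, columns are predicted states.
states = ["good", "bad", "very_bad"]
confusion = [
    [253, 0, 0],    # actual good
    [22, 176, 4],   # actual bad
    [13, 19, 430],  # actual very_bad
]


def most_likely_state(beliefs):
    """Index of the state with the highest belief (ties go to the first)."""
    return max(range(len(beliefs)), key=lambda i: beliefs[i])


def error_rate(matrix):
    """Fraction of cases whose predicted state differs from the actual state."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


rate = error_rate(confusion)
print(f"Error rate = {100 * rate:.3f}%")  # 58 wrong out of 917 cases
```

The 58 off-diagonal cases out of 917 give the 6.325% shown in the report.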
Scoring Rule Results. These scores don't just take the most likely state as a prediction, but rather consider the actual belief levels of the states in determining how well they agree with the value in the case file. These results are calculated in the standard way for scoring rules. For more information see any reference on scoring rules, such as Morgan&Henrion90 or Pearl78.
Logarithmic and Quadratic loss, and Spherical Payoff. The logarithmic loss values were calculated using the natural log, and are between 0 and infinity inclusive, with zero indicating the best performance. Quadratic loss (also known as the Brier score) is between 0 and 2, with 0 being best, and spherical payoff is between 0 and 1, with 1 being best.
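Their respective equations are:
Logarithmic loss = MOAC [ -log (Pc) ]
Quadratic loss = MOAC [ 1 - 2 Pc + sum[j=1 to n] (Pj ^ 2) ]
Spherical payoff = MOAC [ Pc / sqrt (sum[j=1 to n] (Pj ^ 2)) ]
where Pc is the probability predicted for the correct state, Pj is the probability predicted for state j, n is the number of states, and MOAC stands for the mean (average) over all cases (i.e. all cases for which the case file provides a value for the node in question).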
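To make the three scores concrete, here is a small Python sketch that computes them for a list of cases. It assumes the standard forms of these scoring rules (negative natural log of Pc; 1 - 2 Pc + sum of Pj squared; Pc divided by the square root of the sum of Pj squared), each averaged over all cases. The function name and the input layout are invented for this illustration and are not part of Netica.

```python
import math


def scoring_rules(cases):
    """cases: list of (beliefs, correct_index) pairs, one per case.

    beliefs is the list of predicted probabilities Pj over the n states,
    and correct_index selects Pc, the belief in the state that occurred.
    Returns (logarithmic loss, quadratic loss, spherical payoff),
    each as a mean over all cases (MOAC).
    """
    log_sum = quad_sum = sph_sum = 0.0
    for beliefs, c in cases:
        pc = beliefs[c]
        sum_sq = sum(p * p for p in beliefs)
        # Logarithmic loss is infinite if the true state was given belief 0.
        log_sum += -math.log(pc) if pc > 0 else math.inf
        quad_sum += 1 - 2 * pc + sum_sq
        sph_sum += pc / math.sqrt(sum_sq)
    n_cases = len(cases)
    return log_sum / n_cases, quad_sum / n_cases, sph_sum / n_cases


# A certain and correct prediction gets the best score on all three rules.
perfect = scoring_rules([([1.0, 0.0, 0.0], 0)])

# A 50/50 belief on a binary node.
coin = scoring_rules([([0.5, 0.5], 0)])
```

Note how the ranges in the text fall out of the formulas: a belief of 1 in the wrong state of a binary node gives a quadratic loss of 1 - 0 + 1 = 2, the worst possible value.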
Calibration. This indicates whether the confidence expressed by the network is appropriate (i.e. "well calibrated"). For instance, if the network were forecasting the weather, you might want to know: Of all the times it said 30% chance of rain, what percentage of times did it rain? If there were lots of cases, the answer should be close to 30%. For each state of the node there are a number of items separated by vertical bars (|). Each item consists of a probability percentage range R, followed by a colon (:) and then a single percentage X. It means that of all the times the belief for that state was within the range R, the true value was that state X percent of the time. For example:
rain 0-10: 8.5 |
means that of all the times the belief for rain was between 0 and 10%, it rained 8.5% of those times. The probability ranges are uneven, and differ from state to state and from run to run, because they are chosen so that the X percentages are reasonably accurate. The bin sizes have to adapt, or too few cases might fall in a bin. The more cases you process, the finer the probability ranges will be. Calibration results are often drawn as a graph (known as a "calibration curve") where ideal calibration is a straight diagonal line. For more information, see a text which discusses probability calibration, for example Morgan&Henrion90, p.110, referenced above.
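The calculation behind one calibration row can be sketched in Python as follows. This is a simplified illustration, not Netica's algorithm: it uses fixed bin edges supplied by the caller, whereas Netica chooses the ranges adaptively, and the function name and input layout are assumptions made for this example.

```python
def calibration(beliefs, occurred, edges):
    """Calibration row for a single state.

    beliefs:  belief (0..1) in the state, one entry per case.
    occurred: True where the state was the actual value in that case.
    edges:    ascending bin edges in percent, e.g. [0, 10, 90, 100].

    Returns a list of (low, high, percent_true, count), one per bin;
    percent_true is None for a bin that received no cases.
    """
    bins = [[0, 0] for _ in range(len(edges) - 1)]  # [hits, count] per bin
    for belief, hit in zip(beliefs, occurred):
        pct = 100 * belief
        for i in range(len(bins)):
            is_last = i == len(bins) - 1
            # Bins are half-open, except the last one, which includes 100%.
            if edges[i] <= pct < edges[i + 1] or (is_last and pct == edges[-1]):
                bins[i][1] += 1
                bins[i][0] += int(hit)
                break
    return [
        (edges[i], edges[i + 1], 100 * hits / count if count else None, count)
        for i, (hits, count) in enumerate(bins)
    ]


# Seven made-up cases for a state such as "rain".
beliefs = [0.05, 0.02, 0.08, 0.07, 0.95, 0.97, 0.60]
occurred = [False, False, True, False, True, True, False]
rows = calibration(beliefs, occurred, [0, 10, 90, 100])
for low, high, pct_true, count in rows:
    print(f"{low}-{high}: {pct_true} ({count} cases)")
```

With so few cases the percentages are very noisy, which is why the bins in a real report widen wherever cases are sparse.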
Times Surprised Table. This indicates how often the network was quite confident in its beliefs, but was wrong. There are columns for being 90% confident and 99% confident (i.e. beliefs are greater than 90% or 99% respectively), and also for being 90% and 99% confident that the value of the node will _not_ be a certain state (i.e. beliefs are less than 10% or 1% respectively). The ratios indicate the number of times it was wrong out of the number of times it made such a confident prediction, and a percentage is also printed. If the network is performing well these percentages will be low, but keep in mind that it is very reasonable to be wrong with a particular 10% or 90% prediction 10% of the time, and to be wrong with a particular 1% or 99% prediction 1% of the time. If the network rarely makes strong predictions (i.e. beliefs are rarely close to 0 or 1), then most of these ratios will be 0/0.
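The counting behind one row of this table can be sketched in Python as below. This is an illustration under assumptions: the function name and input layout are invented here, and the comparisons are taken as strict (belief < 1%, belief > 90%, and so on), which the report headings suggest but do not confirm.

```python
def times_surprised(beliefs, occurred):
    """Times Surprised row for a single state.

    beliefs:  belief (0..1) in the state, one entry per case.
    occurred: True where the state was the actual value in that case.

    Returns {column: (times_wrong, times_predicted)}.
    """
    columns = {
        # (test for a confident prediction, outcome that counts as a surprise)
        "< 1%": (lambda b: b < 0.01, True),    # surprised if the state occurred
        "< 10%": (lambda b: b < 0.10, True),
        "> 90%": (lambda b: b > 0.90, False),  # surprised if it did not occur
        "> 99%": (lambda b: b > 0.99, False),
    }
    result = {}
    for name, (is_confident, surprising_outcome) in columns.items():
        picked = [hit for b, hit in zip(beliefs, occurred) if is_confident(b)]
        wrong = sum(1 for hit in picked if hit == surprising_outcome)
        result[name] = (wrong, len(picked))
    return result


# Six made-up cases for one state.
beliefs = [0.005, 0.05, 0.05, 0.95, 0.95, 0.995]
occurred = [False, True, False, True, False, True]
table = times_surprised(beliefs, occurred)
for name, (wrong, made) in table.items():
    print(f"{name}: {wrong}/{made}")
```

Because every belief below 1% is also below 10%, the "< 10%" count always includes the "< 1%" cases, which matches the sample report (for example 0/312 and 0/614 for the state good).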
Quality of Test (binary nodes) or Test Sensitivity (nodes with more than 2 states). These reports are useful when the output of the network is going to be used to decide an action, with one action corresponding to each state of the node. As a medical example, the node may be "Disease-A" and have the two states "Present" and "Absent". If, after updating for a case, the network reports "Present", then a particular treatment will be started, but if it reports "Absent" then the treatment won't be started. The question is, at what probability for "Present" should we say that the network is reporting Present?
The confusion matrix and error rate discussed above were determined using the most likely state (i.e. the one with highest belief after updating). For a binary variable, this means choosing the first state only if its belief is higher than 50%. But if each state has a different cost of misclassification, you may not want the cutoff probability to be 50%. In the medical example, it may be disastrous to not treat a patient who has the disease, but not that serious if the patient is treated unnecessarily. So you would like the network to report "Present" if the probability of the disease is above some small number, like 2%. It is a matter of trading off the rate of false positives against the rate of false negatives.
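The trade-off can be seen by counting outcomes at two different cutoffs. The Python sketch below uses made-up beliefs for "Present" and is only an illustration; the function name, the returned fields, and the data are all assumptions, not part of Netica.

```python
def rates_at_cutoff(beliefs_present, actually_present, cutoff):
    """Report "Present" whenever the belief exceeds the cutoff, then count errors.

    Returns a dict with the false positive rate (unneeded treatments among
    healthy patients) and the false negative rate (missed cases among sick
    patients). A rate is None if its denominator is zero.
    """
    tp = fp = tn = fn = 0
    for belief, present in zip(beliefs_present, actually_present):
        reported = belief > cutoff
        if reported and present:
            tp += 1
        elif reported and not present:
            fp += 1
        elif not reported and present:
            fn += 1
        else:
            tn += 1
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else None,
        "false_negative_rate": fn / (fn + tp) if fn + tp else None,
    }


# Six made-up patients: belief in "Present" and whether the disease was present.
beliefs = [0.01, 0.03, 0.30, 0.60, 0.90, 0.04]
present = [False, False, True, True, True, False]

at_50 = rates_at_cutoff(beliefs, present, 0.50)  # misses the 0.30 patient
at_02 = rates_at_cutoff(beliefs, present, 0.02)  # misses nobody, treats more
print(at_50)
print(at_02)
```

Lowering the cutoff from 50% to 2% removes the missed case here, at the price of treating two of the three healthy patients.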
Ideally you would just convert the network to a decision network, by adding a decision node for the action to be taken and a utility node for the cost of misclassification. However, at the time the network is constructed and being graded as to its usefulness, the utilities may not be known.
The "Quality of Test" section has performance results for a series of cutoff threshold probabilities (which run vertically in the first column). For each case, the beliefs given by the network are converted to a "prediction". The prediction is "first state" if the belief for the first state is higher than the cutoff probability, and "second state" if it's lower. You may want to change the order of the states, so that the first state is the "positive" one, to better match conventional meanings. The meanings of the columns are:
Often this data is summarized with a graph called the ROC (receiver operating characteristic) curve. To use Excel (available from Microsoft) to create the ROC curve from this data, select the whole table (except headings) and while holding down the [...]
If the node has more than 2 states, instead you will get a "Test Sensitivity" section. The first number of each "column" is the cutoff threshold probability. The second number of each column is the number of cases whose actual value was the state given at the left hand side of the row, and which the network correctly predicted to be in that state (i.e. its belief was greater than cutoff probability), divided by the total number of cases whose actual value was that state.
It may seem awkward that the cutoff probability changes in strange sized jumps. The reason is that Netica only reports on values for which it was able to gather enough data. So running the test using a greater number of cases generally results in finer divisions of the cutoff column.
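The second number of each "Test Sensitivity" column can be sketched in Python as follows. This is an illustration only: the function name and inputs are invented, and the cutoffs are passed in by the caller, whereas Netica picks the cutoffs it has enough data to report on.

```python
def sensitivity_by_cutoff(belief_rows, actual, state, cutoffs):
    """Fraction of cases whose actual value is `state` and whose belief in
    `state` exceeds each cutoff, out of all cases whose actual value is `state`.

    belief_rows: one list of beliefs per case (one belief per state).
    actual:      index of the true state for each case.
    Returns {cutoff: fraction}, or {cutoff: None} if the state never occurred.
    """
    own_beliefs = [row[state] for row, a in zip(belief_rows, actual) if a == state]
    result = {}
    for cutoff in cutoffs:
        if not own_beliefs:
            result[cutoff] = None
            continue
        detected = sum(1 for b in own_beliefs if b > cutoff)
        result[cutoff] = detected / len(own_beliefs)
    return result


# Four made-up cases for a three-state node; three of them are truly state 0.
belief_rows = [
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],
    [0.20, 0.20, 0.60],
    [0.90, 0.05, 0.05],
]
actual = [0, 0, 2, 0]
row_for_state_0 = sensitivity_by_cutoff(belief_rows, actual, 0, [0.3, 0.5, 0.8])
print(row_for_state_0)
```

As the cutoff rises, fewer of the cases that truly are in the state clear it, so the fraction can only stay the same or fall.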
NOTES on Net Testing
More complete documentation will be available on Net Testing at a later date. If you have any questions, suggestions, or requests, or if you find any problems with the software as it exists, please email Norsys.