|D. Advanced Topics|
|Return to Tutorial Home|
1. Missing Data and Hidden Variables
EM and Gradient Descent learning are two techniques for learning totally "hidden" (also known as "latent") variables, that is, variables for which you have no observations, but you suspect exist and can be useful for modeling your world.
An example of such a situation is when you conjecture that a number of nodes may have a common unobservable cause, such as a disease which is not directly observable, giving rise to a number of symptoms, as in:
In this case, we have a number of symptoms that we observe in our patients, and which appear to be correlated. So we guess that there might be a disease that explains the symptoms. Not all patients with the disease will necessarily express all the symptoms. We are basically guessing that there is an underlying disease. There could in fact be more than one disease, or some other reasons could be explaining the correlations (an evil doctor could be injecting his patients with drugs whose side-effects mimic diseases).
How many states to choose for the hidden variable? When you build the net with Netica, you add a node for the latent variable, and provide it with the number of states you think it should have. You may have to repeat this whole process, with a different number of states each time you do a learning run, to see what best matches the data. It is better to err on the side of having too few states, since too many states will be difficult for Netica to learn, and for you to interpret. Keep in mind that the node Netica learns will probably have the states in a different order than you expect, since Netica has no knowledge that would help it to decide an order for the states.
You may have guessed from the above paragraph that learning the optimal properties of a hypothetical hidden variable is not a simple matter of pressing a single button and letting Netica do all the work for you. Netica will help, but you will need to assist Netica by steering towards the best overall solution.
Note #1. EM and Gradient Descent learning are only available with versions 2.17 and later of Netica.
Note #2. If there are no hidden variables in your net, or significant amounts of missing data in you case files, you should not use EM or Gradient Descent learning. The results would likely be similar to those for regular learning, and they would take longer to compute.
1.1 EM Learning
In this exercise we demonstrate how to go about learning a latent (or "hidden") variable using EM learning.
Let us suppose that the "real" world is represented by the following Bayes net:
By this we mean, that the real world we want to model is governed by the conditional probability distributions inherent in the above net. Download this net from here and load it into Netica.
In what follows we will try to learn a second net, Net #2, using EM learning, and see if the resultant net bears any similarity to the original net above. If it does, we can then say we have done a good job of learning the structure of the "real" world.
The following case file abides by the statistics of our "real" world. Note that it is missing any reference to A, which is an unobservable node in our real world.
// 100K cases following the distribution of the "Learn_Latent.dne" Bayes net NumCases R S T 22912 false false false 4096 false false true 4688 false true false 1104 false true true 35808 true false false 7582 true false true 2 true false * 14592 true true false 9216 true true true
The EM learning task is to use the above case file of observables to see if perhaps some fourth variable, say A, can explain R, S, and T.
You can cut and paste the above text into a file, Learn_Latent_Data.cas. Or, if you prefer, you can generate a similar one yourself from the above net, by selecting nodes R, S, T, and then choosing "Cases->Simulate Cases". Then save the results as the file Learn_Latent_Data.cas. Such a file will have 100,000 lines in it, but the frequency of the eight possible cases should be statistically similar to the above.
The next step is to build the basic structure of our 2nd net.
With Netica, we construct a new empty net, and give it four nodes, three for the variables in our database of observables, and one as our hypothetical cause for the other three:
So, to review what we are up to: our conjecture is that there is a common cause, A, for the observations R, S and T, and we want to determine how R, S and T probabilistically depend on A.
You could build the above new net from scratch, or just delete the CPTs of the "Learn_Latent" net, if you want to save time.
We are now ready to do some EM learning.
Choose "Cases->Learn Using EM" and select the "Learn Latent Data.cas" file from the dialog box, and accept the default degree of 1. Netica will open the Messages window and proceed to do many steps of the EM algorithm to learn the new CPTs. For each step of EM, it prints a line showing the iteration number, the "log likelihood", and the percentage change in log likelihood from the previous iteration. The "log likelihood" is the per case average of the negative of the logarithm of the probability of the case given the current Bayes net (structure + CPTs).
Normally you will only do a few steps of EM algorithm, and then stop the process by holding down the left mouse button and the <ctrl> and <alt> keys for a little while. However, in our example the program runs to completion quite quickly.
You can now compare the learned net with the original (which represents
the distribution in the real world). For instance, select node R of
the original net and choose
A Note for Students of Learning Theory:
The EM algorithm searches over Bayes net CPTs in an attempt to maximize the probability of the data given the Bayes net (i.e. minimize negative log likelihood). This can also be done using a gradient descent algorithm, described next.
Gradient Descent works similar to EM learning only uses a very different algorithm internally (one that is similar to Neural Net back propagation). After you have explored EM learning, you may want to try Gradient Descent learning to see if it works better for you. To run it, simply choose "Cases->Learn Using Gradient".
A Note for Students of Learning Theory:
The Gradient Descent algorithm used in Netica is a "conjugate gradient descent" to maximize the probability of the data, given the net, by adjusting the CP Table entries. Generally speaking this algorithm converges faster than EM learning, but may be more susceptible to local maxima.
At Norsys, we would appreciate any comments you have on our advanced learning algorithms, and reports of particular successes or failures you have had on your data sets. (email email@example.com).
Copyright © 1995-2022 Norsys Software Corp.
Norsys and Netica are trademarks of Norsys Software Corp.