Learning Algorithms

There are three main types of algorithms that Netica can use to learn CPTs: counting, expectation-maximization (EM) and gradient descent.  Of the three, “counting” is by far the fastest and simplest, and should be used whenever possible.  It can be used whenever there are no latent variables, and there is little missing data and few uncertain findings for the learning nodes or their parents.  When learning the CPT of a node by counting, Netica uses only those cases which supply a definite value for the node and all of its parents.
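For example, the following sketch (in Python, for illustration only — it is not Netica’s implementation, and the function name and case format are assumptions) shows the idea behind counting: cases that lack a definite value for the node or any of its parents are skipped, and each remaining case adds one count to the appropriate CPT cell.

    # A minimal sketch of CPT learning by counting (illustration only, not Netica's implementation).
    # Each case is a dict mapping node names to values; a missing value is None or an absent key.
    from collections import defaultdict

    def learn_cpt_by_counting(cases, child, parents, child_states):
        counts = defaultdict(lambda: defaultdict(float))
        for case in cases:
            # Skip cases that do not supply a definite value for the node and all of its parents.
            if case.get(child) is None or any(case.get(p) is None for p in parents):
                continue
            parent_config = tuple(case[p] for p in parents)
            counts[parent_config][case[child]] += 1.0
        # Normalize the counts for each parent configuration into a CPT row.
        cpt = {}
        for parent_config, child_counts in counts.items():
            total = sum(child_counts.values())
            cpt[parent_config] = {s: child_counts.get(s, 0.0) / total for s in child_states}
        return cpt

    # Hypothetical usage: learn P(Wet | Rain, Sprinkler) from complete cases only.
    cases = [
        {"Rain": "yes", "Sprinkler": "no", "Wet": "yes"},
        {"Rain": "no",  "Sprinkler": "no", "Wet": "no"},
        {"Rain": "no",  "Sprinkler": None, "Wet": "yes"},   # skipped: parent value missing
    ]
    print(learn_cpt_by_counting(cases, "Wet", ["Rain", "Sprinkler"], ["yes", "no"]))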

If you can’t use counting, then you must use EM learning or gradient descent.  For each application area, it is usually best to try both to see which gives the better results.  Generally speaking, EM learning is more robust (i.e. gives good results in a wide variety of situations), but sometimes gradient descent is faster.  For all three algorithms, the order of the cases doesn’t matter.

During Bayes net learning, we are trying to find the maximum likelihood Bayes net, which is the net that is the most likely given the data.  If N is the net and D is the data, we are looking for the N which gives the highest P(N|D).  Using Bayes rule, P(N|D) = P(D|N) P(N) / P(D).  Since P(D) will be the same for all the candidate nets, we are trying to maximize P(D|N) P(N), which is the same as maximizing its logarithm: log(P(D|N)) + log(P(N)).  Below we consider each of the two terms of this equation.  The more data you have, the more important the first term will be compared to the second.
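As an illustration of this scoring, the sketch below (with hypothetical stand-in functions log_likelihood and log_prior for log(P(D|N)) and log(P(N)); not Netica code) scores each candidate net and picks the highest-scoring one, dropping P(D) since it is the same for every candidate.

    # A minimal sketch of scoring candidate nets (illustration only; log_likelihood and
    # log_prior are hypothetical stand-ins for log(P(D|N)) and log(P(N))).
    def net_score(net, data, log_likelihood, log_prior):
        # P(D) is the same for every candidate, so maximizing P(N|D) is the same as
        # maximizing log(P(D|N)) + log(P(N)).
        return log_likelihood(net, data) + log_prior(net)

    def best_net(candidates, data, log_likelihood, log_prior):
        return max(candidates, key=lambda net: net_score(net, data, log_likelihood, log_prior))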

There are different approaches to dealing with the second term log(P(N)), which is the prior probability of each net (i.e. how likely you think each net is before seeing any data).  One approach is to say that each net is equally likely, in which case the term can simply be ignored, since it will contribute the same amount for each candidate net.  Another is to penalize complex nets by saying they are less likely (which is of more value when doing structure learning).  Netica bases the prior probability of each net on the experience and probability tables that exist in the net before learning starts, which appears to be a unique and elegant approach.  If the net has not been given any such tables, then Netica considers all candidate nets equally likely before seeing any data.

The first term log(P(D|N)) is known as the net’s log likelihood.  If the data D consists of the n independent cases d1, d2, …, dn, then the log likelihood is: log(P(D|N)) = log(P(d1|N) P(d2|N) … P(dn|N)) = log(P(d1|N)) + log(P(d2|N)) + … + log(P(dn|N)).  Each of the log(P(di|N)) terms is easy to calculate, since the case is simply entered into the net as findings, and Netica’s regular inference is used to determine the probability of those findings.
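In code form, the decomposition looks like the following sketch (prob_of_case is a hypothetical stand-in for regular Bayes net inference returning the probability of a case’s findings under the net):

    import math

    # A minimal sketch of the decomposition (illustration only; prob_of_case is a hypothetical
    # stand-in for regular inference returning P(di|N), the probability of case di's findings).
    def log_likelihood(net, cases, prob_of_case):
        # log(P(D|N)) = log(P(d1|N)) + log(P(d2|N)) + ... + log(P(dn|N))
        return sum(math.log(prob_of_case(net, case)) for case in cases)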

Both EM and gradient descent learning work by an iterative process, in which Netica starts with a candidate net, reports its log likelihood, then processes the entire case set with it to find a better net.  By the nature of each algorithm, the log likelihood of the new net is always as good as or better than that of the previous one.  That process is repeated until the log likelihood numbers are no longer improving enough (according to a tolerance that you can specify), or the desired number of iterations has been reached (also a quantity you can specify).  Netica uses conjugate gradient descent, which performs much better than simple gradient descent.
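The shared iterative structure can be sketched as follows (illustration only; improve_net and log_likelihood are hypothetical stand-ins for one EM or conjugate gradient pass over the whole case set and for the log likelihood calculation described above, not Netica’s implementation):

    # A minimal sketch of the iterative loop shared by EM and gradient descent learning.
    def learn_iteratively(net, cases, improve_net, log_likelihood,
                          tolerance=1e-4, max_iterations=100):
        prev_ll = log_likelihood(net, cases)
        for _ in range(max_iterations):
            net = improve_net(net, cases)       # one full pass over the case set
            ll = log_likelihood(net, cases)     # never worse than prev_ll, by construction
            if ll - prev_ll < tolerance:        # stop when the improvement is too small
                break
            prev_ll = ll
        return net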

References:  To understand how each algorithm works, it is best to consult a reference, such as Korb&Nicholson04, Russell&Norvig95 or Neapolitan04.  Briefly, EM learning repeatedly takes a Bayes net and uses it to find a better one by doing an expectation (E) step followed by a maximization (M) step.  In the E step, it uses regular Bayes net inference with the existing Bayes net to compute the expected value of all the missing data, and then the M step finds the maximum likelihood Bayes net given the now extended data (i.e. original data plus expected value of missing data).  Gradient descent learning searches the space of Bayes net parameters, using the negative log likelihood as an objective function to be minimized.  Given a Bayes net, it finds a better one by using Bayes net inference to calculate the gradient, which indicates how to change the parameters (i.e. CPTs) for the steepest improvement.  Actually, it uses a much more efficient approach than always taking the steepest path, by taking into account its previous path, which is why it’s called conjugate gradient descent.  Both algorithms can get stuck in local minima, but in practice they do quite well, especially the EM algorithm.
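As a concrete illustration of the E and M steps, the following sketch performs one EM iteration for a tiny two-node net X -> Y in which X is sometimes missing (the parameterization px = P(X=1), py[x] = P(Y=1|X=x) and the (x, y) case format are assumptions made for illustration, not Netica’s implementation):

    # A minimal sketch of one EM iteration for a tiny net X -> Y in which X is sometimes missing.
    def em_step(cases, px, py):
        # E step: for cases with X missing, use inference P(X=1|Y=y) under the current
        # parameters to compute the expected value of the missing data.
        x_count = [0.0, 0.0]      # expected number of cases with X=0 and with X=1
        y1_given_x = [0.0, 0.0]   # expected number of cases with Y=1, split by X
        for x, y in cases:
            if x is None:
                num = px * (py[1] if y == 1 else 1.0 - py[1])
                den = num + (1.0 - px) * (py[0] if y == 1 else 1.0 - py[0])
                w1 = num / den    # P(X=1|Y=y) by Bayes rule
            else:
                w1 = float(x)     # X observed: the weight is simply 0 or 1
            x_count[0] += 1.0 - w1
            x_count[1] += w1
            if y == 1:
                y1_given_x[0] += 1.0 - w1
                y1_given_x[1] += w1
        # M step: maximum likelihood estimates from the (expected) counts.
        n = x_count[0] + x_count[1]
        new_px = x_count[1] / n
        new_py = [y1_given_x[0] / x_count[0], y1_given_x[1] / x_count[1]]
        return new_px, new_py

Repeating such a step within the iterative loop sketched earlier produces parameters whose log likelihood never decreases.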

Most neural network learning algorithms (such as back propagation and its improvements) are gradient descent algorithms.  That invites a comparison between Bayes net learning and neural net learning, with latent variables corresponding to hidden neurons.  In the case of Bayes net learning, there are generally fewer hidden nodes, the learned relationships between the nodes are generally more complex, the result of the learning has a direct physical interpretation (by probability theory) rather than just being black-box weights, and the result is more modular (parts can be separated off and combined with other learned structures).