How to use the Bayes Net Toolbox 2.0

Please note that version 2.0 is not backwards compatible with versions released before 3/29/00.

Getting started: Installation
Your first Bayes net
A hybrid Bayes net
Hierarchical mixtures of experts
PCA, ICA, and all that
QMR
Variable elimination
A summary of the different inference engines (static)
DBNs, HMMs, Kalman filters and all that

Getting started: Installation

Download the BNT.tar file.
Unpack the file. In Unix, type "tar xvf BNT.tar" or "unzip BNT.zip". In Windows, use a program like Winzip. This will create a directory called BNT.
Edit the file "BNT/add_BNT_to_path.m" so it contains the correct pathname.
Move to the BNT directory.
Start up Matlab.
Type "ver". You need Matlab version 5.2 or newer to run BNT.
Type "add_BNT_to_path" to extend the matlab path variable.
Type "test_BNT". If all goes well, this will produce a bunch of numbers (which you can ignore), and no error messages.

If you are new to Matlab, you might like to check out some useful Matlab tips.

Creating your first Bayes net

Representation

Consider the following example (adapted from Russell and Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall, 1995, p454.)

To specify this directed acyclic graph (dag), we create an adjacency matrix:

N = 4; 
dag = zeros(N,N);
C = 1; S = 2; R = 3; W = 4;
dag(C,[R S]) = 1;
dag(R,W) = 1;
dag(S,W)=1;

We have numbered the nodes as follows: Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. The nodes must always be numbered in topological order, i.e., ancestors before descendants. For a more complicated graph, this is a little inconvenient: we will see how to get around this below.

In addition to specifying the graph structure, we must specify the size and type of each node. If a node is discrete, its size is the number of possible values each node can take on; if a node is continuous, it can be a vector, and its size is the length of this vector. In this case, we will assume all nodes are discrete and binary.

discrete_nodes = 1:N;
node_sizes = 2*ones(1,N);

We are now ready to make the Bayes net

bnet = mk_bnet(dag, node_sizes, discrete_nodes);

There are several optional arguments to the 'mk_bnet' function, some of which we will see later. In general, to find out more about a function, please see its documentation string by typing

help mk_bnet

Inference

Having created the BN, we can now use it for inference. There are many different algorithms for doing inference in Bayes nets, that make different tradeoffs between speed, complexity, generality, and accuracy. BNT therefore offers a variety of different inference "engines". We will discuss these in more detail below. For now, we will use the junction tree engine, which is the mother of all exact inference algorithms. This can be created as follows.

engine = jtree_inf_engine(bnet);

The other engines have similar constructors, but might take additional, algorithm-specific parameters. All engines are used in the same way, once they have been created. We illustrate this in the following sections.

Suppose we want to compute the probability that the sprinker was on given that the grass is wet. The evidence consists of the fact that W=2. All the other nodes are hidden (unobserved). We can specify this as follows.

evidence = cell(1,N);
evidence{W} = 2;

We use a 1D cell array instead of a vector to cope with the fact that nodes can be vectors of different lengths. In addition, the value [] can be used to denote 'no evidence', instead of having to specify the observation pattern as a separate argument.

We are now ready to add the evidence to the engine.

[engine, loglik] = enter_evidence(engine, evidence);

The behavior of this function is algorithm-specific, and is discussed in more detail below. In the case of the jtree engine, enter_evidence implements a two-pass message-passing scheme. The first return argument contains the modified engine, which incorporates the evidence. The second return argument contains the log-likelihood of the evidence. (Not all engines are capable of computing the log-likelihood.)

Finally, we can compute p=P(S=2|W=2) as follows.

marg = marginal_nodes(engine, S);
p = marg.T(2);

We find that p = 0.4298.

Now let us add the evidence that it was raining, and see what difference it makes.

evidence{R} = 2;
[engine, loglik] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, S);
p = marg.T(2);

We find that p = P(S=2|W=2,R=2) = 0.1945, which is lower than before, because the sprinkler's being on can ``explain away'' the fact that the grass is wet.

Computing joint distributions

We can compute the joint probability on a set of nodes as in the following example.

evidence = cell(1,N);
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W]);
>> m =
    domain: [2 3 4]
         T: [2x2x2 double]
>> m.T
ans(:,:,1) =
    0.2900    0.0410
    0.0210    0.0009
ans(:,:,2) =
         0    0.3690
    0.1890    0.0891

m is a structure. The 'T' field is a multi-dimensional array (in this case, 3-dimensional) that contains the joint probability distribution on the specified nodes. We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass to be wet if both the rain and sprinkler are off.

Let us now add some evidence to R.

evidence{R} = 2;
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W]);
m = 
    domain: [2 3 4]
         T: [2x1x2 double]
>> m.T
m.T
ans(:,:,1) =
    0.0820
    0.0018
ans(:,:,2) =
    0.7380
    0.1782

We see The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible with the evidence that R=2. Instead of creating large tables with many 0s, BNT sets the effective size of observed (discrete) nodes to 1. This is why m.T has size 2x1x2. We can convert this back to a 2x2x2 table as follows.

fullm = add_ev_to_dmarginal(m, evidence, ns)

Note: It is not always possible to compute the joint on arbitrary sets of nodes: it depends on the engine, as discussed in more detail below.

Learning

Let's generate some data from this network, randomize the parameters, and then try to recover the original model. First we generate training cases.

nsamples = 50;
samples = sample_bnet(bnet, nsamples);

% hide 50% of the nodes
samples2 = samples;
hide = rand(N, nsamples) > 0.5;
[I,J]=find(hide);
for k=1:length(I)
  samples2{I(k), J(k)} = [];
end

Each column of 'samples' is a training case. Now we create a network with random parameters.

% Make a tabular rasa
bnet2 = mk_bnet(dag, ns);
seed = 0;
rand('state', seed);
bnet2.CPD{C} = tabular_CPD(bnet2, C);
bnet2.CPD{R} = tabular_CPD(bnet2, R);
bnet2.CPD{S} = tabular_CPD(bnet2, S);
bnet2.CPD{W} = tabular_CPD(bnet2, W);
engine2 = jtree_inf_engine(bnet2);

Finally, we find the maximum likelihood estimates of the parameters using EM.

max_iter = 10; 
[engine3, LL] = learn_params(engine2, samples2, max_iter);

LL(i) is the log-likelihood at iteration i. We can plot this as follows:

plot(LL, 'x-')

To view the learned parameters, we need to extract the bnet structure from the engine object, and then extract the parameters from each CPD object. I have not created a method for the latter, so we just use a little Matlab hackery.

bnet3 = bnet_from_engine(engine3);
CPT3 = cell(1,N);
for i=1:N
  s=struct(bnet3.CPD{i});  % violate object privacy
  CPT3{i}=s.CPT;
end

Let's display the results after 10 iterations of EM.

celldisp(CPT3)
CPT3{1} =
    0.6783
    0.3217
CPT3{2} =
    0.7796    0.2204
    0.0531    0.9469
CPT3{3} =
    0.6719    0.3281
    0.8566    0.1434
CPT3{4} =
(:,:,1) =
    0.9762    0.5017
    0.0931    0.0058
(:,:,2) =
    0.0238    0.4983
    0.9069    0.9942

So we see that the learned parameters are fairly close to the original ones, which we display below for completeness.

CPT{1} =
    0.5000
    0.5000
CPT{2} =
    0.5000    0.5000
    0.9000    0.1000
CPT{3} =
    0.8000    0.2000
    0.2000    0.8000
CPT{4} =
(:,:,1) =
    1.0000    0.1000
    0.1000    0.0100
(:,:,2) =
         0    0.9000
    0.9000    0.9900

We can get improved performance by (1) increasing the size of the training set, (2) decreasing the amount of hidden data, (3) running EM for longer, and (4) using informative priors. As an example of the latter, let us specify a Dirichlet prior (pseudo counts) for each node which is equal to the true CPT multiplied by an effective sample size (prior). (If prior = 0, we are doing maximum likelihood estimation.) The [] argument in the following means 'generate random parameters'.

bnet2 = mk_bnet(dag, ns);
seed = 0;
rand('state', seed);
prior  = nsamples;
bnet2.CPD{C} = tabular_CPD(bnet2, C, [], prior*CPT{C});
bnet2.CPD{R} = tabular_CPD(bnet2, R, [], prior*CPT{R});
bnet2.CPD{S} = tabular_CPD(bnet2, S, [], prior*CPT{S});
bnet2.CPD{W} = tabular_CPD(bnet2, W, [], prior*CPT{W});

In this case, after 4 iterations of EM, we get

CPT3{1} =
    0.5564
    0.4436
CPT3{2} =
    0.8076    0.1924
    0.1950    0.8050
CPT3{3} =
    0.5428    0.4572
    0.9021    0.0979
CPT3{4} =
(:,:,1) =
    0.9998    0.1316
    0.0985    0.0099
(:,:,2) =
    0.0002    0.8684
    0.9015    0.9901

Note that by using a prior, EM is no longer guaranteed to increase the likelihood at each iteration; instead, it will be increasing the height of the posterior mode.

To add a uniform (uninformative) Dirichlet prior for regularization purposes, use something like this:

bnet2.CPD{W} = tabular_CPD(bnet2, W, [], prior*normalise(myones(ns[R S W])));

This will avoid estimating a parameter as 0 just because there is insufficient training data. I recommend the following tutorials for more details on this kind of thing:

"A tutorial on learning with Bayesian networks", D. Heckerman, Microsoft Research Tech Report, 1995.
"Operations for Learning with Graphical Models", W. L. Buntine, JAIR'94, 159--225.

A hybrid Bayes Net

Let us now create the following Bayes net, which is a model of waste emissions from an incinerator plant. Square nodes are discrete (tabular), and round nodes are Gaussian. (Bayes nets which contain discrete and continuous random variables are sometimes called "hybrid".) Shaded nodes will be observed. (This example is from the paper "Propagation of Probabilities, Means amd Variances in Mixed Graphical Association Models", Steffen Lauritzen, JASA 87(420):1098--1108, 1992. It is reprinted in the book "Probabilistic Networks and Expert Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter, Springer, 1999.)

We can create this model as follows.

F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
n = 9;

dag = zeros(n);
dag(F,E)=1;
dag(W,[E Min D]) = 1;
dag(E,D)=1;
dag(B,[C D])=1;
dag(D,[L Mout])=1;
dag(Min,Mout)=1;

% node sizes - all cts nodes are scalar, all discrete nodes are binary
ns = ones(1, n);
dnodes = [F W B];
cnodes = mysetdiff(1:n, dnodes);
ns(dnodes) = 2;

bnet = mk_bnet(dag, ns, dnodes);

'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.

The parameters of the discrete nodes can be specified as follows.

bnet.CPD{B} = tabular_CPD(bnet, B, [0.85 0.15]'); % 1=stable, 2=unstable
bnet.CPD{F} = tabular_CPD(bnet, F, [0.95 0.05]'); % 1=intact, 2=defect
bnet.CPD{W} = tabular_CPD(bnet, W, [2/7 5/7]'); % 1=industrial, 2=household

In general, Gaussian nodes are parameterized as follows. Suppose the node is called Y, its continuous parents (if any) are called X, and its discrete parents (if any) are called Q. The the distribution on Y is defined as follows:

- no parents: Y ~ N(mu, Sigma)
- cts parents : Y|X=x ~ N(mu + W x, Sigma)
- discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
- cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) x, Sigma(:,:,i))

where N(mu, Sigma) denotes a Normal distribution with mean mu and covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q respectively. If there are no discrete parents, |Q|=1; if there is more than one, then |Q| = a vector of the sizes of each discrete parent. If there are no continuous parents, |X|=0; if there is more than one, then |X| = the sum of their sizes. Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) matrix.

We can specify the parameters of the model as follows. (If we omit any parameters, random values of the appropriate size will be created.)


mu = reshape([-3.9 -0.4 -3.2 -0.5], [1 2 2]);
Sigma = reshape([0.00002 0.0001 0.00002 0.0001], [1 1 2 2]);
bnet.CPD{E} = gaussian_CPD(bnet, E, mu, Sigma);

mu = reshape([6.5 6.0 7.5 7.0], [1 2 2]);
Sigma = reshape([0.03 0.04 0.1 0.1], [1 1 2 2]);
weights = reshape([1 1 1 1], [1 1 2 2]);
bnet.CPD{D} = gaussian_CPD(bnet, D, mu, Sigma, weights);

mu = reshape([-2 -1], [1 2]);
Sigma = reshape([0.1 0.3], [1 1 2]);
bnet.CPD{C} = gaussian_CPD(bnet, C, mu, Sigma);

bnet.CPD{L} = gaussian_CPD(bnet, L, 3, 0.25, -0.5);

mu = reshape([0.5 -0.5], [1 2]);
Sigma = reshape([0.01 0.005], [1 1 2]);
bnet.CPD{Min} = gaussian_CPD(bnet, Min, mu, Sigma);

bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 0, 0.002, [1 1]);

Inference

Let us now perform inference. First we compute the unconditional marginals.

engine = jtree_inf_engine(bnet);
evidence = cell(1,n);
[engine, ll] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, E);

'marg' is a structure that contains the fields 'mu' and 'Sigma', which contain the mean and (co)variance of the marginal on E. In this case, they are both scalars. Let us check they match the published figures to 2dp. (We can't expect more precision than this in general because I have implemented the algorithm of Lauritzen (1992), which can be numerically unstable.)

tol = 1e-2;
assert(approxeq(marg.mu, -3.25, tol));
assert(approxeq(sqrt(marg.Sigma), 0.709, tol));

We can compute the other posteriors similarly. Now let us add some evidence.

evidence = cell(1,n);
evidence{W} = 1; % industrial
evidence{L} = 1.1;
evidence{C} = -0.9;
[engine, ll] = enter_evidence(engine, evidence);

Now we find

marg = marginal_nodes(engine, E);
assert(approxeq(marg.mu, -3.8983, tol));
assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));

We can also compute the joint probability on a set of nodes. For example, P(D, Mout | evidence) is a 2D Gaussian:

marg = marginal_nodes(engine, [D Mout])
marg = 
    domain: [6 8]
        mu: [2x1 double]
     Sigma: [2x2 double]
         T: 1.0000

The mean is

marg.mu
ans =
    3.6077
    4.1077

and the covariance matrix is

marg.Sigma
ans =
    0.1062    0.1062
    0.1062    0.1182

It is easy to visualize this posterior using standard Matlab plotting functions. The T field indicates that the mixing weight of this Gaussian component is 1.0. If the joint contains discrete and continuous variables, the result will be a mixture of Gaussians, e.g.,

marg = marginal_nodes(engine, [F E])
    domain: [1 3]
        mu: [-3.9000 -0.4003]
     Sigma: [1x1x2 double]
         T: [0.9995 4.7373e-04]

BNT sets the effective size of observed continuous nodes to 0, and the effective size of observed discrete nodes to 1, since they have known values, and hence can effectively be removed from the graph. For example,

marg = marginal_nodes(engine, [B C])
    domain: [4 5]
        mu: []
     Sigma: []
         T: [0.0123 0.9877]

It is easy to convert this to change this way of handling evidence: see the file BNT/examples/static/cg1 for an example.

Hierarchical mixtures of experts

Below we show the Mixture of Experts as a Bayes net. We also show the Hierarchical Mixture of Experts model, where the hierarchy has two levels. (This is essentially a probabilistic decision tree of height two.) As before, circles denote continuous-valued nodes, squares denote discrete nodes, clear means hidden, and shaded means observed.

X is the observed input, Y is the output, and the Q nodes are hidden "gating" nodes, which select the appropriate set of parameters for Y. During training, Y is assumed observed, but for testing, the goal is to predict Y given X. Note that this is a conditional density model, so we don't associate any parameters with X. Hence X's CPD will be a root CPD, which is a way of modelling exogenous nodes. If the output is a continuous-valued quantity, we assume the "experts" are linear-regression units, and set Y's CPD to linear-Gaussian. If the output is discrete, we set Y's CPD to a softmax function. The Q CPDs will always be softmax functions.

The softmax distribution (also known as the multinomial logit function) is as follows:

                    exp(w(:,i)'*x + b(i)) 
Pr(Q=i | X=x)  =  -----------------------------
                  sum_j   exp(w(:,j)'*x + b(j))

The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the following interpretation: w(:,i)-w(:,j) is the normal vector to the decision boundary between classes i and j, and b(i)-b(j) is its offset (bias). For example, suppose X is a 2-vector, and Q is binary. Then

w = [1 -1;
     0 0];

b = [0 0];

means class 1 are points in the 2D plane with positive x coordinate, and class 2 are points in the 2D plane with negative x coordinate. If w has large magnitude, the decision boundary is sharp, otherwise it is soft.

In the special case that Q is binary (0/1), the softmax function reduces to the logistic (sigmoid) function as follows (we drop the offset term b for simplicity):

mu = E[Q | X=x] = Pr(Q = 1 | X=x) 

                 exp(w(:,1)'*x) 
  =  --------------------------------------------
       exp(w(:,1)'*x) + exp(w(:,0)'*x) 

                      exp(0)
  =  ---------------------------------------------
             exp(0) + exp([w(:,0)-w(:,1)]'*x)
                
           1
  = -----------------------------  
    1 + exp(-theta'*x)

where theta = w(:,1) - w(:,0). Hence in general

                     exp( [theta'*x] Q )
Pr(Q=i | X=x) =  -----------------------------
                      1 + exp(theta'*x)

Another way of writing this is as follows

     mu
ln -----  = theta' * x
    1-mu

where the LHS is the logit function (ln of the log-odds ratio). This is how logistic regression is usually defined.

Fitting a softmax function can be done using the iteratively reweighted least squares (IRLS) algorithm. We use the implementation from netlab. Note that since the softmax distribution is not in the exponential family, it does not have finite sufficient statistics, and hence we must store all the training data in uncompressed form. If this takes too much space, one should use online (stochastic) gradient descent (not implemented in BNT).

For more details on the softmax function, the hierarchical mixtures of experts model and logistic regression respectively, I recommend the following:

Why the logistic function? A tutorial discussion on probabilities and neural networks. M. I. Jordan. MIT Computational Cognitive Science Report 9503, August 1995.
Hierarchical mixtures of experts and the EM algorithm M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
"Generalized Linear Models", McCullagh and Nelder, Chapman and Halll, 1983.

As a concrete example, consider the mixture of experts model where X and Y are scalars, and Q is binary. This is just piecewise linear regression, where we have two line segments, i.e.,

We can create this model with random parameters as follows.

X = 1;
Q = 2;
Y = 3;
dag = zeros(3,3);
dag(X,[Q Y]) = 1
dag(Q,Y) = 1;
ns = [1 2 1]; % make X and Y scalars, and have 2 experts
dnodes = [2];
onodes = [1 3]; % observed nodes
bnet = mk_bnet(dag, ns, dnodes);

rand('state', 0);
randn('state', 0);
bnet.CPD{1} = root_CPD(bnet, 1);
bnet.CPD{2} = softmax_CPD(bnet, 2);
bnet.CPD{3} = gaussian_CPD(bnet, 3);

Now let us fit this model using EM. First we load the 200 training cases and plot them.

load data -ascii;
plot(data(:,1), data(:,2), '.');

This is what the model looks like before training. (Thanks to Thomas Hoffman for writing this plotting routine.)

Now let's train the model, and plot the final performance.

ncases = size(data, 1); % each row of data is a training case
cases = cell(3, ncases);
cases([1 3], :) = num2cell(data'); % each column of cases is a training case
engine = jtree_inf_engine(bnet, onodes);
max_iter = 20;
[engine2, LL2] = learn_params(engine, cases, max_iter);

(We specify which nodes will be observed when we create the engine. Hence BNT knows that the hidden nodes are all discrete. For complex models, this can lead to a significant speedup.) Below we show what the model looks like after 16 iterations of EM (with 100 IRLS iterations per M step), when it converged using the default convergence tolerance (that the fractional change in the log-likelihood be less than 1e-3)/ Before learning, the log-likelihood was -322.927442; afterwards, it was -13.728778.

PCA, ICA, and all that

In Figure (a) below, we show how Factor Analysis can be thought of as a graphical model. Here, X has an N(0,I) prior, and Y|X=x ~ N(mu + Wx, Psi), where Psi is diagonal and W is called the "factor loading matrix". Since the noise on both X and Y is diagonal, the components of these vectors are uncorrelated, and hence can be represented as individual scalar nodes, as we show in (b). (This is useful if parts of the observations on the Y vector are occasionally missing.) We usually take k=|X| << |Y|=D, so the model tries to explain many observations using a low-dimensional subspace.


(a)	(b)	(c)	(d)

We can create this model in BNT as follows.

ns = [k D];
dag = zeros(2,2);
dag(1,2) = 1;
dnodes = [];
onodes = 2;
bnet = mk_bnet(dag, ns, dnodes);
bnet.CPD{1} = gaussian_CPD(bnet, 1, zeros(k,1), eye(k), [], 'diag', 'untied', 'clamp_mean', 'clamp_cov');
bnet.CPD{2} = gaussian_CPD(bnet, 2, zeros(D,1), diag(Psi0), W0, 'diag', 'untied', 'clamp_mean');

The root node is clamped to the N(0,I) distribution, so that we will not update these parameters during learning. The mean of the leaf node is clamped to 0, since we assume the data has been centered (had its mean subtracted off); this is just for simplicity. Finally, the covariance of the leaf node is constrained to be diagonal. W0 and Psi0 are the initial parameter guesses.

We can fit this model using EM in the usual way (see the file BNT/examples/static/fa1 for details). Not surprisingly, the ML estimates for mu and Psi turn out to be identical to the sample mean and variance, which can be computed directly as

mu_ML = mean(data);
Psi_ML = diag(cov(data));

Note that W can only be identified up to a rotation matrix, because of the spherical symmetry of the source.

If we restrict Psi to be spherical, i.e., Psi = sigma I, there is a closed-form solution for W as well, i.e., we do not need to use EM. In particular, W contains the first |X| eigenvectors of the sample covariance matrix, with scalings determined by the eigenvalues and sigma. Classical PCA can be obtained by takting the sigma->0 limit. For details, see

"EM algorithms for PCA and SPCA", Sam Roweis, NIPS 97. ( Matlab software)
"Mixtures of probabilistic principal component analyzers", Tipping and Bishop, Neural Computation 11(2):443--482, 1999.

By adding a hidden discrete variable, we can create mixtures of FA models, as shown in (c). Now we can explain the data using a set of subspaces. We can create this model in BNT as follows.

ns = [M k D];
dag = zeros(3);
dag(1,3) = 1;
dag(2,3) = 1;
dnodes = 1;
onodes = 3;

bnet = mk_bnet(dag, ns, dnodes);
bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
bnet.CPD{2} = gaussian_CPD(bnet, 2, zeros(k, 1), eye(k), [], 'diag', 'untied', 'clamp_mean', 'clamp_cov');
bnet.CPD{3} = gaussian_CPD(bnet, 3, Mu0', repmat(diag(Psi0), [1 1 M]), W0, 'diag', 'tied');

Notice how the covariance matrix for Y is the same for all values of Q; that is, the noise level in each sub-space is assumed the same. However, we allow the offset, mu, to vary. For details, see

The EM Algorithm for Mixtures of Factor Analyzers , Ghahramani, Z. and Hinton, G.E. (1996), University of Toronto Technical Report CRG-TR-96-1. (Matlab software)
"Mixtures of probabilistic principal component analyzers", Tipping and Bishop, Neural Computation 11(2):443--482, 1999.

I have included Zoubin's specialized MFA code (with his permission) with the toolbox, so you can check that BNT gives the same results: see 'BNT/examples/static/mfa1.m'.

Independent Factor Analysis (IFA) generalizes FA by allowing a non-Gaussian prior on each component of X. (Note that we can approximate a non-Gaussian prior using a mixture of Gaussians.) This means that the likelihood function is no longer rotationally invariant, so we can uniquely identify W and the hidden sources X. IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). We recover classical Independent Components Analysis (ICA) in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the weight matrix W is square and invertible. For details, see

Independent Factor Analysis, H. Attias, Neural Computation 11: 803--851, 1998.

QMR

Bayes nets originally arose out of an attempt to add probabilities to expert systems, and this is still the most common use for BNs. A famous example is QMR-DT, a decision-theoretic reformulation of the Quick Medical Reference (QMR) model.

Here, the top layer represents hidden disease nodes, and the bottom layer represents observed symptom nodes. The goal is to infer the posterior probability of each disease given all the symptoms (which can be present, absent or unknown).

The diseases have Bernoulli CPDs, and the findings have noisy-or CPDs. A noisy-OR node is like a regular logical OR gate except that sometimes the effects of parents that are on get inhibited. Let the prob. that parent i gets inhibited be q(i). Then a node, C, with 2 parents, A and B, has the following CPD, where we use F and T to represent off and on (1 and 2 in BNT).

A  B  P(C=off)      P(C=on)
---------------------------
F  F  1.0           0.0
T  F  q(A)          1-q(A)
F  T  q(B)          1-q(B)
T  T  q(A)q(B)      q-q(A)q(B)

Thus we see that the causes get inhibited independently. It is common to associate a "leak" node with a noisy-or CPD, which is like a parent that is always on. This can account for all other unmodelled causes which might turn the node on.

The noisy-or distribution is similar to the logistic distribution defined earlier. To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) be the prob. that j inhibits i. Then

Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)

Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then

Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))

For a sigmoid node, we have

Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))

where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of the activation function (although both are monotonically increasing). In addition, in the case of a noisy-or, the weights are constrained to be positive, since they derive from probabilities q(i,j).

We can create a random QMR-like model as follows. (We are not allowed to publish the actual parameters, since they are proprietary.)

pMax = 0.01;
Nfindings = 10;
Ndiseases = 5; 

N=Nfindings+Ndiseases;
findings = Ndiseases+1:N;
diseases = 1:Ndiseases;

% G(i,j) = 1 iff there is an arc from disease i to finding j
G = zeros(Ndiseases, Nfindings);
for i=1:Nfindings
  v= rand(1,Ndiseases);
  rents = find(v<0.8);
  if (length(rents)==0)
    rents=ceil(rand(1)*Ndiseases);
  end
  G(rents,i)=1;
end       

% prior(i) = prob. disease i is on
prior = pMax*rand(1,Ndiseases);

% leak(j) = inhibition prob. on leak->j arc
leak = 0.5*rand(1,Nfindings); % in real QMR, leak approx exp(-0.02) = 0.98     
%leak = ones(1,Nfindings); % turns off leaks

% inhibit(i,j) = inhibition probability on i->j arc
inhibit = rand(Ndiseases, Nfindings);
inhibit(not(G)) = 1;

ns = 2*ones(1,N);
dag = zeros(N,N);
dag(diseases, findings) = G;
bnet = mk_bnet(dag, ns);

for d=1:Ndiseases
  CPT = [1-prior(d) prior(d)];
  bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
end

for i=1:Nfindings
  bnet.CPD{fnode} = noisyor_CPD(bnet, findings(i), leak(i), inhibit(parents(G,i), i));
end

Let's visualize this graph structure using code contributed by Ali Taylan Cemgil from the University of Nijmegen:

draw_layout(bnet.dag);

Now let us put some random evidence on all the leaves except the very first and very last, and compute the disease posteriors.

pos = 2:floor(Nfindings/2);
neg = (pos(end)+1):(Nfindings-1);
onodes = myunion(pos, neg);
evidence = cell(1, N);
evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
engine = jtree_inf_engine(bnet, onodes);
[engine, ll] = enter_evidence(engine, evidence);
post = zeros(1, Ndiseases);
for i=diseases(:)'
  m = marginal_nodes(engine, i);
  post(i) = m.T(2);
end

The junction tree algorithm is quite slow on the QMR network, since the cliques are so big. One simple trick we can use is to notice that hidden leaves do not affect the posteriors on the roots, and hence do not need to be included in the network. A second trick is to notice that the negative findings can be "absorbed" into the prior: see the file BNT/examples/static/mk_minimal_qmr_bnet for details. A much more significant speedup is obtained by exploiting special properties of the noisy-or node, as done by the quickscore algorithm. For details, see

Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89.
Rish and Dechter, "On the impact of causal independence", UCI tech report, 1998.

This has been implemented in BNT as a special-purpose inference engine, which can be created and used as follows:

engine = quickscore_inf_engine(inhibit, leak, prior);
engine = enter_evidence(engine, pos, neg);
m = marginal_nodes(engine, i);

Even using quickscore, exact inference takes time that is exponential in the number of positive findings. Hence for large networks we need to resort to approximate inference techniques. See for example

T. Jaakkola and M. Jordan, "Variational probabilistic inference and the QMR-DT network", JAIR 10, 1999.
K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study", UAI 99.

Some of these approximate inference algorithms have been implemented in BNT. See the file BNT/examples/static/qmr1 for a comparison.

Variable elimination

As another example of an inference engine, besides junction tree, we now explain variable elimination, which is appealingly simple. Consider the problem of computing the likelihood term, Pr(W), in the water sprinkler network. We can use the factored representation of the joint to do this efficiently. The key idea is to "push sums in" as far as possible when summing (marginalizing) out irrelevant terms, e.g.,

Notice that, as we perform the innermost sums, we create new terms, which need to be summed over in turn e.g.,

where

Continuing in this way,

where

This algorithm is called Variable Elimination. For details, see

R. Dechter, "Bucket Elimination: A Unifying Framework for Probabilistic Inference", UA1 96, pp. 211-219.
Z. Li and B. D'Ambrosio, "Efficient inference in Bayes networks as a combinatorial optimization problem", Intl. J. Approximate Reasoning, 11(1):55-81, 1994

The principle of distributing sums over products can be generalized greatly to apply to any commutative semiring. This forms the basis of many common algorithms, such as Viterbi decoding and the Fast Fourier Transform. For details, see

R. McEliece and S. M. Aji, 1998. "The Generalized Distributive Law", Submitted to IEEE Trans. Inform. Theory.

The amount of work we perform when computing a marginal is bounded by the size of the largest term that we encounter. Choosing a summation (elimination) ordering to minimize this is NP-hard, although greedy algorithms work well in practice. The implementation of variable elimination in BNT makes no attempt to find a good ordering; however, the junction tree code implements a greedy heuristic that could be transferred over.

Inference engines

An inference engine is an object that contains a bnet and supports the 'enter_evidence' and 'marginal_nodes' methods. The engine constructor takes the bnet as argument and may do some model-specific processing. When 'enter_evidence' is called, the engine may do some evidence-specific processing. Finally, when 'marginal_nodes' is called, the engine may do some query-specific processing.

The amount of work done when each stage is specified -- structure, parameters, evidence, and query -- depends on the engine. The cost of work done early in this sequence can be amortized. On the other hand, one can make better optimizations when one knows what the parameters/evidence/query is. For example, the parameters might imply conditional indpendencies that are not evident in the graph structure, but can nevertheless be exploited; the evidence indicates which nodes are observed and hence can effectively be disconnected from the graph; and the query might indicate that large parts of the network are d-separated from the query nodes. (Since it is not the actual *values* of the evidence that matters, just which nodes are observed, many engines allow you to specify this when they are constructed, i.e., before calling 'enter_evidence'. Some engines can still cope if the actual pattern of evidence is different, e.g., if there is missing data.)

One important distinction is whether marginal_nodes is fast or not, since this is the function that is called most frequently. For example, when learning, we need to call marginal_nodes n times, where n is the number of nodes. Variable elimination would end up repeating a lot of work each time marginal_nodes is called, making it inefficient for learning. The junction tree algorithm, by contrast, calculates all marginals in two passes during 'enter_evidence', so calling 'marginal_nodes' takes constant time.

The inference engines differ in many other important ways. Here are some of the major "axes":

What work is done when the structure, parameter, evidence and queries are specified?
Designed for static or dynamic (temporal) models?
Exact or approximate inference?
Works for all topologies or makes restrictions?
Works for all node types or makes restrictions?
Handles any pattern of evidence, or must be fixed?

The node types deserve elaboration. When all the hidden nodes are discrete or Gaussian, we can perform exact (closed form) inference. If there are both discrete and Gaussian hidden nodes, we require that there not be any arcs from hidden continuous nodes to discrete (hidden or observed) nodes, i.e., no C->D arcs. To handle non-discrete, non-Gaussian distributions, we usually have to resort to sampling. However, even when we can solve the problem in closed form, we often need to use approximation algorithms for reasons of computational efficiency. We may also want to place additional restrictions to make the implementation simpler. We identify the following classes of restrictions:

all nodes must be discrete (ALLD)
all hidden nodes must be discrete (D)
all nodes must be Gaussian (ALLG)
all hidden nodes must be Gaussian (G)
hidden nodes may be discrete or Gaussian (CG)
CG case without the C->D restriction (CG+)
no restrictions on the hidden nodes (ANY)

Here is a list of the engines for static networks. Click on the name of the engine for more details, including references. We specify what processing happens when the model is specified, the evidence is specified, and the query is specified.

Name Exact? Node type? topology evid. model proc ev. proc query proc

cond_gauss exact CG all any compute joint condition joint marginalize joint
enumerative exact allD all any none none marginalize joint
gaussian exact allG all any compute joint condition joint marginalize joint
jtree exact D,G,CG all any find elim order propagate msgs marginalize clique
jtree_fast exact D all fixed find elim order propagate msgs marginalize clique
jtree_onepass exact D,G,CG any all find elim order propagate msgs lookup
jtree_onepass_fast exact D fixed all find elim order propagate msgs lookup
likelihood_weighting approx any all any none sample lookup
loopy_pearl approx D,G all any none propagate msgs lookup
pearl exact D,G polytree any none propagate msgs lookup
quickscore exact noisy-or QMR leaves none marginalize root nodes lookup
var_elim exact D,G,CG all any none none nested marginalizations

Documentation last updated: 22 May 2000

Name	Exact?	Node type?	topology	evid.	model proc	ev. proc	query proc
cond_gauss	exact	CG	all	any	compute joint	condition joint	marginalize joint
enumerative	exact	allD	all	any	none	none	marginalize joint
gaussian	exact	allG	all	any	compute joint	condition joint	marginalize joint
jtree	exact	D,G,CG	all	any	find elim order	propagate msgs	marginalize clique
jtree_fast	exact	D	all	fixed	find elim order	propagate msgs	marginalize clique
jtree_onepass	exact	D,G,CG	any	all	find elim order	propagate msgs	lookup
jtree_onepass_fast	exact	D	fixed	all	find elim order	propagate msgs	lookup
likelihood_weighting	approx	any	all	any	none	sample	lookup
loopy_pearl	approx	D,G	all	any	none	propagate msgs	lookup
pearl	exact	D,G	polytree	any	none	propagate msgs	lookup
quickscore	exact	noisy-or	QMR	leaves	none	marginalize root nodes	lookup
var_elim	exact	D,G,CG	all	any	none	none	nested marginalizations