To specify this directed acyclic graph (dag), we create an adjacency matrix:
N = 4;
dag = zeros(N,N);
C = 1; S = 2; R = 3; W = 4;
dag(C,[R S]) = 1;
dag(R,W) = 1;
dag(S,W) = 1;
We have numbered the nodes as follows: Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. The nodes must always be numbered in topological order, i.e., ancestors before descendants. For a more complicated graph, this is a little inconvenient: we will see how to get around this below.
In addition to specifying the graph structure, we must specify the size and type of each node. If a node is discrete, its size is the number of possible values each node can take on; if a node is continuous, it can be a vector, and its size is the length of this vector. In this case, we will assume all nodes are discrete and binary.
discrete_nodes = 1:N;
node_sizes = 2*ones(1,N);

We are now ready to make the Bayes net:

bnet = mk_bnet(dag, node_sizes, discrete_nodes);

There are several optional arguments to the 'mk_bnet' function, some of which we will see later. In general, to find out more about a function, please see its documentation string by typing

help mk_bnet

See also other useful Matlab tips.
A model consists of the graph structure and the parameters. The parameters are represented by CPD objects (CPD = Conditional Probability Distribution), which define the probability distribution of a node given its parents. (We will use the terms "node" and "random variable" interchangeably.) The simplest kind of CPD is a table (multi-dimensional array), which is suitable when all the nodes are discrete-valued. Note that the discrete values are not assumed to be ordered in any way; that is, they represent categorical quantities, like male and female, rather than ordinal quantities, like low, medium and high.
Tabular CPDs, also called CPTs (conditional probability tables), are stored as multidimensional arrays, where the dimensions are arranged in the same order as the nodes, e.g., the CPT for node 4 (WetGrass) is indexed by Sprinkler (2), then Rain (3), and then WetGrass (4) itself. Hence the child is always the last dimension. If a node has no parents, its CPT is a column vector representing its prior. Note that in Matlab (unlike C), arrays are indexed from 1, and are laid out in memory such that the first index toggles fastest, e.g., the CPT for node 4 (WetGrass) is as follows:

S R  P(W=F)  P(W=T)
---------------------
F F   1.0     0.0
T F   0.1     0.9
F T   0.1     0.9
T T   0.01    0.99

where we have used the convention that false==1, true==2. We can create this CPT in Matlab as follows
CPT = zeros(2,2,2);
CPT(1,1,1) = 1.0;
CPT(2,1,1) = 0.1;
...

Here is an easier way:

CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);

In fact, we don't need to reshape the array, since the CPD constructor will do that for us. So we can just write

bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

If we omit the third argument to tabular_CPD, random parameters will be created. To ensure repeatable results, use

seed = 0;
rand('state', seed);
randn('state', seed);
engine = jtree_inf_engine(bnet);

The other engines have similar constructors, but might take additional, algorithm-specific parameters. All engines are used in the same way, once they have been created. We illustrate this in the following sections.
Suppose we want to compute the probability that the sprinkler was on, given that the grass is wet. The evidence consists of the fact that W=2. All the other nodes are hidden (unobserved). We can specify this as follows.
evidence = cell(1,N);
evidence{W} = 2;

We use a 1D cell array instead of a vector to cope with the fact that nodes can be vectors of different lengths. In addition, the value [] can be used to denote 'no evidence', instead of having to specify the observation pattern as a separate argument. We are now ready to add the evidence to the engine.

[engine, loglik] = enter_evidence(engine, evidence);

The behavior of this function is algorithm-specific, and is discussed in more detail below. In the case of the jtree engine, enter_evidence implements a two-pass message-passing scheme. The first return argument contains the modified engine, which incorporates the evidence. The second return argument contains the log-likelihood of the evidence. (Not all engines are capable of computing the log-likelihood.)
Finally, we can compute p=P(S=2|W=2) as follows.
marg = marginal_nodes(engine, S);
p = marg.T(2);

We find that p = 0.4298.
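Since the network is so small, we can sanity-check this number by brute-force enumeration of the joint distribution, using nothing but the CPTs defined above (a standalone sketch in plain Matlab, not a BNT call; the CPT variables are redefined locally so the snippet is self-contained):

% Brute-force computation of P(S=2|W=2) for the sprinkler network.
% Conventions as above: parent indices first, child last; false=1, true=2.
priorC = [0.5 0.5];                                          % P(C)
CPT_S  = reshape([0.5 0.9 0.5 0.1], [2 2]);                  % CPT_S(c,s) = P(S=s|C=c)
CPT_R  = reshape([0.8 0.2 0.2 0.8], [2 2]);                  % CPT_R(c,r) = P(R=r|C=c)
CPT_W  = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);  % CPT_W(s,r,w) = P(W=w|S=s,R=r)

joint = zeros(2,2,2,2);   % joint(c,s,r,w)
for c=1:2
  for s=1:2
    for r=1:2
      for w=1:2
        joint(c,s,r,w) = priorC(c) * CPT_S(c,s) * CPT_R(c,r) * CPT_W(s,r,w);
      end
    end
  end
end
num = joint(:,2,:,2);   % S=2 and W=2
den = joint(:,:,:,2);   % W=2
p = sum(num(:)) / sum(den(:))   % 0.4298, matching the jtree result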
Now let us add the evidence that it was raining, and see what difference it makes.
evidence{R} = 2;
[engine, loglik] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, S);
p = marg.T(2);

We find that p = P(S=2|W=2,R=2) = 0.1945, which is lower than before: the rain can "explain away" the fact that the grass is wet, so the sprinkler is less likely to have been on.
We can also compute the joint distribution on a set of nodes. For example, here is the joint on S, R and W given no evidence.

evidence = cell(1,N);
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W]);

>> m
m = 
    domain: [2 3 4]
         T: [2x2x2 double]
>> m.T
ans(:,:,1) =
    0.2900    0.0410
    0.0210    0.0009
ans(:,:,2) =
         0    0.3690
    0.1890    0.0891

m is a structure. The 'T' field is a multi-dimensional array (in this case, 3-dimensional) that contains the joint probability distribution on the specified nodes. We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass to be wet if both the rain and sprinkler are off.
Let us now add some evidence to R.
evidence{R} = 2;
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W]);

>> m
m = 
    domain: [2 3 4]
         T: [2x1x2 double]
>> m.T
ans(:,:,1) =
    0.0820
    0.0018
ans(:,:,2) =
    0.7380
    0.1782

The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible with the evidence that R=2. Instead of creating large tables with many 0s, BNT sets the effective size of observed (discrete) nodes to 1. This is why m.T has size 2x1x2. We can convert this back to a 2x2x2 table as follows.
fullm = add_ev_to_dmarginal(m, evidence, ns)
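(If you only need the idea rather than the library call, the expansion amounts to the following, sketched in plain Matlab rather than with BNT's own routine: put the reduced 2x1x2 table back into a full 2x2x2 table, with zeros in the slice that contradicts the evidence R=2.)

% Illustrative only: expand the reduced marginal by hand.
fullT = zeros(2,2,2);    % full table over (S,R,W)
fullT(:,2,:) = m.T;      % the evidence says R=2, so the R=1 slice stays all zeros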
Note: It is not always possible to compute the joint on arbitrary sets of nodes: it depends on the engine, as discussed in more detail below.
nsamples = 50;
samples = sample_bnet(bnet, nsamples);

% hide 50% of the nodes
samples2 = samples;
hide = rand(N, nsamples) > 0.5;
[I,J] = find(hide);
for k=1:length(I)
  samples2{I(k), J(k)} = [];
end

Each column of 'samples' is a training case. Now we create a network with random parameters.
% Make a tabula rasa
bnet2 = mk_bnet(dag, ns);
seed = 0;
rand('state', seed);
bnet2.CPD{C} = tabular_CPD(bnet2, C);
bnet2.CPD{R} = tabular_CPD(bnet2, R);
bnet2.CPD{S} = tabular_CPD(bnet2, S);
bnet2.CPD{W} = tabular_CPD(bnet2, W);
engine2 = jtree_inf_engine(bnet2);

Finally, we find the maximum likelihood estimates of the parameters using EM.
max_iter = 10;
[engine3, LL] = learn_params(engine2, samples2, max_iter);

LL(i) is the log-likelihood at iteration i. We can plot this as follows:

plot(LL, 'x-')

To view the learned parameters, we need to extract the bnet structure from the engine object, and then extract the parameters from each CPD object. I have not created a method for the latter, so we just use a little Matlab hackery.

bnet3 = bnet_from_engine(engine3);
CPT3 = cell(1,N);
for i=1:N
  s = struct(bnet3.CPD{i});  % violate object privacy
  CPT3{i} = s.CPT;
end

Let's display the results after 10 iterations of EM.
celldisp(CPT3)

CPT3{1} =
    0.6783    0.3217

CPT3{2} =
    0.7796    0.2204
    0.0531    0.9469

CPT3{3} =
    0.6719    0.3281
    0.8566    0.1434

CPT3{4} =
(:,:,1) =
    0.9762    0.5017
    0.0931    0.0058
(:,:,2) =
    0.0238    0.4983
    0.9069    0.9942

So we see that the learned parameters are fairly close to the original ones, which we display below for completeness.
CPT{1} =
    0.5000    0.5000

CPT{2} =
    0.5000    0.5000
    0.9000    0.1000

CPT{3} =
    0.8000    0.2000
    0.2000    0.8000

CPT{4} =
(:,:,1) =
    1.0000    0.1000
    0.1000    0.0100
(:,:,2) =
         0    0.9000
    0.9000    0.9900

We can get improved performance by (1) increasing the size of the training set, (2) decreasing the amount of hidden data, (3) running EM for longer, and (4) using informative priors. As an example of the latter, let us specify a Dirichlet prior (pseudo counts) for each node which is equal to the true CPT multiplied by an effective sample size (prior). (If prior = 0, we are doing maximum likelihood estimation.) The [] argument in the following means 'generate random parameters'.
bnet2 = mk_bnet(dag, ns);
seed = 0;
rand('state', seed);
prior = nsamples;
bnet2.CPD{C} = tabular_CPD(bnet2, C, [], prior*CPT{C});
bnet2.CPD{R} = tabular_CPD(bnet2, R, [], prior*CPT{R});
bnet2.CPD{S} = tabular_CPD(bnet2, S, [], prior*CPT{S});
bnet2.CPD{W} = tabular_CPD(bnet2, W, [], prior*CPT{W});

In this case, after 4 iterations of EM, we get
CPT3{1} =
    0.5564    0.4436

CPT3{2} =
    0.8076    0.1924
    0.1950    0.8050

CPT3{3} =
    0.5428    0.4572
    0.9021    0.0979

CPT3{4} =
(:,:,1) =
    0.9998    0.1316
    0.0985    0.0099
(:,:,2) =
    0.0002    0.8684
    0.9015    0.9901

Note that by using a prior, EM is no longer guaranteed to increase the likelihood at each iteration; instead, it will be increasing the height of the posterior mode.
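To see what pseudo counts do to a single CPT row, here is a small illustrative calculation (plain Matlab, nothing BNT-specific; the counts are made up): the regularized estimate is the observed counts plus the pseudo counts, renormalized, so it stays away from 0 and 1 unless the data overwhelm the prior.

% Illustrative effect of Dirichlet pseudo counts on one CPT row (binary node).
counts = [0 3];        % suppose value 1 was never observed in the data
pseudo = [1 1];        % a uniform prior with effective sample size 2
est_ml  = counts ./ sum(counts)                      % ML estimate: [0 1]
est_map = (counts + pseudo) ./ sum(counts + pseudo)  % regularized: [0.2 0.8]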
To add a uniform (uninformative) Dirichlet prior for regularization purposes, use something like this:
bnet2.CPD{W} = tabular_CPD(bnet2, W, [], prior*normalise(myones(ns([R S W]))));

This will avoid estimating a parameter as 0 just because there is insufficient training data. I recommend the following tutorials for more details on this kind of thing:
We can create this model as follows.
F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
n = 9;

dag = zeros(n);
dag(F,E) = 1;
dag(W,[E Min D]) = 1;
dag(E,D) = 1;
dag(B,[C D]) = 1;
dag(D,[L Mout]) = 1;
dag(Min,Mout) = 1;

% node sizes - all cts nodes are scalar, all discrete nodes are binary
ns = ones(1, n);
dnodes = [F W B];
cnodes = mysetdiff(1:n, dnodes);
ns(dnodes) = 2;

bnet = mk_bnet(dag, ns, dnodes);

'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
The parameters of the discrete nodes can be specified as follows.
bnet.CPD{B} = tabular_CPD(bnet, B, [0.85 0.15]');  % 1=stable, 2=unstable
bnet.CPD{F} = tabular_CPD(bnet, F, [0.95 0.05]');  % 1=intact, 2=defect
bnet.CPD{W} = tabular_CPD(bnet, W, [2/7 5/7]');    % 1=industrial, 2=household

In general, Gaussian nodes are parameterized as follows. Suppose the node is called Y, its continuous parents (if any) are called X, and its discrete parents (if any) are called Q. Then the distribution on Y is defined as follows:
- no parents: Y ~ N(mu, Sigma)
- cts parents: Y|X=x ~ N(mu + W x, Sigma)
- discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
- cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) x, Sigma(:,:,i))

where N(mu, Sigma) denotes a Normal distribution with mean mu and covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q respectively. If there are no discrete parents, |Q|=1; if there is more than one, then |Q| = a vector of the sizes of each discrete parent. If there are no continuous parents, |X|=0; if there is more than one, then |X| = the sum of their sizes. Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) matrix.
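To make the shapes concrete, here is a small sketch (plain Matlab, not a BNT call; the sizes and parameter values are arbitrary) that evaluates the conditional mean and covariance of Y for a given continuous parent value x and discrete parent state i:

% Illustrative sizes: |Y|=2, |X|=3, two joint settings of the discrete parents.
Ysz = 2; Xsz = 3; Qsz = 2;
mu    = randn(Ysz, Qsz);              % mu(:,i)
Sigma = repmat(eye(Ysz), [1 1 Qsz]);  % Sigma(:,:,i)
W     = randn(Ysz, Xsz, Qsz);         % W(:,:,i)

x = randn(Xsz, 1);  i = 2;            % parent values
cond_mean = mu(:,i) + W(:,:,i) * x;   % mean of Y | X=x, Q=i
cond_cov  = Sigma(:,:,i);             % covariance of Y | X=x, Q=i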
We can specify the parameters of the model as follows. (If we omit any parameters, random values of the appropriate size will be created.)
mu = reshape([-3.9 -0.4 -3.2 -0.5], [1 2 2]);
Sigma = reshape([0.00002 0.0001 0.00002 0.0001], [1 1 2 2]);
bnet.CPD{E} = gaussian_CPD(bnet, E, mu, Sigma);

mu = reshape([6.5 6.0 7.5 7.0], [1 2 2]);
Sigma = reshape([0.03 0.04 0.1 0.1], [1 1 2 2]);
weights = reshape([1 1 1 1], [1 1 2 2]);
bnet.CPD{D} = gaussian_CPD(bnet, D, mu, Sigma, weights);

mu = reshape([-2 -1], [1 2]);
Sigma = reshape([0.1 0.3], [1 1 2]);
bnet.CPD{C} = gaussian_CPD(bnet, C, mu, Sigma);

bnet.CPD{L} = gaussian_CPD(bnet, L, 3, 0.25, -0.5);

mu = reshape([0.5 -0.5], [1 2]);
Sigma = reshape([0.01 0.005], [1 1 2]);
bnet.CPD{Min} = gaussian_CPD(bnet, Min, mu, Sigma);

bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 0, 0.002, [1 1]);
engine = jtree_inf_engine(bnet);
evidence = cell(1,n);
[engine, ll] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, E);

'marg' is a structure that contains the fields 'mu' and 'Sigma', which contain the mean and (co)variance of the marginal on E. In this case, they are both scalars. Let us check they match the published figures to 2dp. (We can't expect more precision than this in general because I have implemented the algorithm of Lauritzen (1992), which can be numerically unstable.)
tol = 1e-2;
assert(approxeq(marg.mu, -3.25, tol));
assert(approxeq(sqrt(marg.Sigma), 0.709, tol));

We can compute the other posteriors similarly. Now let us add some evidence.
evidence = cell(1,n);
evidence{W} = 1;   % industrial
evidence{L} = 1.1;
evidence{C} = -0.9;
[engine, ll] = enter_evidence(engine, evidence);

Now we find
marg = marginal_nodes(engine, E);
assert(approxeq(marg.mu, -3.8983, tol));
assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));

We can also compute the joint probability on a set of nodes. For example, P(D, Mout | evidence) is a 2D Gaussian:
marg = marginal_nodes(engine, [D Mout])

marg = 
    domain: [6 8]
        mu: [2x1 double]
     Sigma: [2x2 double]
         T: 1.0000

The mean is

marg.mu
ans =
    3.6077
    4.1077

and the covariance matrix is

marg.Sigma
ans =
    0.1062    0.1062
    0.1062    0.1182

It is easy to visualize this posterior using standard Matlab plotting functions.
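For example, one way (a sketch using only standard Matlab functions; the grid limits are arbitrary choices around the posterior mean) is to evaluate the density of N(marg.mu, marg.Sigma) on a grid and draw its contours:

% Contour plot of the 2D Gaussian N(marg.mu, marg.Sigma) over (D, Mout).
[dd, mm] = meshgrid(linspace(2.5, 4.5, 100), linspace(3, 5, 100));
X = [dd(:) - marg.mu(1), mm(:) - marg.mu(2)];
invS = inv(marg.Sigma);
Z = exp(-0.5 * sum((X * invS) .* X, 2)) / (2*pi*sqrt(det(marg.Sigma)));
contour(dd, mm, reshape(Z, size(dd)));
xlabel('D'); ylabel('Mout');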
The T field indicates that the mixing weight of this Gaussian component is 1.0. If the joint contains discrete and continuous variables, the result will be a mixture of Gaussians, e.g.,

marg = marginal_nodes(engine, [F E])

marg = 
    domain: [1 3]
        mu: [-3.9000 -0.4003]
     Sigma: [1x1x2 double]
         T: [0.9995 4.7373e-04]
BNT sets the effective size of observed continuous nodes to 0, and the effective size of observed discrete nodes to 1, since they have known values, and hence can effectively be removed from the graph. For example,
marg = marginal_nodes(engine, [B C])

marg = 
    domain: [4 5]
        mu: []
     Sigma: []
         T: [0.0123 0.9877]

It is possible to change this way of handling evidence: see the file BNT/examples/static/cg1 for an example.
X is the observed input, Y is the output, and the Q nodes are hidden "gating" nodes, which select the appropriate set of parameters for Y. During training, Y is assumed observed, but for testing, the goal is to predict Y given X. Note that this is a conditional density model, so we don't associate any parameters with X. Hence X's CPD will be a root CPD, which is a way of modelling exogenous nodes. If the output is a continuous-valued quantity, we assume the "experts" are linear-regression units, and set Y's CPD to linear-Gaussian. If the output is discrete, we set Y's CPD to a softmax function. The Q CPDs will always be softmax functions.
The softmax distribution (also known as the multinomial logit function) is as follows:
                   exp(w(:,i)'*x + b(i))
Pr(Q=i | X=x) = -----------------------------
                 sum_j exp(w(:,j)'*x + b(j))

The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the following interpretation: w(:,i)-w(:,j) is the normal vector to the decision boundary between classes i and j, and b(i)-b(j) is its offset (bias). For example, suppose X is a 2-vector, and Q is binary. Then
w = [1 -1; 0 0];
b = [0 0];

means class 1 are points in the 2D plane with positive x coordinate, and class 2 are points in the 2D plane with negative x coordinate. If w has large magnitude, the decision boundary is sharp, otherwise it is soft.
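We can check this interpretation by evaluating the softmax formula directly for a test point (a small sketch; the test point is arbitrary):

% Evaluate Pr(Q=i | X=x) for the two-class example above.
w = [1 -1; 0 0];  b = [0 0];
x = [2; 0];                     % a point with positive x coordinate
a = w' * x + b';                % one activation per class
p = exp(a) / sum(exp(a))        % p(1) is close to 1, i.e., class 1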
In the special case that Q is binary (0/1), the softmax function reduces to the logistic (sigmoid) function as follows (we drop the offset term b for simplicity):
mu = E[Q | X=x] = Pr(Q=1 | X=x)

               exp(w(:,1)'*x)
   = -----------------------------------
      exp(w(:,1)'*x) + exp(w(:,0)'*x)

                  exp(0)
   = -----------------------------------
      exp(0) + exp([w(:,0)-w(:,1)]'*x)

              1
   = ---------------------
      1 + exp(-theta'*x)

where theta = w(:,1) - w(:,0). Hence in general
                 exp(q * theta'*x)
Pr(Q=q | X=x) = -------------------
                 1 + exp(theta'*x)

for q in {0,1}. Another way of writing this is as follows
        mu
ln ---------- = theta' * x
      1 - mu

where the LHS is the logit function (the log of the odds ratio). This is how logistic regression is usually defined.
Fitting a softmax function can be done using the iteratively reweighted least squares (IRLS) algorithm. We use the implementation from netlab. Note that since the softmax distribution is not in the exponential family, it does not have finite sufficient statistics, and hence we must store all the training data in uncompressed form. If this takes too much space, one should use online (stochastic) gradient descent (not implemented in BNT).
For more details on the softmax function, the hierarchical mixtures of experts model and logistic regression respectively, I recommend the following:
As a concrete example, consider the mixture of experts model where X and Y are scalars, and Q is binary. This is just piecewise linear regression, where we have two line segments, i.e.,
We can create this model with random parameters as follows.
X = 1; Q = 2; Y = 3;
dag = zeros(3,3);
dag(X,[Q Y]) = 1;
dag(Q,Y) = 1;
ns = [1 2 1];     % make X and Y scalars, and have 2 experts
dnodes = [2];
onodes = [1 3];   % observed nodes
bnet = mk_bnet(dag, ns, dnodes);

rand('state', 0);
randn('state', 0);
bnet.CPD{1} = root_CPD(bnet, 1);
bnet.CPD{2} = softmax_CPD(bnet, 2);
bnet.CPD{3} = gaussian_CPD(bnet, 3);

Now let us fit this model using EM. First we load the 200 training cases and plot them.
load data -ascii; plot(data(:,1), data(:,2), '.');
This is what the model looks like before training. (Thanks to Thomas Hoffman for writing this plotting routine.)
Now let's train the model, and plot the final performance.
ncases = size(data, 1);              % each row of data is a training case
cases = cell(3, ncases);
cases([1 3], :) = num2cell(data');   % each column of cases is a training case
engine = jtree_inf_engine(bnet, onodes);
max_iter = 20;
[engine2, LL2] = learn_params(engine, cases, max_iter);

(We specify which nodes will be observed when we create the engine. Hence BNT knows that the hidden nodes are all discrete. For complex models, this can lead to a significant speedup.) Below we show what the model looks like after 16 iterations of EM (with 100 IRLS iterations per M step), when it converged using the default convergence tolerance (that the fractional change in the log-likelihood be less than 1e-3). Before learning, the log-likelihood was -322.927442; afterwards, it was -13.728778.
[Figure with four panels, (a)-(d), referred to in the text below.]
We can create this model in BNT as follows.
ns = [k D];
dag = zeros(2,2);
dag(1,2) = 1;
dnodes = [];
onodes = 2;
bnet = mk_bnet(dag, ns, dnodes);
bnet.CPD{1} = gaussian_CPD(bnet, 1, zeros(k,1), eye(k), [], 'diag', 'untied', 'clamp_mean', 'clamp_cov');
bnet.CPD{2} = gaussian_CPD(bnet, 2, zeros(D,1), diag(Psi0), W0, 'diag', 'untied', 'clamp_mean');

The root node is clamped to the N(0,I) distribution, so that we will not update these parameters during learning. The mean of the leaf node is clamped to 0, since we assume the data has been centered (had its mean subtracted off); this is just for simplicity. Finally, the covariance of the leaf node is constrained to be diagonal. W0 and Psi0 are the initial parameter guesses.
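The snippet above assumes that the latent dimension k, the observation dimension D, and the initial guesses W0 and Psi0 have already been defined, e.g. (illustrative values only; in practice they depend on your data):

% Hypothetical sizes and initial guesses for the factor analysis model.
k = 2;                  % dimension of the latent factor X
D = 4;                  % dimension of the observation Y
W0 = randn(D, k);       % initial factor loading matrix
Psi0 = ones(D, 1);      % initial diagonal of the observation noise covariance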
We can fit this model using EM in the usual way (see the file BNT/examples/static/fa1 for details). Not surprisingly, the ML estimates for mu and Psi turn out to be identical to the sample mean and variance, which can be computed directly as
mu_ML = mean(data);
Psi_ML = diag(cov(data));

Note that W can only be identified up to a rotation matrix, because of the spherical symmetry of the source.
If we restrict Psi to be spherical, i.e., Psi = sigma I, there is a closed-form solution for W as well, i.e., we do not need to use EM. In particular, W contains the first |X| eigenvectors of the sample covariance matrix, with scalings determined by the eigenvalues and sigma (a sketch of this computation is given below). Classical PCA can be obtained by taking the sigma->0 limit. For details, see
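Here is a hedged sketch of that closed-form (probabilistic PCA) construction; it assumes 'data' has one row per case and that k is the latent dimension, and it is only meant to illustrate the description above:

% Closed-form ML solution when Psi = sigma^2 * I (probabilistic PCA).
S = cov(data);                            % sample covariance (data: ncases x D)
[U, Lam] = eig(S);
[lambda, idx] = sort(diag(Lam), 'descend');
U = U(:, idx);
sigma2 = mean(lambda(k+1:end));           % average of the discarded eigenvalues
W_ml = U(:,1:k) * sqrt(diag(lambda(1:k)) - sigma2*eye(k));   % unique up to rotation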
By adding a hidden discrete variable, we can create mixtures of FA models, as shown in (c). Now we can explain the data using a set of subspaces. We can create this model in BNT as follows.
ns = [M k D];
dag = zeros(3);
dag(1,3) = 1;
dag(2,3) = 1;
dnodes = 1;
onodes = 3;
bnet = mk_bnet(dag, ns, dnodes);
bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
bnet.CPD{2} = gaussian_CPD(bnet, 2, zeros(k, 1), eye(k), [], 'diag', 'untied', 'clamp_mean', 'clamp_cov');
bnet.CPD{3} = gaussian_CPD(bnet, 3, Mu0', repmat(diag(Psi0), [1 1 M]), W0, 'diag', 'tied');

Notice how the covariance matrix for Y is the same for all values of Q; that is, the noise level in each sub-space is assumed the same. However, we allow the offset, mu, to vary. For details, see
I have included Zoubin's specialized MFA code (with his permission) with the toolbox, so you can check that BNT gives the same results: see 'BNT/examples/static/mfa1.m'.
Independent Factor Analysis (IFA) generalizes FA by allowing a non-Gaussian prior on each component of X. (Note that we can approximate a non-Gaussian prior using a mixture of Gaussians.) This means that the likelihood function is no longer rotationally invariant, so we can uniquely identify W and the hidden sources X. IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). We recover classical Independent Components Analysis (ICA) in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the weight matrix W is square and invertible. For details, see
The diseases have Bernoulli CPDs, and the findings have noisy-or CPDs. A noisy-OR node is like a regular logical OR gate except that sometimes the effects of parents that are on get inhibited. Let the prob. that parent i gets inhibited be q(i). Then a node, C, with 2 parents, A and B, has the following CPD, where we use F and T to represent off and on (1 and 2 in BNT).
A B  P(C=off)  P(C=on)
---------------------------
F F    1.0        0.0
T F    q(A)       1-q(A)
F T    q(B)       1-q(B)
T T   q(A)q(B)    1-q(A)q(B)

Thus we see that the causes get inhibited independently. It is common to associate a "leak" node with a noisy-or CPD, which is like a parent that is always on. This can account for all other unmodelled causes which might turn the node on.
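The table can also be generated mechanically. Here is a hedged sketch (plain Matlab, not BNT's noisyor_CPD constructor; the inhibition probabilities are made up) that builds the CPT for a node with two parents and an optional leak term:

% Build P(C | A, B) for a noisy-OR node; 1=off, 2=on.
qA = 0.1;  qB = 0.2;  qL = 1;     % inhibition probs; qL=1 means no leak
CPT = zeros(2,2,2);               % CPT(a,b,c)
for a=1:2
  for b=1:2
    p_off = qL * qA^(a==2) * qB^(b==2);  % each parent that is on is inhibited independently
    CPT(a,b,1) = p_off;
    CPT(a,b,2) = 1 - p_off;
  end
end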
The noisy-or distribution is similar to the logistic distribution defined earlier. To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) be the prob. that j inhibits i. Then
Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)

Now define w(i,j) = -ln q(i,j) and rho(x) = 1 - exp(-x). Then

Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))

For a sigmoid node, we have

Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))

where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of the activation function (although both are monotonically increasing). In addition, in the case of a noisy-or, the weights are constrained to be positive, since they derive from probabilities q(i,j).
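A quick plot makes the comparison of the two activation functions concrete (illustrative only):

% Compare the noisy-OR activation rho(x) = 1 - exp(-x) with the sigmoid.
x = linspace(0, 5, 100);
rho = 1 - exp(-x);
sig = 1 ./ (1 + exp(-x));
plot(x, rho, '-', x, sig, '--');
legend('noisy-OR: 1-exp(-x)', 'sigmoid: 1/(1+exp(-x))');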
We can create a random QMR-like model as follows. (We are not allowed to publish the actual parameters, since they are proprietary.)
pMax = 0.01;
Nfindings = 10;
Ndiseases = 5;
N = Nfindings + Ndiseases;
findings = Ndiseases+1:N;
diseases = 1:Ndiseases;

% G(i,j) = 1 iff there is an arc from disease i to finding j
G = zeros(Ndiseases, Nfindings);
for i=1:Nfindings
  v = rand(1,Ndiseases);
  rents = find(v<0.8);
  if (length(rents)==0)
    rents = ceil(rand(1)*Ndiseases);
  end
  G(rents,i) = 1;
end

% prior(i) = prob. disease i is on
prior = pMax*rand(1,Ndiseases);

% leak(j) = inhibition prob. on leak->j arc
leak = 0.5*rand(1,Nfindings);   % in real QMR, leak approx exp(-0.02) = 0.98
%leak = ones(1,Nfindings);      % turns off leaks

% inhibit(i,j) = inhibition probability on i->j arc
inhibit = rand(Ndiseases, Nfindings);
inhibit(not(G)) = 1;

ns = 2*ones(1,N);
dag = zeros(N,N);
dag(diseases, findings) = G;
bnet = mk_bnet(dag, ns);

for d=1:Ndiseases
  CPT = [1-prior(d) prior(d)];
  bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
end

for i=1:Nfindings
  fnode = findings(i);
  bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(parents(G,i), i));
end

Let's visualize this graph structure using code contributed by Ali Taylan Cemgil from the University of Nijmegen:
draw_layout(bnet.dag);
Now let us put some random evidence on all the leaves except the very first and very last, and compute the disease posteriors.
pos = 2:floor(Nfindings/2);
neg = (pos(end)+1):(Nfindings-1);
onodes = findings(myunion(pos, neg));
evidence = cell(1, N);
evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));

engine = jtree_inf_engine(bnet, onodes);
[engine, ll] = enter_evidence(engine, evidence);

post = zeros(1, Ndiseases);
for i=diseases(:)'
  m = marginal_nodes(engine, i);
  post(i) = m.T(2);
end

The junction tree algorithm is quite slow on the QMR network, since the cliques are so big. One simple trick we can use is to notice that hidden leaves do not affect the posteriors on the roots, and hence do not need to be included in the network. A second trick is to notice that the negative findings can be "absorbed" into the prior: see the file BNT/examples/static/mk_minimal_qmr_bnet for details. A much more significant speedup is obtained by exploiting special properties of the noisy-or node, as done by the quickscore algorithm. For details, see
engine = quickscore_inf_engine(inhibit, leak, prior);
engine = enter_evidence(engine, pos, neg);
m = marginal_nodes(engine, i);

Even using quickscore, exact inference takes time that is exponential in the number of positive findings. Hence for large networks we need to resort to approximate inference techniques. See for example
Notice that, as we perform the innermost sums, we create new terms, which need to be summed over in turn, e.g.,

P(W=w) = sum_c P(c) sum_s P(s|c) T1(c,s)

where

T1(c,s) = sum_r P(r|c) P(w|s,r)

Continuing in this way,

P(W=w) = sum_c P(c) T2(c)

where

T2(c) = sum_s P(s|c) T1(c,s)
This algorithm is called Variable Elimination. For details, see
The principle of distributing sums over products can be generalized greatly to apply to any commutative semiring. This forms the basis of many common algorithms, such as Viterbi decoding and the Fast Fourier Transform. For details, see
The amount of work we perform when computing a marginal is bounded by the size of the largest term that we encounter. Choosing a summation (elimination) ordering to minimize this is NP-hard, although greedy algorithms work well in practice. The implementation of variable elimination in BNT makes no attempt to find a good ordering; however, the junction tree code implements a greedy heuristic that could be transferred over.
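As a concrete illustration, here is a hedged sketch that carries out the elimination above for the sprinkler network, reusing the CPTs from the earlier example (plain Matlab, not BNT's var_elim engine; the CPTs are redefined so the snippet is self-contained):

% Variable elimination for P(W=2) in the sprinkler network: eliminate R, then S, then C.
priorC = [0.5 0.5];
CPT_S  = reshape([0.5 0.9 0.5 0.1], [2 2]);                  % P(S=s|C=c)
CPT_R  = reshape([0.8 0.2 0.2 0.8], [2 2]);                  % P(R=r|C=c)
CPT_W  = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);  % P(W=w|S=s,R=r)

w = 2;                      % query value: grass is wet
T1 = zeros(2,2);            % T1(c,s) = sum_r P(r|c) P(w|s,r)
for c=1:2
  for s=1:2
    T1(c,s) = sum(CPT_R(c,:) .* CPT_W(s,:,w));
  end
end
T2 = sum(CPT_S .* T1, 2);   % T2(c) = sum_s P(s|c) T1(c,s)
pW = priorC * T2            % P(W=2) = sum_c P(c) T2(c); should come out to about 0.6471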
The amount of work done as each stage is specified -- structure, parameters, evidence, and query -- depends on the engine. The cost of work done early in this sequence can be amortized. On the other hand, one can make better optimizations when one knows what the parameters/evidence/query is. For example, the parameters might imply conditional independencies that are not evident in the graph structure, but can nevertheless be exploited; the evidence indicates which nodes are observed and hence can effectively be disconnected from the graph; and the query might indicate that large parts of the network are d-separated from the query nodes. (Since it is not the actual *values* of the evidence that matter, just which nodes are observed, many engines allow you to specify this when they are constructed, i.e., before calling 'enter_evidence'. Some engines can still cope if the actual pattern of evidence is different, e.g., if there is missing data.)
One important distinction is whether marginal_nodes is fast or not, since this is the function that is called most frequently. For example, when learning, we need to call marginal_nodes n times, where n is the number of nodes. Variable elimination would end up repeating a lot of work each time marginal_nodes is called, making it inefficient for learning. The junction tree algorithm, by contrast, calculates all marginals in two passes during 'enter_evidence', so calling 'marginal_nodes' takes constant time.
The inference engines differ in many other important ways. Here are some of the major "axes":
The node types deserve elaboration. When all the hidden nodes are discrete or Gaussian, we can perform exact (closed form) inference. If there are both discrete and Gaussian hidden nodes, we require that there not be any arcs from hidden continuous nodes to discrete (hidden or observed) nodes, i.e., no C->D arcs. To handle non-discrete, non-Gaussian distributions, we usually have to resort to sampling. However, even when we can solve the problem in closed form, we often need to use approximation algorithms for reasons of computational efficiency. We may also want to place additional restrictions to make the implementation simpler. The classes of restrictions we identify are summarized in the table below.
Here is a list of the engines for static networks. Click on the name of the engine for more details, including references. We specify what processing happens when the model is specified, the evidence is specified, and the query is specified.
Name | Exact? | Node type | Topology | Evidence | Model proc. | Evidence proc. | Query proc. |
--- | --- | --- | --- | --- | --- | --- | --- |
cond_gauss | exact | CG | all | any | compute joint | condition joint | marginalize joint |
enumerative | exact | allD | all | any | none | none | marginalize joint |
gaussian | exact | allG | all | any | compute joint | condition joint | marginalize joint |
jtree | exact | D,G,CG | all | any | find elim order | propagate msgs | marginalize clique |
jtree_fast | exact | D | all | fixed | find elim order | propagate msgs | marginalize clique |
jtree_onepass | exact | D,G,CG | any | all | find elim order | propagate msgs | lookup |
jtree_onepass_fast | exact | D | fixed | all | find elim order | propagate msgs | lookup |
likelihood_weighting | approx | any | all | any | none | sample | lookup |
loopy_pearl | approx | D,G | all | any | none | propagate msgs | lookup |
pearl | exact | D,G | polytree | any | none | propagate msgs | lookup |
quickscore | exact | noisy-or | QMR | leaves | none | marginalize root nodes | lookup |
var_elim | exact | D,G,CG | all | any | none | none | nested marginalizations |
Documentation last updated: 22 May 2000