"Graphical models are a marriage of between probability theory and
graph theory. They provide a natural tool for dealng with two problems
that occur throughout applied mathematics and engineering --
uncertainty and complexity -- and in particular they are playing an
increasingly important role in the design and analysis of machine
learning algorithms. Fundamental to the idea of a graphical model is
the notion of modularity -- a complex system is built by combining
simpler parts. Probability theory provides the glue whereby the parts
are combined, ensuring that the system as a whole is consistent, and
providing ways to interface models to data. The graph theoretic side
of graphical models provides both an intuitively appealing interface
by which humans can model highly-interacting sets of variables as well
as a data structure that lends itself naturally to the design of
efficient general-purpose algorithms.
Many of the classical multivariate probabilistic systems studied in
fields such as statistics, systems engineering, information theory,
pattern recognition and statistical mechanics are special cases of the
general graphical model formalism -- examples include mixture models,
factor analysis, hidden Markov models, Kalman filters and Ising
models. The graphical model framework provides a way to view all of
these systems as instances of a common underlying formalism. This view
has many advantages -- in particular, specialized techniques that have
been developed in one field can be transferred between research
communities and exploited more widely. Moreover, the graphical model
formalism provides a natural framework for the design of new systems."
In the following sections, we will give a brief introduction to directed graphical models, also called Bayesian networks or belief networks. In particular, we will explain three main aspects of BNs: representation, inference, and learning.
In this picture (from Russell and Norvig, 1995), we see that the event "grass is wet" (W=true) has two possible causes: either the water sprinkler is on (S=true) or it is raining (R=true). The probability that the W node is true depends on the state of W's parents. This conditional probability distribution (CPD) can be specified in tabular form as shown in the picture. For example, we see that Pr(W=true | S=true, R=false) = 0.9 (and hence, Pr(W=false | S=true, R=false) = 1 - 0.9 = 0.1). Since the C node has no parents, its CPD specifies the prior probability that it is cloudy.
The formal independence assumption encoded in the graph is that a node is conditionally independent of all its non-descendants given its parents: this is called the directed Markov property (see Pearl, 1988). This means that if we topologically order the nodes, we can represent the joint probability distribution (JPD) as a product of local factors using the chain rule. For example, referring to the sprinkler network,
Pr(C=c, S=s, R=r, W=w) = Pr(C=c) * Pr(S=s|C=c) * Pr(R=r|C=c) * Pr(W=w|S=s,R=r)
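As a concrete sketch (in Python; not part of the original text), the CPDs of the sprinkler network can be stored as small tables and the chain-rule factorization above evaluated directly. Only Pr(W=t | S=t, R=f) = 0.9 is quoted explicitly above; the remaining numbers are the standard values from the Russell and Norvig figure, and they agree with the marginals computed later in this section.

# Sprinkler network CPDs as plain dictionaries over {True, False}.
P_C = {True: 0.5, False: 0.5}                    # Pr(C)
P_S = {True: {True: 0.1, False: 0.9},            # Pr(S | C=t)
       False: {True: 0.5, False: 0.5}}           # Pr(S | C=f)
P_R = {True: {True: 0.8, False: 0.2},            # Pr(R | C=t)
       False: {True: 0.2, False: 0.8}}           # Pr(R | C=f)
P_W = {(True, True): {True: 0.99, False: 0.01},  # Pr(W | S, R)
       (True, False): {True: 0.9, False: 0.1},
       (False, True): {True: 0.9, False: 0.1},
       (False, False): {True: 0.0, False: 1.0}}

def joint(c, s, r, w):
    # Chain rule: Pr(C,S,R,W) = Pr(C) * Pr(S|C) * Pr(R|C) * Pr(W|S,R)
    return P_C[c] * P_S[c][s] * P_R[c][r] * P_W[(s, r)][w]

print(joint(True, False, True, True))  # 0.5 * 0.9 * 0.8 * 0.9 = 0.324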
A generic DBN model has two components: the state evolution (transition) model describes Pr(X(t) | X(t-1)), and the observation model describes Pr(O(t) | X(t)). In the case of a Kalman filter model, we assume the state evolution and observation models are linear functions subject to Gaussian noise. In the case of an HMM, the state is discrete, so the dynamics can be arbitrary and non-linear, but the number of parameters in the tabular transition model grows exponentially with the number of discrete state variables in X_t. A DBN provides a means to represent the transition and observation functions in a compact form (i.e., using fewer parameters). This can make inference and learning more efficient (less time and less data, respectively).
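To make the two special cases concrete, here is a minimal sketch (Python with NumPy; all matrix values are illustrative placeholders, not taken from the text) of a tabular HMM transition/observation model and a linear-Gaussian (Kalman filter) one:

import numpy as np

rng = np.random.default_rng(0)

# HMM: discrete state, tabular models.
A = np.array([[0.9, 0.1],   # A[i, j] = Pr(X_t = j | X_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],   # B[i, k] = Pr(O_t = k | X_t = i)
              [0.1, 0.9]])
belief = np.array([1.0, 0.0])      # Pr(X_{t-1})
belief_next = belief @ A           # predicted Pr(X_t)

# Kalman filter: continuous state, linear-Gaussian models.
F = np.array([[1.0, 1.0],          # X_t = F X_{t-1} + noise, noise ~ N(0, Q)
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])         # O_t = H X_t + noise, noise ~ N(0, R)
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
x_prev = np.array([0.0, 1.0])
x_next = F @ x_prev + rng.multivariate_normal(np.zeros(2), Q)  # one sampled transition

print(belief_next, x_next)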
Pr(S=t | W=t) = Pr(S=t, W=t) / Pr(W=t)
= sum_{c, r} Pr(C=c,S=t,R=r,W=t) / Pr(W=t) = 0.2781 / 0.6471 = 0.430
Pr(R=t | W=t) = Pr(R=t, W=t) / Pr(W=t)
= sum_{c, s} Pr(C=c,S=s,R=t,W=t) / Pr(W=t) = 0.4581 / 0.6471 = 0.708
where
Pr(W=t) = sum_{c, r, s} Pr(C=c,S=s,R=r,W=t) = 0.6471
is a normalizing constant, equal to the probability (likelihood) of
the data.
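These numbers can be reproduced by brute-force enumeration of the joint; the sketch below reuses the CPT tables and the joint() function from the earlier code snippet:

from itertools import product

states = [True, False]
p_w = sum(joint(c, s, r, True) for c, s, r in product(states, repeat=3))
p_sw = sum(joint(c, True, r, True) for c, r in product(states, repeat=2))
p_rw = sum(joint(c, s, True, True) for c, s in product(states, repeat=2))

print(p_w)         # 0.6471
print(p_sw / p_w)  # Pr(S=t | W=t) ~= 0.430
print(p_rw / p_w)  # Pr(R=t | W=t) ~= 0.708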
So we see that it is more likely that the grass is wet because it is raining: the likelihood ratio is 0.4581/0.2781 = 1.647. Notice that the two causes "compete" to "explain" the observed data: S and R, which are conditionally independent given C, become dependent once their common child, W, is observed. This is known as the "explaining away" effect.
In the above example, we had evidence of an effect (wet grass) and inferred the most likely cause. This is called diagnostic, or "bottom up", reasoning, since it goes from effects to causes; it is a common task in expert systems. Bayes nets can also be used for causal, "top down", reasoning. For example, we can compute the probability that the grass will be wet given that it is cloudy. In this case, we must be careful not to double-count evidence: the information from S and R flowing into W is not independent, because it came from a common cause, C.
In a DBN, the goal is to compute the probability of the hidden state X_t given all the evidence so far, i.e., Pr(X_t | E_1, ..., E_t) (filtering), or (in off-line mode) given all the evidence, i.e., Pr(X_t | E_1, ..., E_T) for T > t (smoothing). We might also be interested in predicting the future, Pr(X_T | E_1, ..., E_t).
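For the discrete (HMM) special case, filtering can be implemented with the standard forward recursion; below is a minimal sketch (Python with NumPy; the transition/observation matrices and the observation sequence are illustrative, not from the text):

import numpy as np

def forward_filter(A, B, prior, observations):
    # A[i, j] = Pr(X_t = j | X_{t-1} = i), B[i, k] = Pr(E_t = k | X_t = i),
    # prior[i] = Pr(X_1 = i), observations = list of integer symbols.
    belief = prior * B[:, observations[0]]
    belief = belief / belief.sum()              # normalize to get Pr(X_1 | E_1)
    beliefs = [belief]
    for e in observations[1:]:
        belief = (belief @ A) * B[:, e]         # predict with the transition model, then condition on E_t
        belief = belief / belief.sum()
        beliefs.append(belief)
    return np.array(beliefs)                    # row t is Pr(X_t | E_1, ..., E_t)

A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
print(forward_filter(A, B, np.array([0.5, 0.5]), [0, 1, 1]))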
However, we can exploit the factored nature of the JPD to compute marginals more efficiently. The key idea is "push sums in" as far as possible when summing (marginalizing) out irrelevant terms, e.g.,
Pr(W=w) = sum_c sum_s sum_r Pr(C=c) * Pr(S=s|C=c) * Pr(R=r|C=c) * Pr(W=w|S=s,R=r)
Pr(W=w) = sum_c Pr(C=c) * sum_s Pr(S=s|C=c) * sum_r Pr(R=r|C=c) * Pr(W=w|S=s,R=r)
Notice that, as we perform the innermost sums, we create new terms, which need to be summed over in turn e.g.,
Pr(W=w) = sum_c Pr(C=c) * sum_s Pr(S=s|C=c) * T1(c,w,s)
where
T1(c,w,s) = sum_r Pr(R=r|C=c) * Pr(W=w|S=s,R=r)
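Concretely, the pushed-in computation for the sprinkler network looks like this (reusing the CPT tables from the first sketch; the intermediate term T1 corresponds exactly to the innermost sum):

states = [True, False]

def prob_w(w):
    total = 0.0
    for c in states:
        inner = 0.0
        for s in states:
            # T1(c, w, s) = sum_r Pr(R=r | C=c) * Pr(W=w | S=s, R=r)
            t1 = sum(P_R[c][r] * P_W[(s, r)][w] for r in states)
            inner += P_S[c][s] * t1
        total += P_C[c] * inner
    return total

print(prob_w(True))  # 0.6471, matching the normalizing constant computed above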
A convenient way to handle the book-keeping arising from all these intermediate terms is by using the bucket elimination algorithm (see the JavaBayes homepage for a good explanation and an implementation you can play with).
The amount of work we perform when computing a marginal is bounded by the size of the largest term that we encounter. Choosing a summation (elimination) ordering to minimize this is NP-hard, although greedy algorithms work well in practice (see Kjaerulff, 1990).
If we wish to compute several marginals at the same time, we can use Dynamic Programming to avoid repeated computation. If the underlying undirected graph of the BN is acyclic (i.e. a tree), we can use a local message passing algorithm due to Pearl (see Peot and Shachter, 1991). In the case of DBNs with the structure shown above (which is acyclic), Pearl's algorithm is equivalent to the forwards-backwards algorithm for HMMs, or to the Kalman-RTS smoother for LDSs (see Ghahramani, 1997).
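For HMMs, the forwards-backwards algorithm mentioned here can be written compactly; the following is a sketch only (illustrative matrices, rescaled alpha/beta messages), not code from this tutorial:

import numpy as np

def forward_backward(A, B, prior, observations):
    # Smoothed marginals Pr(X_t | E_1, ..., E_T) for a discrete-state HMM.
    T, K = len(observations), len(prior)
    alpha = np.zeros((T, K))
    alpha[0] = prior * B[:, observations[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                               # forward (filtering) pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, observations[t]]
        alpha[t] /= alpha[t].sum()
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):                      # backward pass
        beta[t] = A @ (B[:, observations[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                                # combine and renormalize
    return gamma / gamma.sum(axis=1, keepdims=True)

A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
print(forward_backward(A, B, np.array([0.5, 0.5]), [0, 1, 1]))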
If the BN has undirected cycles (as in the sprinkler example), we need to first create an auxiliary data structure, called a join tree, and then apply the DP algorithm to the join tree (see Jensen, 1996 and Shafer and Lauritzen, 1996). The nodes of a join tree are subsets of nodes of the original BN, and satisfy the following property: if x is a member of join tree nodes i and j, then x must be a member of every node on the path between i and j. This property ensures that local propagation of information leads to global consistency.
We can create the join tree by moralizing and triangulating the BN graph (see Kjaerulff, 1990), and then finding the maximal cliques. Triangulation ensures that there are no chordless cycles of length four or more. Intuitively, triangulation connects together those nodes that feature in a common term when summing out; in the example above, we connect C to W and get the clique CWS, corresponding to the intermediate term T1(c,w,s). The process of moralization connects together "unmarried" parents who share a common child: this ensures there will be a clique that contains terms of the form Pr(W|S,R).
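The moralization step is easy to sketch in code (a toy implementation over the sprinkler DAG, not the algorithm used by any particular package):

# Moralize the sprinkler DAG: marry co-parents of each child, then drop edge directions.
parents = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}

undirected = set()
for child, pars in parents.items():
    for p in pars:                       # keep every original edge, now undirected
        undirected.add(frozenset((p, child)))
    for i, p in enumerate(pars):         # connect ("marry") parents that share this child
        for q in pars[i + 1:]:
            undirected.add(frozenset((p, q)))

print(sorted(tuple(sorted(e)) for e in undirected))
# [('C', 'R'), ('C', 'S'), ('R', 'S'), ('R', 'W'), ('S', 'W')]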
If the value of every node is observed (the complete data case), finding the ML parameters amounts to counting, e.g.,
Pr(W=w | S=s, R=r) ≈ N(W=w, S=s, R=r) / N(S=s, R=r)
where N(W=w,S=s,R=r) is the number of times this event occurs in the training set. (In the case of Gaussian distributions, the sufficient statistics are the sample mean and variance.) We can find the MAP structure by local search through model space.
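In code, this counting estimator is only a few lines; the data set below is a made-up toy example, included purely for illustration:

from collections import Counter

# Toy complete-data set of (s, r, w) triples (illustrative values only).
data = [(True, False, True), (True, False, True), (True, False, False),
        (False, True, True), (False, False, False), (True, True, True)]

joint_counts = Counter((s, r, w) for s, r, w in data)   # N(S=s, R=r, W=w)
parent_counts = Counter((s, r) for s, r, _ in data)     # N(S=s, R=r)

def mle_cpt(w, s, r):
    # ML estimate of Pr(W=w | S=s, R=r) obtained by counting.
    return joint_counts[(s, r, w)] / parent_counts[(s, r)]

print(mle_cpt(True, True, False))  # 2/3 in this toy data set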
If we have missing information, we need to perform inference to compute the expected sufficient statistics. We can then use EM to find the ML parameters, and Structural EM (SEM) to find the MAP topology (see Friedman, 1997).
The key advantage of BNs is that they have relatively few parameters, and hence require less data to learn. For example, for binary nodes we can specify the joint distribution over n variables using only O(n * 2^k) parameters (where k is the maximum fan-in of any node), instead of the O(2^n) that an explicit tabular representation of the joint would require.
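For example, with n = 20 binary nodes and a maximum fan-in of k = 3, the BN needs at most 20 * 2^3 = 160 parameters, whereas an explicit table over all 20 binary variables would have 2^20, roughly a million, entries.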