An Introduction to Bayesian Networks

For a non-technical introduction, read this LA times article (10/28/96). For a somewhat more technical introduction, see below. For the gory details, see the AUAI homepage (Association for Uncertainty in Artificial Intelligence), or my list of recommended reading at the end. The UAI mailing list is now archived and provides up-to-date information. I also maintain a list of free BN software packages.

"Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering -- uncertainty and complexity -- and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity -- a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables and a data structure that lends itself naturally to the design of efficient general-purpose algorithms.

Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism -- examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages -- in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems." --- M. Jordan, 1998.

In the following sections, I will give a brief introduction to directed graphical models, also called Bayesian networks or belief networks. In particular, I will explain three main aspects of BNs: representation, inference, and learning.

Representation

Bayesian Networks (also called probabilistic networks or belief networks) are a graphical way of representing (in)dependence relationships. Nodes in the graph represent random variables, and we draw an arc from A to B if A "directly influences" B (we will give the formal semantics later). Here is a simple example.

In this picture (from Russell and Norvig, 1995), we see that the event "grass is wet" (W=true) has two possible causes: either the water sprinkler is on (S=true) or it is raining (R=true). The probability that the W node is true depends on the state of W's parents. This conditional probability distribution (CPD) can be specified in tabular form as shown in the picture. For example, we see that Pr(W=true | S=true, R=false) = 0.9 (and hence, Pr(W=false | S=true, R=false) = 1 - 0.9 = 0.1). Since the C node has no parents, its CPD specifies the prior probability that it is cloudy.

The formal independence assumption encoded in the graph is that a node is conditionally independent of all its non-descendants given its parents: this is called the directed Markov property (see Pearl, 1988). This means that if we topologically order the nodes, we can represent the joint probability distribution (JPD) as a product of local factors using the chain rule. For example, referring to sprinkler network,

Pr(C=c, S=s, R=r, W=w) = Pr(C=c) * Pr(S=s|C=c) * Pr(R=r|C=c) * Pr(W=w|S=s,R=r)
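This factorization is easy to code directly. Here is a minimal sketch in Python; note that only Pr(W=t|S=t,R=f)=0.9 is quoted in the text above, so the remaining CPT values are assumptions taken from the standard Russell and Norvig sprinkler figure.

```python
from itertools import product

# CPTs for the sprinkler network. Only Pr(W=t|S=t,R=f)=0.9 appears in the
# text; the other numbers are the usual ones from the Russell & Norvig figure.
p_c = {True: 0.5, False: 0.5}   # Pr(C=c)
p_s = {True: 0.1, False: 0.5}   # Pr(S=t | C=c), keyed by c
p_r = {True: 0.8, False: 0.2}   # Pr(R=t | C=c), keyed by c
p_w = {(True, True): 0.99, (True, False): 0.9,   # Pr(W=t | S=s, R=r)
       (False, True): 0.9, (False, False): 0.0}

def bern(p, v):
    """Probability of a binary outcome v given Pr(true) = p."""
    return p if v else 1 - p

def joint(c, s, r, w):
    # chain rule: Pr(C) * Pr(S|C) * Pr(R|C) * Pr(W|S,R)
    return p_c[c] * bern(p_s[c], s) * bern(p_r[c], r) * bern(p_w[(s, r)], w)

# sanity check: the 16 joint entries must sum to 1
total = sum(joint(*vals) for vals in product([True, False], repeat=4))
print(total)  # -> 1.0
```

The four small tables hold 9 numbers in total, versus 15 free parameters for an explicit table over the 16 joint states.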

Dynamic Bayesian Networks (DBNs)

When the random variables change over time (a stochastic process), we use a Dynamic Bayesian Network (DBN). Kalman filter models and Hidden Markov Models (HMMs) are special cases of DBNs in which we assume there is a single (possibly vector-valued) state variable. When there are many loosely-coupled discrete state variables, DBNs are a more efficient way of representing the system. If, however, there is a single variable, which undergoes a sequence of state transitions (as in a stochastic automaton), HMMs are a better choice.

A generic DBN model has two components: The state evolution model describes Pr(X(t) | X(t-1)), and the observation model describes Pr(O(t) | X(t)). In the case of a Kalman filter model, we assume the state evolution model and observation models are linear functions subject to Gaussian noise. In the case of an HMM, the system can have arbitrary, non-linear dynamics, although the number of parameters, O(|X_t|^2), is quadratic in the number of states -- and the number of states is itself exponential in the number of underlying state variables. A DBN provides a means to represent the transition and observation functions in a compact form (i.e., using fewer parameters). This can make inference and learning more efficient (less time and less data, respectively).

Inference

The most common task we wish to solve using Bayesian networks is probabilistic inference. For example, suppose we observe the fact that the grass is wet. There are two possible causes for this: either it is raining, or the sprinkler is on. Which is more likely? We can use Bayes' rule to compute the posterior probability of each explanation.

Pr(S=t | W=t) = Pr(S=t, W=t) / Pr(W=t) = sum_{c, r} Pr(C=c,S=t,R=r,W=t) / Pr(W=t) = 0.2781 / 0.6471
Pr(R=t | W=t) = Pr(R=t, W=t) / Pr(W=t) = sum_{c, s} Pr(C=c,S=s,R=t,W=t) / Pr(W=t) = 0.4581 / 0.6471
where
Pr(W=t) = sum_{c, r, s} Pr(C=c,S=s,R=r,W=t) = 0.6471
is a normalizing constant, equal to the probability (likelihood) of the data.

So we see that it is more likely that the grass is wet because it is raining: the likelihood ratio is 0.4581/0.2781 = 1.647. Notice that the two causes "compete" to "explain" the observed data. Hence S and R become conditionally dependent given that their common child, W, is observed, even though they are marginally independent.
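The posteriors above can be reproduced by brute-force enumeration over the joint. A sketch; as before, the CPT values other than Pr(W=t|S=t,R=f)=0.9 are assumptions taken from the figure.

```python
from itertools import product

# Sprinkler CPTs (values from the Russell & Norvig figure; only
# Pr(W=t|S=t,R=f)=0.9 is quoted in the text itself).
p_c = {True: 0.5, False: 0.5}
p_s = {True: 0.1, False: 0.5}   # Pr(S=t | C=c)
p_r = {True: 0.8, False: 0.2}   # Pr(R=t | C=c)
p_w = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}

def bern(p, v):
    return p if v else 1 - p

def joint(c, s, r, w):
    return p_c[c] * bern(p_s[c], s) * bern(p_r[c], r) * bern(p_w[(s, r)], w)

tf = [True, False]
# normalizing constant Pr(W=t): sum out C, S, R
p_w_t = sum(joint(c, s, r, True) for c, s, r in product(tf, repeat=3))
# posteriors of each explanation
p_s_given_w = sum(joint(c, True, r, True) for c, r in product(tf, repeat=2)) / p_w_t
p_r_given_w = sum(joint(c, s, True, True) for c, s in product(tf, repeat=2)) / p_w_t
print(round(p_w_t, 4), round(p_s_given_w, 4), round(p_r_given_w, 4))
# -> 0.6471 0.4298 0.7079, matching the numbers above
```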

In the above example, we had evidence of an effect (wet grass), and inferred the most likely cause. This is called diagnostic reasoning: from effects to causes ("bottom up"), and is a common task in expert systems. Bayes nets can also be used for causal "top down" reasoning. For example, we can compute the probability that the grass will be wet given that it is cloudy. In this case, we must be careful not to double-count evidence: the information from S and R flowing into W is not independent, because it came from a common cause, C.

In a DBN, the goal is to compute the probability of the hidden state X_t given all the evidence so far, i.e., Pr(X_t | E_1, ..., E_t) (filtering), or (in off-line mode) given all the evidence i.e., Pr(X_t | E_1, ..., E_T) for T>t (smoothing). We might also be interested in predicting the future, Pr(X_T | E_1, ..., E_t).
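For an HMM, the simplest DBN, filtering is a two-step recursion: predict with the transition model, then reweight by the observation likelihood and renormalize. A minimal sketch; the two-state "weather" model and all its probabilities below are invented for illustration, not taken from the text.

```python
# Hypothetical 2-state HMM: X_t in {0, 1} = {rain, dry}; O_t = umbrella seen?
# All numbers here are made up for illustration.
A = [[0.7, 0.3],      # A[i][j] = Pr(X_t = j | X_{t-1} = i)
     [0.3, 0.7]]
p_obs = [0.9, 0.2]    # Pr(O_t = umbrella | X_t = i)

def filter_step(belief, umbrella):
    # predict: Pr(X_t | E_1..E_{t-1}) = sum_i belief[i] * A[i][j]
    pred = [sum(belief[i] * A[i][j] for i in range(2)) for j in range(2)]
    # update: weight by the observation likelihood, then renormalize
    w = [pred[j] * (p_obs[j] if umbrella else 1 - p_obs[j]) for j in range(2)]
    z = sum(w)
    return [x / z for x in w]

belief = [0.5, 0.5]              # prior Pr(X_0)
for obs in [True, True]:         # umbrella observed on two consecutive days
    belief = filter_step(belief, obs)
print(round(belief[0], 3))       # -> 0.883: rain is very likely
```

Smoothing adds an analogous backward pass over the stored beliefs; prediction is just the predict step applied repeatedly with no update.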

Efficient computation of marginal probabilities

Once we have the complete joint probability distribution (JPD), we can answer all possible inference queries by marginalization (summing out over irrelevant variables), as illustrated above. However, the JPD has size O(2^n), where n is the number of nodes, and we have assumed each node can have 2 states. Hence summing over the JPD takes exponential time.

However, we can exploit the factored nature of the JPD to compute marginals more efficiently. The key idea is "push sums in" as far as possible when summing (marginalizing) out irrelevant terms, e.g.,

Pr(W=w) = sum_c sum_s sum_r Pr(C=c) * Pr(S=s|C=c) * Pr(R=r|C=c) * Pr(W=w|S=s,R=r)
Pr(W=w) = sum_c Pr(C=c) * sum_s Pr(S=s|C=c) * sum_r Pr(R=r|C=c) * Pr(W=w|S=s,R=r)

Notice that, as we perform the innermost sums, we create new terms, which need to be summed over in turn e.g.,

Pr(W=w) = sum_c Pr(C=c) * sum_s Pr(S=s|C=c) * T1(c,w,s)
where
T1(c,w,s) = sum_r Pr(R=r|C=c) * Pr(W=w|S=s,R=r)
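The elimination order above (R first, then S, then C) can be written out term by term. A sketch; the CPT values other than Pr(W=t|S=t,R=f)=0.9 are again assumptions from the figure.

```python
# Eliminate R first (building T1), then S, then C, exactly as in the
# equations above.
p_c = {True: 0.5, False: 0.5}
p_s = {True: 0.1, False: 0.5}   # Pr(S=t | C=c)
p_r = {True: 0.8, False: 0.2}   # Pr(R=t | C=c)
p_w = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}

def bern(p, v):
    return p if v else 1 - p

tf = [True, False]
w = True  # query Pr(W=t)

# T1(c, w, s) = sum_r Pr(R=r|C=c) * Pr(W=w|S=s,R=r)
T1 = {(c, s): sum(bern(p_r[c], r) * bern(p_w[(s, r)], w) for r in tf)
      for c in tf for s in tf}

# T2(c, w) = sum_s Pr(S=s|C=c) * T1(c, w, s)
T2 = {c: sum(bern(p_s[c], s) * T1[(c, s)] for s in tf) for c in tf}

p_w_t = sum(p_c[c] * T2[c] for c in tf)
print(round(p_w_t, 4))  # -> 0.6471, the same answer as summing the full JPD
```

The largest intermediate term, T1, involves only three variables, whereas brute-force summation touches all four at once; on bigger networks this gap is what makes elimination tractable.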

A convenient way to handle the book-keeping arising from all these intermediate terms is by using the bucket elimination algorithm (see the JavaBayes homepage for a good explanation and an implementation you can play with).

The amount of work we perform when computing a marginal is bounded by the size of the largest term that we encounter. Choosing a summation (elimination) ordering to minimize this is NP-hard, although greedy algorithms work well in practice (see Kjaerulff, 1990).

If we wish to compute several marginals at the same time, we can use Dynamic Programming to avoid repeated computation. If the underlying undirected graph of the BN is acyclic (i.e., a tree), we can use a local message passing algorithm due to Pearl (see Peot and Shachter, 1991). In the case of DBNs with the structure shown above (which is acyclic), Pearl's algorithm is equivalent to the forwards-backwards algorithm for HMMs, or to the Rauch-Tung-Striebel (RTS) smoother for linear dynamical systems (see Ghahramani, 1997).

If the BN has undirected cycles (as in the sprinkler example), we need to first create an auxiliary data-structure, called a join tree, and then apply the DP algorithm to the join tree (see Jensen, 1996 and Shafer and Lauritzen, 1996). The nodes of a join tree are subsets of nodes of the original BN, and satisfy the following property: if x is a member of join tree nodes i and j, then x must be a member of every node on the path between i and j. This property ensures that local propagation of information leads to global consistency.

We can create the join tree by moralizing and triangulating the BN graph (see Kjaerulff, 1990), and then finding the maximum cliques. Triangulation ensures that there are no chordless cycles of length greater than three. Intuitively, triangulation connects together those nodes that feature in a common term when summing out; in the example above, we connect C to W and get the clique CWS, corresponding to the intermediate term T1(c,w,s). The process of moralization connects together unmarried parents who share a common child: this ensures there will be a clique that contains terms of the form Pr(W|S,R).
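Moralization itself is a small graph operation. A sketch; the dict-of-parents encoding is just one convenient representation.

```python
from itertools import combinations

def moralize(parents):
    """Moralize a DAG given as {node: [parents]}: drop edge directions and
    connect ('marry') every pair of parents that share a child."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))   # undirected version of p -> child
        for p1, p2 in combinations(ps, 2):
            edges.add(frozenset((p1, p2)))     # marry co-parents
    return edges

sprinkler = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}
moral = moralize(sprinkler)
print(frozenset(("S", "R")) in moral)  # -> True: S and R are married via W
```

Triangulation then adds fill-in edges to this undirected graph as variables are eliminated (C-W in the example above), and the maximal cliques of the result become the join tree nodes.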

Learning

One needs to specify two things to describe a BN: the graph topology (structure) and the parameters of each CPD. It is possible to learn both of these from data. The basic idea is to use maximum likelihood (ML) to find the parameters, and penalized ML to find the best structure (to avoid overfitting) (see Heckerman, 1996 and Krause, 1998). The penalty can be expressed as a prior, which says simpler models are more likely, or in terms of the Minimum Description Length principle.

If the value of every node is observed (the complete data case), finding the ML parameters amounts to counting, e.g.,

Pr(W=w|S=s,R=r) ≈ N(W=w,S=s,R=r) / N(S=s,R=r)

where N(W=w,S=s,R=r) is the number of times this event occurs in the training set. (In the case of Gaussian distributions, the sufficient statistics are the sample mean and variance.) We can find the MAP structure by local search through model space.
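The counting estimate is easy to check by simulation: draw complete samples ancestrally from a known network, then recover a CPT entry by counting. A sketch; the CPT values are the assumed figure values and the sample size is arbitrary.

```python
import random

random.seed(0)
# Sprinkler CPTs (figure values; only Pr(W=t|S=t,R=f)=0.9 is in the text).
p_s = {True: 0.1, False: 0.5}   # Pr(S=t | C=c)
p_r = {True: 0.8, False: 0.2}   # Pr(R=t | C=c)
p_w = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}

def sample():
    # ancestral sampling: visit parents before children
    c = random.random() < 0.5
    s = random.random() < p_s[c]
    r = random.random() < p_r[c]
    w = random.random() < p_w[(s, r)]
    return c, s, r, w

data = [sample() for _ in range(20000)]
n_wsr = sum(1 for _, s, r, w in data if s and not r and w)  # N(W=t,S=t,R=f)
n_sr = sum(1 for _, s, r, _ in data if s and not r)         # N(S=t,R=f)
print(round(n_wsr / n_sr, 2))  # close to the true value, 0.9
```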

If we have missing information, we need to perform inference to compute the expected sufficient statistics. We can then use EM to find the ML parameters, and SEM to find the MAP topology (see Friedman, 1997).

The key advantage of BNs is that they have a small number of parameters, and hence require less data to learn. For example, we can specify the joint distribution over n nodes using only O(n * 2^k) parameters (where k is the maximum fan-in of any node), instead of the O(2^n) which would be required for an explicit representation.
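For the sprinkler network this saving is already visible. A quick count, using the network structure from the example above:

```python
# Free parameters per binary node: one number per configuration of its parents.
parents = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}
bn_params = sum(2 ** len(ps) for ps in parents.values())  # 1 + 2 + 2 + 4
full_joint = 2 ** len(parents) - 1  # explicit table over 4 binary variables
print(bn_params, full_joint)  # -> 9 15
```

The gap widens rapidly: with n binary nodes and fan-in at most k, the BN needs at most n * 2^k numbers while the explicit table needs 2^n - 1.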

Recommended reading

These papers are what I consider to be some of the more readable entry points to the field. Most of them are available online: see the AUAI homepage or do a web search (sorry, I don't have space to store them all!).