Child | Parent | Distribution
------|--------|------------------------------------------------
C     | C      | linear Gaussian
C     | D      | mixture of Gaussians
D     | D      | tabular (multinomial), noisy-or, several others
D     | C      | logistic (sigmoid), softmax

(C = continuous, D = discrete.)
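For instance, here is a minimal sketch (in Matlab, using the standard BNT constructors; the topology, node sizes and numbering are made up for illustration) that builds a small hybrid network exercising three of these child/parent combinations:

    % Hypothetical 3-node hybrid model:
    % 1 (discrete) -> 2 (continuous) -> 3 (discrete)
    N = 3;
    dag = zeros(N, N);
    dag(1,2) = 1;  dag(2,3) = 1;
    ns = [2 1 2];        % nodes 1 and 3 have 2 states; node 2 is a scalar Gaussian
    dnodes = [1 3];      % which nodes are discrete
    bnet = mk_bnet(dag, ns, 'discrete', dnodes);
    bnet.CPD{1} = tabular_CPD(bnet, 1);   % D node, no parents: tabular (random CPT by default)
    bnet.CPD{2} = gaussian_CPD(bnet, 2);  % C child of D parent: mixture of Gaussians
    bnet.CPD{3} = softmax_CPD(bnet, 3);   % D child of C parent: softmax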
However, whether BNT can do inference on the model depends on the details of how the nodes are connected, and on which nodes are observed (see below).
In the current implementation of BNT, you must use the junction tree algorithm for inference if your model is hybrid. (It is possible to extend Pearl's algorithm, and of course sampling algorithms, to the hybrid case, but this has not yet been implemented.)

When using the jtree algorithm, the fundamental data structure is the "potential", which is a way of representing joint probability distributions. There must be a way to convert each CPD to a potential, and to incorporate evidence into it. In addition, each potential must support the operations of multiplication, division and summation (marginalization). The following potentials are supported in BNT: Gaussian (G), table (T), and conditional Gaussian (CG). Note that we use "weak" marginalization for CG potentials, which is not exact. In addition, the logistic distribution is approximated by a Gaussian; we call this a variational Gaussian (VG) potential. (It is also possible to have a conditional VG (CVG) potential.) For details, see
Kevin Murphy. "A Variational Approximation for Bayesian Networks with Discrete and Continuous Latent Variables." UAI '99 (Uncertainty in AI).
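To make this concrete, here is a hedged sketch of running junction tree inference on the hypothetical hybrid model built above (all calls are standard BNT functions):

    engine = jtree_inf_engine(bnet);
    evidence = cell(1, N);
    evidence{2} = 0.5;                 % observe the continuous node
    [engine, loglik] = enter_evidence(engine, evidence);
    marg = marginal_nodes(engine, 1);  % posterior over the hidden discrete root
    disp(marg.T)                       % P(node 1 | evidence) as a table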
Each node can be discrete or continuous, hidden or observed. For a node Y with a single parent X, this yields 16 possible combinations. For each one, the table below lists the legal CPDs for the child, and the potential that will be used to represent the joint P(X, Y). When we say that a CPD must be tabular, any form that can be converted to a table (such as a decision tree, noisy-or, etc.) can be used instead. When we say that a CPD must be softmax, a logistic function can be used instead, if the child is binary. However, if we specify that the CPD must be logistic, then a softmax function cannot be used, since we do not know how to create a VG approximation to the softmax function.
Note that when the parent is observed, we need to represent P(Y | X = x), where x is the value of X implied by the evidence. When the child is observed, we need to represent P(Y = y | X). When both are observed, we need to represent P(Y = y | X = x), which is just a scalar. A scalar can be represented by a one-element table or by a Gaussian with a zero-dimensional mean and covariance; the value is encoded in the normalizing constant.
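In particular, when every node is observed, inference reduces to multiplying these scalars, and the result is the likelihood of the case; in BNT this is returned as the second output of enter_evidence. A sketch, continuing the example above:

    evidence = {1, 0.5, 2};            % fully observed case
    [engine, loglik] = enter_evidence(engine, evidence);
    % loglik = log P(node1 = 1, node2 = 0.5, node3 = 2)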
If a node has multiple continuous parents, they can be stacked together into a single vector-valued parent. If a node has multiple discrete parents, we can create a mixture distribution, with one component per joint setting of the discrete parents. In the current implementation, only discrete and Gaussian nodes can have multiple discrete parents. (Thus we cannot represent a mixture of softmaxes, although it would not be hard to add this feature.) A sketch of a node with both kinds of parents appears after the table below.
parent obs. | child obs. | parent disc. | child disc. | CPD             | potential
------------|------------|--------------|-------------|-----------------|----------
0           | 0          | 0            | 0           | linear Gaussian | G
0           | 0          | 0            | 1           | logistic        | CVG
0           | 0          | 1            | 0           | mix. Gaussian   | CG
0           | 0          | 1            | 1           | tabular         | T
0           | 1          | 0            | 0           | linear Gaussian | G
0           | 1          | 0            | 1           | logistic        | VG
0           | 1          | 1            | 0           | mix. Gaussian   | T
0           | 1          | 1            | 1           | tabular         | T
1           | 0          | 0            | 0           | linear Gaussian | G
1           | 0          | 0            | 1           | softmax         | T
1           | 0          | 1            | 0           | mix. Gaussian   | G
1           | 0          | 1            | 1           | tabular         | T
1           | 1          | 0            | 0           | linear Gaussian | scalar
1           | 1          | 0            | 1           | softmax         | scalar
1           | 1          | 1            | 0           | mix. Gaussian   | scalar
1           | 1          | 1            | 1           | tabular         | scalar
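As a hedged illustration of the multiple-parent case mentioned before the table, here is a sketch of a fresh model containing a conditional linear Gaussian node with one discrete and one continuous parent; the parameter values are arbitrary:

    % nodes: 1 (discrete, 2 states), 2 (continuous), 3 (continuous child of 1 and 2)
    dag = zeros(3, 3);
    dag(1,3) = 1;  dag(2,3) = 1;
    clg_bnet = mk_bnet(dag, [2 1 1], 'discrete', 1);
    clg_bnet.CPD{1} = tabular_CPD(clg_bnet, 1);
    clg_bnet.CPD{2} = gaussian_CPD(clg_bnet, 2);
    % one mean, variance and regression weight per value of the discrete parent
    clg_bnet.CPD{3} = gaussian_CPD(clg_bnet, 3, 'mean', [0 1], ...
                                   'cov', [1 1], 'weights', [0.5 -0.5]);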
Then we enter each case in the training set as evidence, and compute the posterior probability of each family (a node and its parents) in the network using an inference algorithm. These posteriors are passed to each node, so it can update its expected sufficient statistics (ESS).

Finally, we ask each node to compute the maximum likelihood estimate of its parameters given its ESS. For a Gaussian node, the MLE is computed using (weighted) linear regression; for a multinomial, the MLE is computed by normalizing the table of counts; and for a softmax, the MLE is computed using iteratively reweighted least squares (IRLS), as implemented in Netlab.
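A hedged sketch of this procedure in BNT, using the standard EM routine on the first hypothetical model above (the dataset is sampled from the model itself, purely for illustration):

    engine = jtree_inf_engine(bnet);
    ncases = 50;
    data = cell(N, ncases);        % data{i,m} = value of node i in case m ([] = hidden)
    for m = 1:ncases
      data(:, m) = sample_bnet(bnet);
    end
    data(1, :) = {[]};             % hide node 1 in every case, so EM has work to do
    max_iter = 10;
    [bnet2, LLtrace] = learn_params_em(engine, data, max_iter);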
The following papers spell out some of the details of these MLE calculations, including issues such as parameter tying and priors.