Child | Parent | Distribution
------|--------|------------------------------------------------
C     | C      | linear Gaussian
C     | D      | mixture of Gaussians
D     | D      | tabular (multinomial), noisy-or, several others
D     | C      | logistic (sigmoid), softmax

(C = continuous, D = discrete.)
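For instance, here is a minimal sketch (in Matlab, using the standard BNT constructors; the topology, node sizes and numbering are made up for illustration) that builds a small hybrid network exercising three of these child/parent combinations:

    % Hypothetical 3-node hybrid model:
    % 1 (discrete) -> 2 (continuous) -> 3 (discrete)
    N = 3;
    dag = zeros(N, N);
    dag(1,2) = 1;  dag(2,3) = 1;
    ns = [2 1 2];        % nodes 1 and 3 have 2 states; node 2 is a scalar Gaussian
    dnodes = [1 3];      % which nodes are discrete
    bnet = mk_bnet(dag, ns, 'discrete', dnodes);
    bnet.CPD{1} = tabular_CPD(bnet, 1);   % D node, no parents: tabular (random CPT by default)
    bnet.CPD{2} = gaussian_CPD(bnet, 2);  % C child of D parent: mixture of Gaussians
    bnet.CPD{3} = softmax_CPD(bnet, 3);   % D child of C parent: softmax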
However, whether BNT can do inference on the model depends on the details of how the nodes are connected, and on which nodes are observed (see below).
In the current implementation of BNT, you must use the junction tree algorithm for inference if your model is hybrid. (It is possible to extend Pearl's algorithm, and of course sampling algorithms, to the hybrid case, but this has not yet been implemented.)

When using the jtree algorithm, the fundamental data structure is the "potential", which is a way of representing joint probability distributions. There must be a way to convert each CPD to a potential, and to incorporate evidence into it. In addition, each potential must support the operations of multiplication, division and summation (marginalization). The following potentials are supported in BNT: Gaussian (G), table (T), and conditional Gaussian (CG). Note that we use "weak" marginalization for CG potentials, which is not exact. In addition, the logistic distribution is approximated by a Gaussian; we call this a variational Gaussian (VG) potential. (It is also possible to have a conditional VG (CVG) potential.) For details, see
Kevin Murphy. "A Variational Approximation for Bayesian Networks with Discrete and Continuous Latent Variables." UAI '99 (Uncertainty in AI).
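To make this concrete, here is a hedged sketch of running junction tree inference on the hypothetical hybrid model built above (all calls are standard BNT functions):

    engine = jtree_inf_engine(bnet);
    evidence = cell(1, N);
    evidence{2} = 0.5;                 % observe the continuous node
    [engine, loglik] = enter_evidence(engine, evidence);
    marg = marginal_nodes(engine, 1);  % posterior over the hidden discrete root
    disp(marg.T)                       % P(node 1 | evidence) as a table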
Each node can be discrete or continuous, hidden or observed. For a node Y with a single parent X, this yields 16 possible combinations. For each one, the table below lists the legal CPDs for the child, and the potential that will be used to represent the joint P(X, Y). When we say that a CPD must be tabular, any form that can be converted to a table (such as a decision tree, noisy-or, etc.) can be used instead. When we say that a CPD must be softmax, a logistic function can be used instead, if the child is binary. However, if we specify that the CPD must be logistic, then a softmax function cannot be used, since we do not know how to create a VG approximation to the softmax function.
Note that when the parent is observed, we need to represent P(Y | X = x), where x is the value of X implied by the evidence. When the child is observed, we need to represent P(Y = y | X). When both are observed, we need to represent P(Y = y | X = x), which is just a scalar. A scalar can be represented by a one-element table or by a Gaussian with a zero-dimensional mean and covariance; the value is encoded in the normalizing constant.
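In particular, when every node is observed, inference reduces to multiplying these scalars, and the result is the likelihood of the case; in BNT this is returned as the second output of enter_evidence. A sketch, continuing the example above:

    evidence = {1, 0.5, 2};            % fully observed case
    [engine, loglik] = enter_evidence(engine, evidence);
    % loglik = log P(node1 = 1, node2 = 0.5, node3 = 2)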
If a node has multiple continuous parents, they can be stacked together into a single vector-valued parent. If a node has multiple discrete parents, we can create a mixture distribution, with one component per joint setting of the discrete parents. In the current implementation, only discrete and Gaussian nodes can have multiple discrete parents. (Thus we cannot represent a mixture of softmaxes, although it would not be hard to add this feature.) A sketch of a node with both kinds of parents appears after the table below.
parent obs. | child obs. | parent disc. | child disc. | CPD             | potential
------------|------------|--------------|-------------|-----------------|----------
0           | 0          | 0            | 0           | linear Gaussian | G
0           | 0          | 0            | 1           | logistic        | CVG
0           | 0          | 1            | 0           | mix. Gaussian   | CG
0           | 0          | 1            | 1           | tabular         | T
0           | 1          | 0            | 0           | linear Gaussian | G
0           | 1          | 0            | 1           | logistic        | VG
0           | 1          | 1            | 0           | mix. Gaussian   | T
0           | 1          | 1            | 1           | tabular         | T
1           | 0          | 0            | 0           | linear Gaussian | G
1           | 0          | 0            | 1           | softmax         | T
1           | 0          | 1            | 0           | mix. Gaussian   | G
1           | 0          | 1            | 1           | tabular         | T
1           | 1          | 0            | 0           | linear Gaussian | scalar
1           | 1          | 0            | 1           | softmax         | scalar
1           | 1          | 1            | 0           | mix. Gaussian   | scalar
1           | 1          | 1            | 1           | tabular         | scalar
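As a hedged illustration of the multiple-parent case mentioned before the table, here is a sketch of a fresh model containing a conditional linear Gaussian node with one discrete and one continuous parent; the parameter values are arbitrary:

    % nodes: 1 (discrete, 2 states), 2 (continuous), 3 (continuous child of 1 and 2)
    dag = zeros(3, 3);
    dag(1,3) = 1;  dag(2,3) = 1;
    clg_bnet = mk_bnet(dag, [2 1 1], 'discrete', 1);
    clg_bnet.CPD{1} = tabular_CPD(clg_bnet, 1);
    clg_bnet.CPD{2} = gaussian_CPD(clg_bnet, 2);
    % one mean, variance and regression weight per value of the discrete parent
    clg_bnet.CPD{3} = gaussian_CPD(clg_bnet, 3, 'mean', [0 1], ...
                                   'cov', [1 1], 'weights', [0.5 -0.5]);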
Then we enter each case in the training set as evidence, and compute the posterior probability of each family (a node and its parents) in the network using an inference algorithm. These posteriors are passed to each node, so it can update its expected sufficient statistics (ESS).

Finally, we ask each node to compute the maximum likelihood estimate of its parameters given its ESS. For a Gaussian node, the MLE is computed using (weighted) linear regression; for a multinomial, the MLE is computed by normalizing the table of counts; and for a softmax, the MLE is computed using iteratively reweighted least squares (IRLS), as implemented in Netlab.
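A hedged sketch of this procedure in BNT, using the standard EM routine on the first hypothetical model above (the dataset is sampled from the model itself, purely for illustration):

    engine = jtree_inf_engine(bnet);
    ncases = 50;
    data = cell(N, ncases);        % data{i,m} = value of node i in case m ([] = hidden)
    for m = 1:ncases
      data(:, m) = sample_bnet(bnet);
    end
    data(1, :) = {[]};             % hide node 1 in every case, so EM has work to do
    max_iter = 10;
    [bnet2, LLtrace] = learn_params_em(engine, data, max_iter);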
The following papers spell out some of the details of these MLE calculations, including issues such as parameter tying and priors.