This toolbox supports inference and learning for HMMs with discrete outputs (dhmm's), Gaussian outputs (ghmm's), or mixture-of-Gaussians outputs (mhmm's). The Gaussians can have full, diagonal, or spherical (isotropic) covariances. The toolbox also supports discrete inputs, as in a POMDP. The inference routines support filtering, smoothing, and fixed-lag smoothing.
HMMs and some common variants (e.g., input-output HMMs) can be concisely explained using the language of Bayesian networks, as we now demonstrate.
[Figure: (a) an HMM represented as a Bayesian network; (b) the same HMM with its parameters shown as explicit nodes P1, P2, P3]
Consider the Bayesian network in Figure (a), which represents a hidden Markov model (HMM). (Circles denote continuous-valued random variables and squares denote discrete-valued ones; clear nodes are hidden, shaded nodes are observed.) This encodes the joint distribution:
P(Q,Y) = P(Q_1) P(Y_1|Q_1) P(Q_2|Q_1) P(Y_2|Q_2) ...

For a sequence of length T, we simply ``unroll'' the model for T time steps. In general, such a dynamic Bayesian network (DBN) can be specified by just drawing two time slices (this is sometimes called a 2TBN); the structure (and parameters) are assumed to repeat.
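Written out for a general sequence of length T (using Q_{1:T} and Y_{1:T} as shorthand for the full state and observation sequences), the unrolled joint is

$$
P(Q_{1:T}, Y_{1:T}) = P(Q_1)\, P(Y_1 \mid Q_1) \prod_{t=2}^{T} P(Q_t \mid Q_{t-1})\, P(Y_t \mid Q_t).
$$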
The Markov property states that the future is independent of the past given the present, i.e., Q_{t+1} is conditionally independent of Q_{t-1} given Q_t. We can parameterize this Markov chain using a transition matrix, M_{ij} = P(Q_{t+1}=j | Q_t=i), and a prior distribution, \pi_i = P(Q_1 = i).
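As a concrete illustration (plain MATLAB with made-up numbers, not a toolbox routine), here is one way to write down such a parameterization for a 2-state chain and sample a state sequence from it:

    % Illustrative parameters for a 2-state Markov chain.
    prior = [0.5; 0.5];          % prior(i) = P(Q_1 = i)
    transmat = [0.9 0.1;         % transmat(i,j) = P(Q_{t+1}=j | Q_t=i);
                0.2 0.8];        % each row sums to 1
    % Sample a state sequence of length T by repeatedly drawing from the rows.
    T = 10;
    qseq = zeros(1,T);
    qseq(1) = find(rand < cumsum(prior), 1);
    for t = 2:T
      qseq(t) = find(rand < cumsum(transmat(qseq(t-1),:)), 1);
    end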
We have assumed that this is a homogeneous Markov chain, i.e., the parameters do not vary with time. This assumption can be made explicit by representing the parameters as nodes: see Figure (b), where P1 represents \pi, P2 represents the transition matrix, and P3 represents the parameters for the observation model. If we think of these parameters as random variables (as in the Bayesian approach), parameter estimation becomes equivalent to inference. If we think of the parameters as fixed, but unknown, quantities, parameter estimation requires a separate learning procedure (usually EM). In the latter case, we typically do not represent the parameters in the graph; shared parameters (as in this example) are implemented by specifying that the corresponding CPDs are ``tied''.
An HMM is called a ``hidden'' Markov model because we don't observe the states of the Markov chain, Q_t, but only a function of them, namely Y_t. For example, if Y_t is a vector, we might define P(Y_t=y|Q_t=i) = N(y; \mu_i, \Sigma_i). A richer model, widely used in speech recognition, is to model the output (conditioned on the hidden state) as a mixture of Gaussians. This is shown below.
[Figure: an HMM whose output, conditioned on the hidden state, is a mixture of Gaussians]
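To make the output model concrete, here is a self-contained sketch (plain MATLAB with illustrative numbers, not a toolbox routine) that evaluates the single-Gaussian density P(Y_t=y|Q_t=i) = N(y; \mu_i, \Sigma_i) for each state of a 2-state model with 2-dimensional observations; the mixture model above simply takes a weighted sum of M such densities per state.

    % Illustrative parameters: mu(:,i) and Sigma(:,:,i) belong to state i.
    mu = [0 3;
          0 3];
    Sigma = cat(3, eye(2), 2*eye(2));
    y = [0.5; 0.2];                      % a single observation vector
    d = length(y);
    p = zeros(2,1);
    for i = 1:2
      dy = y - mu(:,i);
      p(i) = exp(-0.5 * dy' * (Sigma(:,:,i) \ dy)) / ...
             sqrt((2*pi)^d * det(Sigma(:,:,i)));   % N(y; mu_i, Sigma_i)
    end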
Some popular variations on the basic HMM theme are illustrated below (from left to right: an input-output HMM, a factorial HMM, and a coupled HMM). (In the input-output model, the CPD P(Q|U) could be a softmax function, or a neural network.) If we have software to handle inference and learning in general Bayesian networks (such as my Bayes Net Toolbox), all of these models become trivial to implement.
[Figure: from left to right, an input-output HMM, a factorial HMM, and a coupled HMM]
This software is designed for simple HMMs. If you want to use these more complicated alternatives, use my Bayes Net Toolbox.
Here is an example of learning an HMM with discrete outputs using EM (Baum-Welch). First, generate random parameters for a model with Q=2 hidden states and O=3 output symbols:

    O = 3;
    Q = 2;
    prior0 = normalise(rand(Q,1));
    transmat0 = mk_stochastic(rand(Q,Q));
    obsmat0 = mk_stochastic(rand(Q,O));

Now we sample nex=10 sequences of length T=10 each from this model, to use as training data:
    T = 10; nex = 10;
    data = sample_dhmm(prior0, transmat0, obsmat0, T, nex);

Now we make a random guess as to what the parameters are,
    prior1 = normalise(rand(Q,1));
    transmat1 = mk_stochastic(rand(Q,Q));
    obsmat1 = mk_stochastic(rand(Q,O));

and improve our guess using 5 iterations of EM:
    max_iter = 5;
    [LL, prior2, transmat2, obsmat2] = learn_dhmm(data, prior1, transmat1, obsmat1, max_iter);

LL(t) is the log-likelihood after iteration t, so we can plot the learning curve.
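For example (a minimal sketch; the plotting calls are standard MATLAB, nothing toolbox-specific):

    plot(LL, '-o');                       % one point per EM iteration
    xlabel('EM iteration');
    ylabel('log-likelihood of training data');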
To compute the log-likelihood of a sequence (or a set of sequences) under a trained model, use

    loglik = log_lik_dhmm(data, prior, transmat, obsmat)

Note: the discrete alphabet is assumed to be {1, 2, ..., O}, where O = size(obsmat, 2). Hence data cannot contain any 0s.
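If your raw observations use arbitrary symbol codes (e.g., codes that include 0), one way to relabel them as {1, ..., O} is with unique; this is a generic MATLAB sketch, not a toolbox routine, and the raw codes are made up:

    raw = [5 0 5 9;                      % made-up symbol codes, one row per sequence
           9 7 7 0];
    [symbols, junk, idx] = unique(raw);  % idx gives each entry's index into symbols
    data = reshape(idx, size(raw));      % data now takes values in {1, ..., O}
    O = length(symbols);                 % here O = 4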
To classify a sequence into one of k classes, train up k HMMs, one per class, and then compute the log-likelihood that each model gives to the test sequence; if the i'th model is the most likely, then declare the class of the sequence to be class i.
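A minimal sketch of this procedure, assuming models is a cell array in which models{i} holds the parameters of the i'th trained HMM (the variable and field names here are made up for illustration) and test_data is the sequence to classify:

    % Score the test sequence under each class-conditional HMM and pick the best.
    k = length(models);
    scores = zeros(k,1);
    for i = 1:k
      scores(i) = log_lik_dhmm(test_data, models{i}.prior, models{i}.transmat, models{i}.obsmat);
    end
    [best_score, predicted_class] = max(scores);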
To compute the most probable sequence of hidden states (Viterbi decoding), first evaluate the matrix of observation likelihoods B(i,t) = P(y_t | Q_t=i):

    B = mk_dhmm_obs_lik(data, obsmat)

Then you can use
    [path, loglik] = viterbi_path(prior, transmat, B)
Next, consider HMMs with mixture-of-Gaussians outputs. Let us generate nex=10 random vector-valued sequences of length T=5, where each observation vector has size O=2:

    O = 2;
    T = 5;
    nex = 10;
    data = randn(O,T,nex);

Now let us fit a mixture of M=2 Gaussians for each of the Q=2 states using K-means.
    M = 2;
    Q = 2;
    left_right = 0;
    [prior0, transmat0, mixmat0, mu0, Sigma0] = init_mhmm(data, Q, M, 'diag', left_right);

Finally, let us improve these parameter estimates using EM:
    max_iter = 5;
    [LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
        learn_mhmm(data, prior0, transmat0, mu0, Sigma0, mixmat0, max_iter);

Since EM only finds a local optimum, good initialisation is crucial. The procedure implemented in init_mhmm is very crude, and is probably not adequate for real applications...
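One simple (if crude) way to reduce the risk of a bad local optimum is to run EM from several initialisations and keep the run with the highest final log-likelihood. A minimal sketch, assuming init_mhmm produces a different initialisation on each call:

    num_restarts = 5;
    best_LL = -inf;
    for r = 1:num_restarts
      [prior0, transmat0, mixmat0, mu0, Sigma0] = init_mhmm(data, Q, M, 'diag', left_right);
      [LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
          learn_mhmm(data, prior0, transmat0, mu0, Sigma0, mixmat0, max_iter);
      if LL(end) > best_LL
        best_LL = LL(end);
        best_prior = prior1; best_transmat = transmat1;
        best_mu = mu1; best_Sigma = Sigma1; best_mixmat = mixmat1;
      end
    end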
To compute the log-likelihood of a set of sequences under a trained model, use

    loglik = log_lik_mhmm(data, prior, transmat, mixmat, mu, Sigma);
For Viterbi decoding, first evaluate the matrix of observation likelihoods B(i,t) = P(y_t | Q_t=i):

    B = mk_mhmm_obs_lik(data, mu, Sigma, mixmat);

Finally, use
    [path, loglik] = viterbi_path(prior, transmat, B);
For HMMs with (single) Gaussian outputs (ghmm's), the learning routine is called as above, but without a mixing matrix:

    [LL, prior1, transmat1, mu1, Sigma1] = ...
        learn_mhmm(data, prior0, transmat0, mu0, Sigma0, max_iter);

The log-likelihood routine (used, e.g., to classify sequences) is called as follows:
    loglik = log_lik_ghmm(data, prior, transmat, mu, Sigma);

The observation likelihood routine, which computes B(i,t) = P(y_t | Q_t=i) for use with viterbi_path, is called as
    B = mk_ghmm_obs_lik(data, mu, Sigma);