(Reference - see chapter 4 of Murphy's MLPP)
In class, you talked about multivariate mixture-of-Gaussians models, where we assumed your dataset was generated according to the following generative process: 1) We sample a class from a categorical distribution, $$Cat(y|\theta) = \prod_{j=1}^K \theta_j^{\mathbb{I}(y_j=1)}$$ 2) Given the class, the features of a particular observation are sampled from a multivariate normal with class-specific mean and covariance: $$ MVN(x|y = c, \mu_c, \Sigma_c) = (2\pi)^{-D/2}|\Sigma_c|^{-1/2}\exp\left[-\frac{1}{2}(x-\mu_c)^T\Sigma^{-1}_c(x-\mu_c)\right]$$
Today, we're assuming the same generative process, except that now we assume we observe the class labels, $y_i$, and we're doing supervised learning.
As a running example, we assumed we had a dataset of heights and weights of men, women, and aliens from Mars (whose weight turns out to be inversely proportional to their height, so that our plot looks a little less boring...).
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
n = 1000
X = np.zeros((n,2)) # initialize our data
y = np.zeros((n, 3))
true_theta = [0.4, 0.4, 0.2] # class probabilities
# class means
true_means = [
    [75, 170],  # men
    [55, 150],  # women
    [65, 150]   # aliens
]
# class covariances (illustrative values - each must be a symmetric,
# positive semi-definite matrix; the aliens' height-weight covariance
# is negative, unlike the men's and women's)
true_cov = [
    [[80, 40], [40, 60]],   # men
    [[60, 30], [30, 50]],   # women
    [[60, -40], [-40, 50]]  # aliens
]
for i in range(n):
    # note - this code would be a lot faster if you vectorized this loop...
    # sample a class label
    y[i, :] = np.random.multinomial(1, true_theta, size=1).flatten()
    c = np.argmax(y[i, :])
    # sample the corresponding features
    X[i, :] = np.random.multivariate_normal(true_means[c], true_cov[c], 1).flatten()
plt.figure(figsize=(8,6))
plt.plot(X[y[:,0]==1,0], X[y[:,0]==1,1], 'x', label='Men')
plt.plot(X[y[:,1]==1,0], X[y[:,1]==1,1], 'x', label='Women')
plt.plot(X[y[:,2]==1,0], X[y[:,2]==1,1], 'x', label='Aliens from Mars')
plt.axis('equal')
plt.xlabel('Weight (kg)')
plt.ylabel('Height (cm)')
plt.legend(loc=2)
plt.show()
To fit the parameters of the model, we're going to derive closed-form maximum likelihood estimators (this works out nicely because we're working with Gaussians). For this tutorial, we're going to assume that all of our Gaussians share the same covariance matrix, i.e. $$\Sigma_1 = \Sigma_2 = \dots = \Sigma_K = \Sigma$$ (note that this is clearly an inappropriate assumption for the data we're modeling, because our aliens have a very different covariance from the men and women - we'll ignore this for now). The resulting model is known as a Linear Discriminant Analysis (LDA) model (see Murphy, ch. 4).
We can write the log-likelihood function as follows: $$ \ell(\mathcal{D} \mid \theta, \mu_{1:K}, \Sigma_{1:K}) = \sum_{i=1}^n \left[ \log p(x_i | y = y_i, \mu_{y_i}, \Sigma_{y_i}) + \log p(y_i | \theta) \right] $$ where $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ denotes the dataset.
To maximise $\ell$ with respect to $\theta$, we're just finding the MLE of a categorical distribution, since $$ \nabla_\theta \ell(\mathcal{D} \mid \theta, \mu_{1:K}, \Sigma_{1:K}) = \sum_{i=1}^n\nabla_\theta \log p(y_i | \theta) =\sum_{i=1}^n \nabla_\theta \log \prod_{j=1}^K \theta_j^{\mathbb{I}(y_{ij}=1)} =\sum_{i=1}^n \sum_{j=1}^K \nabla_\theta \log \theta_j^{\mathbb{I}(y_{ij}=1)} $$
We have to remember to constrain $\theta$ to sum to 1, which we can do by adding the appropriate Lagrange multiplier. Our Lagrangian is as follows: $$ L(\theta, \lambda) = \sum_{i=1}^n \sum_{j=1}^K \mathbb{I}(y_{ij}=1) \log \theta_j - \lambda \left(\sum_l \theta_l - 1\right) $$
$$ \frac{\partial L}{\partial \theta_c} = 0 \Rightarrow \sum_{i=1}^n \mathbb{I}(y_i = c)\frac{1}{\theta_c} = \lambda \Rightarrow n_c = \lambda \theta_c $$ where $n_c$ is the number of examples for which $y_i$ is class $c$.
Now summing over all classes, and using the fact that $\sum_c\theta_c = 1$ and $\sum_c n_c = n$, we get: $$ \sum_c n_c = \sum_c\lambda \theta_c \Rightarrow n = \lambda $$ Putting this together, we get that the MLE for $\theta_c$ is, $$ \theta_c = \frac{n_c}{n} $$
In our running example, we could calculate this as follows:
theta_hat = y.mean(axis=0)
print(theta_hat)
Let's now calculate the MLE of $\mu_c$.
$$ \nabla_{\mu_c} \ell(\mathcal{D} \mid \theta, \mu_{1:K}, \Sigma_{1:K}) = \nabla_{\mu_c} \sum_{i=1}^n \log p(x_i | y = y_i,\mu_{y_i}, \Sigma_{y_i} ) $$
$$ = \sum_{i=1}^n \mathbb{I}(y_i = c) \nabla_{\mu_c}\log p(x_i | y = c,\mu_{c}, \Sigma_{c} ) $$
$$ = \sum_{i=1}^n \mathbb{I}(y_i = c) \nabla_{\mu_c}\left[\log (2\pi)^{-D/2} + \log |\Sigma_c|^{-1/2} -\frac{1}{2}(x_i-\mu_c)^T\Sigma^{-1}_c(x_i-\mu_c)\right] $$
$$ = \sum_{i=1}^n \mathbb{I}(y_i = c) \nabla_{\mu_c} \left[-\frac{1}{2}(x_i-\mu_c)^T\Sigma^{-1}_c(x_i-\mu_c)\right] $$
$$ = \sum_{i=1}^n \mathbb{I}(y_i = c)\, \Sigma^{-1}_c(x_i-\mu_c) $$
where the last line follows from the identity $\nabla_s (x-s)^TW(x-s) = -2W(x-s)$, which holds if $W$ is symmetric (see the Matrix Cookbook). Setting this equal to zero, we get: $$ \sum_{i=1}^n \mathbb{I}(y_i = c)\, x_i = \mu_c \sum_{i=1}^n \mathbb{I}(y_i = c) = \mu_c n_c $$
$$ \mu_c = \frac{1}{n_c} \sum_{i=1}^n \mathbb{I}(y_i = c)\, x_i $$In code, this would be:
n_men = float((y[:,0] == 1).sum())
mu_men = 1/n_men * X[y[:,0] == 1, :].sum(axis=0)
n_women = float((y[:,1] == 1).sum())
mu_women = 1/n_women * X[y[:,1] == 1, :].sum(axis=0)
n_aliens = float((y[:,2] == 1).sum())
mu_aliens = 1/n_aliens * X[y[:,2] == 1, :].sum(axis=0)
print(mu_men, mu_women, mu_aliens)
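Equivalently, since y is a one-hot matrix, all three class means can be computed in a single vectorized step. A quick sketch using the X and y arrays defined above (the name mu_hat is just illustrative):
# vectorized alternative: y.T @ X sums the features within each class,
# and y.sum(axis=0) gives the per-class counts n_c
mu_hat = (y.T @ X) / y.sum(axis=0)[:, None]
print(mu_hat)  # rows are the estimated means for men, women, aliens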
Using the same trace trick and gradient of the log-determinant that we saw in class, we can differentiate with respect to the shared precision matrix $\Sigma^{-1}$: $$ \nabla_{\Sigma^{-1}} \ell(\mathcal{D} \mid \theta, \mu_{1:K}, \Sigma) = \sum_{i=1}^n \left[\frac{1}{2}\Sigma - \frac{1}{2}(x_i-\mu_{y_i})(x_i-\mu_{y_i})^T\right] $$
Setting this equal to zero, we get: $$ \sum_{i=1}^n \Sigma = \sum_{i=1}^n (x_i-\mu_{y_i})(x_i-\mu_{y_i})^T $$
$$ \Sigma = \frac{1}{n} \sum_{i=1}^n (x_i-\mu_{y_i})(x_i-\mu_{y_i})^T $$ That is, in linear discriminant analysis we subtract the class-specific mean from each observation and then compute the sample covariance of the residuals.
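In code, a minimal sketch of this pooled covariance estimate, reusing the class means computed above (the names mu_hat, resid, and Sigma_hat are just illustrative):
# stack the class means into a (3, 2) array so we can look up mu_{y_i} for each row
mu_hat = np.vstack([mu_men, mu_women, mu_aliens])
# subtract each observation's class-specific mean...
resid = X - mu_hat[np.argmax(y, axis=1)]
# ...and compute the sample covariance of the residuals
Sigma_hat = (resid.T @ resid) / n
print(Sigma_hat)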
Finally, we noticed that we could reparameterize the model. By Bayes' rule, the class posterior is: $$ p(y = c|x, \mu_{1:K}, \Sigma, \theta) = \frac{p(x|y = c, \mu_c, \Sigma)\, p(y = c|\theta)}{\sum_{c'}p(x|y = c', \mu_{c'}, \Sigma)\, p(y = c'|\theta)} $$
$$ = \frac{(2\pi)^{-D/2}|\Sigma|^{-1/2}\exp\left[-\frac{1}{2}(x-\mu_c)^T\Sigma^{-1}(x-\mu_c)\right] \theta_{c}}{\sum_{c'}(2\pi)^{-D/2}|\Sigma|^{-1/2}\exp\left[-\frac{1}{2}(x-\mu_{c'})^T\Sigma^{-1}(x-\mu_{c'})\right] \theta_{c'}} $$
$$ = \frac{\exp\left[-\frac{1}{2}(x-\mu_c)^T\Sigma^{-1}(x-\mu_c) + \log\theta_{c}\right]}{\sum_{c'}\exp\left[-\frac{1}{2}(x-\mu_{c'})^T\Sigma^{-1}(x-\mu_{c'}) + \log\theta_{c'}\right]} $$
$$ = \frac{\exp\left[-\frac{1}{2}x^T\Sigma^{-1}x +\mu_c^T\Sigma^{-1}x -\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\theta_{c}\right]}{\sum_{c'}\exp\left[-\frac{1}{2}x^T\Sigma^{-1}x +\mu_{c'}^T\Sigma^{-1}x -\frac{1}{2}\mu_{c'}^T\Sigma^{-1}\mu_{c'} + \log\theta_{c'}\right]} $$
$$ = \frac{\exp(-\frac{1}{2}x^T\Sigma^{-1}x)\exp\left[\mu_c^T\Sigma^{-1}x -\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\theta_{c}\right]}{\sum_{c'}\exp(-\frac{1}{2}x^T\Sigma^{-1}x)\exp\left[\mu_{c'}^T\Sigma^{-1}x -\frac{1}{2}\mu_{c'}^T\Sigma^{-1}\mu_{c'} + \log\theta_{c'}\right]} $$
$$ = \frac{\exp\left[\mu_c^T\Sigma^{-1}x -\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\theta_{c}\right]}{\sum_{c'}\exp\left[\mu_{c'}^T\Sigma^{-1}x -\frac{1}{2}\mu_{c'}^T\Sigma^{-1}\mu_{c'} + \log\theta_{c'}\right]} $$
If we let $\beta_c = \Sigma^{-1}\mu_c$ and $\alpha_c = -\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\theta_{c}$, we can re-write this as:
$$ = \frac{\exp\left(\beta_c^Tx + \alpha_c\right)}{\sum_{c'} \exp\left(\beta_{c'}^Tx + \alpha_{c'} \right)} $$ which is exactly the functional form of multi-class logistic regression. So we've built a generative model, but our assumption on $\Sigma$ has led to a linear decision boundary that takes the same form as a discriminative model. See section 8.6 of Murphy if you're interested in reading more about this.
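As a quick sanity check, here's a sketch of how one might plug the fitted parameters into this softmax form to get posterior class probabilities and predicted labels for our running example. It assumes theta_hat from earlier and the illustrative mu_hat and Sigma_hat from the snippets above:
# beta_c = Sigma^{-1} mu_c and alpha_c = -1/2 mu_c^T Sigma^{-1} mu_c + log theta_c
Sigma_inv = np.linalg.inv(Sigma_hat)
betas = mu_hat @ Sigma_inv                                     # (3, 2): row c is beta_c^T
alphas = -0.5 * np.sum((mu_hat @ Sigma_inv) * mu_hat, axis=1) + np.log(theta_hat)
# softmax over classes for every observation
logits = X @ betas.T + alphas                                  # (n, 3)
logits -= logits.max(axis=1, keepdims=True)                    # for numerical stability
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)
# compare the most probable class to the true labels
print(np.mean(np.argmax(probs, axis=1) == np.argmax(y, axis=1)))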