The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster , Nan Laird , and Donald Rubin in the Journal of the Royal Statistical Society (see reference below). They pointed out that method had been "proposed many times in special circumstances" by other authors, but the 1977 paper generalized the method and developed the theory behind it.
EM is frequently used for Data Clustering in Machine Learning and Computer Vision .
In Psychometrics , EM is almost indispensable for estimating item parameters and latent abilities of Item Response Theory models.
With the ability to deal with missing data and observe unidentified variables, EM is becoming a useful tool to price and manage risk of a portfolio.
The EM algorithm (and its faster variant OS-EM ) is also widely used in Medical Image reconstruction, especially in Positron Emission Tomography and Single Photon Emission Computed Tomography .
Various 1D, 2D and 3D demonstrations of EM together with Mixture Modeling are provided as part of the paired SOCR activities and applets. These applets and activities show empirically the properties of the EM algorithm for parameter estimation in diverse settings.
Let denote incomplete data consisting of values of observable variables, and let denote the missing data. Together, and form the complete data. can either be actual missing measurements or a hidden variable that would make the problem easier if its value were known. For instance, in Mixture Model s, the likelihood formula would be much more convenient if mixture components that "generated" the samples were known (see example below).
Let be the joint , that is, it can be thought of as a function of .
Further, note that the Conditional Distribution of the missing data given the observed can be expressed as:
|
|   |
"http://wwwinformationdelightinfo/information/entry/Bayes'_theorem" class="copylinks">Bayes Rule and the Law Of Total Probability (This formulation only requires knowledge of the observation likelihood given the unobservable data <math>p(\mathbf y\mathbf z, heta)</math> and the probability of the unobservable data <math>p(\mathbf z heta)</math>)
|
|   |
"" class="copylinks" target="_blank">\log p \left(\mathbf y, \mathbf z \,\, heta
ight) \Big \mathbf y
ight ,
|
|   |
"" class="copylinks" target="_blank">\log p \left(\mathbf y, \mathbf z \,\, heta
ight) \Big \mathbf y
ight
|
|   |
i\mathbf y_j, heta_t)
|
|   |
rac{p(z_j=i, \mathbf y_j heta_t)}{p(\mathbf y_j heta_t)}
|
|   |
rac{p(\mathbf y_jz_j=i, heta_t) p(z_j=i heta_t)}{\sum_{k=1}^n p(\mathbf y_j z_j=k, heta_t) p(z_j=k heta_t)}
|
|   |
E_{z} \left \ln \prod_{j=1}^m p \left(\mathbf y_j, \mathbf z heta
ight) \Big \mathbf y_j
ight \
|
|   |
E_{z} \left \sum_{j=1}^m \ln p \left(\mathbf y_j, \mathbf z heta
ight) \Big \mathbf y_j
ight \
|
|   |
\sum_{j=1}^m E_{z} \left \ln p \left(\mathbf y_j, \mathbf z heta
ight) \Big \mathbf y_j
ight \
|
|   |
\sum_{j=1}^m \sum_{i=1}^n p \left(z_j=i \mathbf y_j, heta_t
ight) \ln p\left(z_j=i, \mathbf y_j heta
ight) \
|
|   |
\sum_{j=1}^m \sum_{i=1}^n p(z_j=i \mathbf y_j, heta_t) \ln \left( p(\mathbf y_j z_j=i, heta) p(z_j=i heta)
ight)
|
|   |
1}^{n} p(z_j=i heta) = 1
|
|   |
\left( \sum_{j=1}^m \sum_{i=1}^n p(z_j=i \mathbf y_j, heta_t) \left( - rac{1}{2} \ln (2\pi) - rac{1}{2} \ln \left \sigma_i
ight - rac{1}{2}(\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i) + \ln p(z_j=i heta)
ight)
ight) - \lambda \left( \sum_{i=1}^{n} p(z_j=i heta) - 1
ight)
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \left( - rac{\partial}{\partial \mu_i} rac{1}{2}(\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i)
ight) \
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \left( - rac{1}{2}(\sigma_i^{-1} +\sigma_i^{-T})(\mathbf y_j - \mathbf \mu_i)(-1)
ight) \
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \left( \sigma_i^{-1}(\mathbf y_j - \mathbf \mu_i)
ight) \
|
|   |
1}^m p(z_j=i \mathbf y_j, heta_t) \sigma_i^{-1} \mathbf \mu_i
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \sigma_i^{-1} \mathbf y_j \
|
|   |
1}^m p(z_j=i \mathbf y_j, heta_t)
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \mathbf y_j \
|
|   |
rac{\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \mathbf y_j}{\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t)} \
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \left( - rac{\partial}{\partial \sigma_i} rac{1}{2} \ln \left \sigma_i
ight - rac{\partial}{\partial \sigma_i} rac{1}{2}(\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i)
ight) \
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \left( - rac{1}{2} \sigma_i^{-T} + rac{1}{2} \sigma_i^{-T}(\mathbf y_j - \mathbf \mu_i) (\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-T}
ight) \
|
|   |
1}^m p(z_j=i \mathbf y_j, heta_t) \sigma_i^{-1}
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i) (\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-1} \
|
|   |
1}^m p(z_j=i \mathbf y_j, heta_t)
|
|   |
\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i) (\mathbf y_j - \mathbf \mu_i)^T \
|
|   |
rac{\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) (\mathbf y_j - \mathbf \mu_i) (\mathbf y_j - \mathbf \mu_i)^T}{\sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t)} \
|
|   |
i heta)}
|
|   |
\left( \sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) rac{\partial \ln p(z_j=i heta)}{\partial p(z_j=i heta)}
ight) - \lambda \left( rac{\partial p(z_j=i heta)}{\partial p(z_j=i heta)}
ight) \
|
|   |
\left( \sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) rac{1}{p(z_j=i heta)}
ight) - \lambda \
|
|   |
1}^m p(z_j=i \mathbf y_j, heta_t) rac{1}{p(z_j=i heta)}
|
|   |
i heta)
|
|   |
rac{1}{\lambda} \sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \
|
|   |
1}^{n} p(z_j=i heta)
|
|   |
\sum_{i=1}^{n} rac{1}{\lambda} \sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \
|
|   |
\sum_{i=1}^{n} \sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \
|
|   |
i heta)
|
|   |
rac{1}{\lambda} \sum_{j=1}^m p(z_j=i \mathbf y_j, heta_t) \
|