
Free energy principle

Expectation-maximization algorithm

Andrius Kulikauskas: I am trying to master the expectation-maximization algorithm.

A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 1977.

Videos

Karl Friston. Learning and inference in the brain.

In density learning, representational learning has two components that are framed in terms of expectation maximisation (EM; Dempster, Laird, & Rubin, 1977). Iterating an E-step ensures that the recognition density approximates the inverse of the generative model, while the M-step ensures that the generative model can predict the observed inputs. Probabilistic recognition proceeds by using {$q(v; u,\phi)$} to determine the probability that {$v$} caused the observed sensory inputs. EM provides a useful procedure for density estimation that relates many different models within a single framework with direct connections to statistical mechanics: both steps of the algorithm maximise a function of the densities that corresponds to the negative free energy in physics.

This objective function comprises two terms. The first is the expected log likelihood of the inputs under the generative model. The second is the Kullback–Leibler (KL) divergence between the approximating and true recognition densities. Critically, the KL term is always non-negative, rendering F a lower bound on the expected log likelihood of the inputs. Maximising F therefore encompasses two components of representational learning: (i) it increases the likelihood of the inputs under the generative model, and (ii) it minimises the discrepancy between the approximate recognition density and the one implied by the generative model. The E-step increases F with respect to the recognition parameters {$\phi$}, ensuring a veridical approximation to the recognition distribution implied by the generative parameters {$\theta$}. The M-step changes {$\theta$}, enabling the generative model to reproduce the inputs.
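Spelled out (a sketch using the standard EM decomposition with the symbols above, not an equation quoted from the source):

{$F = \mathrm{E}_q[\ln p(u,v;\theta)] - \mathrm{E}_q[\ln q(v;u,\phi)] = \ln p(u;\theta) - \mathrm{KL}\big[q(v;u,\phi)\,\|\,p(v|u;\theta)\big]$}

Since the KL term is non-negative, F is a lower bound on the log likelihood of the inputs, attained exactly when the recognition density {$q$} matches the true posterior.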

{$\mathbf{E}\text{-step:}\quad \phi = \underset{\phi}{\arg\max}\; F$}

{$\mathbf{M}\text{-step:}\quad \theta = \underset{\theta}{\arg\max}\; F$}
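The two updates above can be made concrete with the textbook case of a two-component 1-D Gaussian mixture. This is only an illustrative sketch of generic EM, not Friston's neural formulation; the synthetic data, initial parameters, and iteration count are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic inputs u drawn from two Gaussians; the hidden "cause" v is
# which component generated each point.
u = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

# Generative parameters theta: mixing weight of component 1, means, variances.
pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: recognition density q(v; u, phi) -- the posterior
    # responsibility of each component (shared constants cancel in the ratio).
    p0 = (1 - pi) * np.exp(-(u - mu[0])**2 / (2 * var[0])) / np.sqrt(var[0])
    p1 = pi * np.exp(-(u - mu[1])**2 / (2 * var[1])) / np.sqrt(var[1])
    r = p1 / (p0 + p1)          # probability that cause v = 1
    # M-step: re-estimate theta so the generative model predicts the inputs.
    pi = r.mean()
    mu = np.array([((1 - r) * u).sum() / (1 - r).sum(),
                   (r * u).sum() / r.sum()])
    var = np.array([((1 - r) * (u - mu[0])**2).sum() / (1 - r).sum(),
                    (r * (u - mu[1])**2).sum() / r.sum()])

print(mu)   # component means should end up near -2 and 3
```

Here the E-step computes the exact posterior over the hidden cause, so the KL term vanishes and F touches the log likelihood; the M-step then re-fits {$\theta$} to the inputs.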

There are a number of ways of motivating the free energy formulation in Eq. (4). A useful one, in this context, rests upon the problem posed by non-invertible models. This problem is finessed by assuming it is sufficient to match the joint probability of inputs and causes under the generative model, {$p(u,v;\theta)=p(u|v;\theta)p(v;\theta)$}, with that implied by recognising the causes of the inputs encountered, {$p(u,v;\phi)=q(v;u,\phi)p(u)$}. Both of these distributions are well defined even when {$p(v|u;\theta)$} is not easily parameterised. This matching minimises the divergence between the two joint densities.
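Spelled out (a sketch consistent with the two joint densities just defined; the exact expression in the source is not quoted here):

{$\mathrm{KL}\big[q(v;u,\phi)\,p(u)\,\big\|\,p(u,v;\theta)\big] = \iint q(v;u,\phi)\,p(u)\,\ln\frac{q(v;u,\phi)\,p(u)}{p(u,v;\theta)}\,dv\,du$}

Minimising this divergence with respect to {$\phi$} for fixed {$\theta$} corresponds to the E-step, and with respect to {$\theta$} to the M-step.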