
Active Inference

Andrius: I am writing out my understanding.

Free Energy Principle

Consider the correspondence of a world of evidence (known through our senses) and a world of causes (not known but inferred).

  • {$y$} is the evidence, the What: what is known, what the answering mind observes, the observation, for example, "jumping" or "not moving".
  • {$x$} is the cause, the How: what is not known, what the questioning mind supposes, the subject of belief or hypothesis, for example, "a frog" or "an apple".
    • {$x$} is an estimate of the features of an agent

We start with the prior belief {$P(x)$} regarding cause {$x$}. Given the evidence {$y$}, we want to calculate the new belief, the posterior belief {$P(x|y)$} regarding cause {$x$}.

We conflate the two worlds by considering them both in terms of probabilities.

  • {$P(x)$} is the probability of the cause. It is the prior belief. (Regarding How)
  • {$P(x|y)$} is the probability of the cause given the evidence. It is the posterior belief. (Regarding Why)
  • {$P(y)$} is the probability of the evidence. It is called the marginal probability or the model evidence. (Regarding What)
  • {$P(y|x)$} is the probability of the evidence given the cause. It is called the likelihood. (Regarding Whether)
  • {$P(x,y)$} is the probability of the evidence and the cause.

Bayes's theorem states

  • {$P(x,y)=P(x)P(y|x)=P(y)P(x|y)$}

Marginalization states that summing over all possible {$x$} gives:

  • {$\sum_x P(x,y)=\sum_x P(y)P(x|y)=P(y)\sum_x P(x|y)=P(y)$}

This means that we can calculate the model evidence (the probability of the evidence) by summing the combined probability (for evidence {$y$} and cause {$x$}) over all of the causes {$x$}. But also:

  • {$P(y)=\sum_x P(x,y)=\sum_x P(x)P(y|x)$}

This means that we can calculate the model evidence (the probability of the evidence) by summing over all causes {$x$} the product of the prior belief and the likelihood.

The generative model consists of the prior belief {$P(x)$} and the likelihood {$P(y|x)$}.

  • They yield a sensory output of what we predict to see in the world, which we can compare with what we then actually do see.
  • From them by marginalization we can calculate the model evidence {$P(y)$}.
  • And then using Bayes's theorem we can calculate the posterior belief {$P(x|y)$}. That is the goal!
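
Here is a minimal numeric sketch in Python of this whole pipeline, from the generative model to the posterior belief. It is an illustration only: the causes and observations are the frog/apple example above, and the probability numbers are invented.

    import numpy as np

    # Hidden causes x and observable evidence y, from the example above.
    causes = ["frog", "apple"]                # x
    observations = ["jumping", "not moving"]  # y

    # Generative model: prior P(x) and likelihood P(y|x).
    # (These particular numbers are invented for illustration.)
    P_x = np.array([0.3, 0.7])            # P(frog), P(apple)
    P_y_given_x = np.array([[0.8, 0.2],   # P(y | frog)
                            [0.1, 0.9]])  # P(y | apple)

    # Marginalization: model evidence P(y) = sum_x P(x) P(y|x).
    P_y = P_x @ P_y_given_x

    # Bayes's theorem: posterior belief P(x|y) = P(x) P(y|x) / P(y).
    y = 0                                 # we observe "jumping"
    P_x_given_y = P_x * P_y_given_x[:, y] / P_y[y]

    print(P_y)          # [0.31, 0.69], the model evidence
    print(P_x_given_y)  # [0.774..., 0.225...], the posterior over causes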

Variational free energy {$F[Q,y]$}

  • It is a function of the questioning mind's approximation of the posterior belief {$Q$} and the answering mind's evidence {$y$}.
  • It is the sum of the prediction error {$u'-u$} (how much the generative model's predicted output {$u'$} differs from the latest sensory data {$u$}) and the deviation {$D_{KL}(v'\parallel v_{\textrm{prior}})$} of the posterior inferred cause {$v'$} from the prior inferred cause {$v_{\textrm{prior}}$}.
    • {$F[Q,y]=\mathbb{E}_Q[\textrm{ln}Q(x|y)-\textrm{ln}P(x)]+\mathbb{E}_Q[-\textrm{ln}P(y|x)]$}
    • The prediction error is {$\mathbb{E}_Q[-\textrm{ln}P(y|x)]$}, the surprise of the observation {$y$} given the supposed cause {$x$}, averaged over the causes.
    • The deviation compares, for each cause {$x$}, the approximate posterior belief {$\textrm{ln}Q(x|y)$} in that cause, supposing observation {$y$}, with the prior belief {$\textrm{ln}P(x)$}.
  • Divergence minus evidence: {$D_{KL}[Q(x)\parallel P(x|y)] - \textrm{ln}P(y)$}. Free energy is minimized when divergence decreases and evidence increases (approaches 1).
  • Deviation plus prediction error: {$D_{KL}[Q(x|y)\parallel P(x)]+\mathbb{E}_Q[-\textrm{ln}P(y|x)]$}, as derived below.
  • Alternatively, it is complexity minus accuracy: {$D_{KL}[Q(x)\parallel P(x)] - \mathbb{E}_{Q(x)}[\textrm{ln}P(y|x)]$}. Complexity is the degree to which the approximation of the posterior belief does not match the prior belief, and accuracy is the extent to which the likelihood overlaps with the approximation of the posterior belief. Free energy is minimized when complexity decreases and accuracy increases.
  • From the physical point of view, it is energy minus entropy:
    • {$\sum_xQ(x|y)\textrm{ln}Q(x|y) - \sum_xQ(x|y)\textrm{ln}P(x,y)$}, which is minus entropy plus average energy
    • {$-\mathbb{E}_{Q(x)}[\textrm{ln}P(y,x)]-H[Q(x)]$}.
    • Comparing with the other definitions, this can be written as minus entropy (which expresses the questioning mind and the approximation) plus energy (which expresses the answering mind and the evidence). Free energy is minimized when energy decreases and entropy increases. According to the second law of thermodynamics, entropy stays the same or increases.
    • This is Helmholtz free energy {$U-TS$}, where {$U$} is the internal energy of the system, {$T$} is the temperature, and {$S$} is the entropy. It measures the useful work obtainable from a closed thermodynamic system at constant temperature and volume, and thus allows for pressure changes, as with explosives. Whereas Gibbs free energy, relevant for chemical reactions, assumes constant temperature and pressure, allowing for volume changes.
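
The following Python sketch checks numerically that these decompositions agree. It continues the invented numbers from the sketch above; the particular {$Q$} is an arbitrary stand-in for the approximate posterior.

    import numpy as np

    P_x = np.array([0.3, 0.7])                    # prior P(x)
    P_y_given_x = np.array([[0.8, 0.2],
                            [0.1, 0.9]])          # likelihood P(y|x)
    y = 0                                         # observe "jumping"
    P_xy = P_x * P_y_given_x[:, y]                # joint P(x,y) at the observed y
    P_y = P_xy.sum()                              # model evidence P(y)
    P_post = P_xy / P_y                           # true posterior P(x|y)

    Q = np.array([0.6, 0.4])                      # arbitrary approximate posterior

    def kl(q, p):
        return np.sum(q * np.log(q / p))

    # Definition: F[Q,y] = E_Q[ln Q(x) - ln P(x,y)]
    F = np.sum(Q * (np.log(Q) - np.log(P_xy)))

    # Divergence minus evidence
    F_div_evid = kl(Q, P_post) - np.log(P_y)
    # Complexity minus accuracy
    F_cplx_acc = kl(Q, P_x) - np.sum(Q * np.log(P_y_given_x[:, y]))
    # Energy minus entropy
    avg_energy = np.sum(Q * -np.log(P_xy))
    entropy = -np.sum(Q * np.log(Q))
    F_en_ent = avg_energy - entropy

    print(np.allclose(F, F_div_evid),
          np.allclose(F, F_cplx_acc),
          np.allclose(F, F_en_ent))               # True True True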

Minimize free energy by adjusting the parameters {$\phi$} (see the grid-search sketch after the discrepancy definitions below).

{$P$} describes the probabilities given by the first mind, the neural mind, which knows the actuality

{$Q$} approximates {$P$} to calculate the posterior belief

  • {$Q$} describes the probabilities given by the second mind, the conceptual mind, which is modeling the actuality. We have {$P\sim Q$}.
  • {$Q$} is a guess, an inference, that approximates the posterior belief {$Q(v|u)\sim P(v|u)$}.
  • Use the approximate posterior {$Q$} and learn its parameters (synaptic weights) {$\phi$}.

My ideas

  • Think of {$Q(x)$} as modeling the state of full knowledge of observations. Thus {$Q(y)=1$} and {$Q(x|y)=Q(x)$}. Thus we are comparing {$P(x|y)$} with {$Q(x)$}.
  • In this spirit, we can write:

{$\textrm{ln}P(y)=\textrm{ln}\sum_x P(y,x)\frac{Q(x)}{Q(x)}=\textrm{ln}\mathbb{E}_{Q(x)}\left [\frac{P(y,x)}{Q(x)}\right ]$} and by Jensen's inequality

{$\geq \mathbb{E}_{Q(x)}\left [\textrm{ln}\frac{P(y,x)}{Q(x)}\right ] = -F[Q,y]$} which is the negative of the free energy

Recalling that {$P(x,y)=P(y)P(x|y)$} we have:

{$\textrm{ln}P(x,y)=\textrm{ln}P(y) + \textrm{ln}P(x|y)$} implies

{$\mathbb{E}_{P(x|y)}[\textrm{ln}P(x,y)]=\textrm{ln}P(y)+\mathbb{E}_{P(x|y)}[\textrm{ln}P(x|y)]$}

and comparing {$Q(x)$} with {$P(x|y)$} and recalling the definition {$F[Q,y]=\mathbb{E}_{Q(x|y)}\left [\textrm{ln}\frac{Q(x|y)}{P(x,y)}\right ]$} we can compare

{$\mathbb{E}_{Q(x)}[\textrm{ln}P(x,y)]= -F[Q,y] + \mathbb{E}_{Q(x)}[\textrm{ln}Q(x)]$}

From this perspective, when {$Q(x)$} matches the posterior {$P(x|y)$}, the free energy {$F[Q,y]$} expresses {$\mathbb{E}_{P(x|y)}\left [\textrm{ln}\frac{P(x|y)}{P(x,y)}\right ]=\textrm{ln}\frac{1}{P(y)}=-\textrm{ln}P(y)$}. The probability {$P(y)\leq 1$} and so {$-\textrm{ln}P(y)\geq 0$}. Minimizing the free energy means increasing the certainty {$P(y)$}. Zero free energy means that {$P(y)=1$} is certain. And this is the aspiration for {$Q(x)$}, that it includes the observation {$y$}.
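
The bound can be checked numerically. The Python sketch below, continuing the invented numbers, confirms that {$-F[Q,y]\leq \textrm{ln}P(y)$} for an arbitrary {$Q$}, with equality when {$Q(x)$} is the true posterior {$P(x|y)$}, in which case the free energy is exactly {$-\textrm{ln}P(y)$}.

    import numpy as np

    P_x = np.array([0.3, 0.7])
    P_y_given_x = np.array([[0.8, 0.2], [0.1, 0.9]])
    y = 0
    P_xy = P_x * P_y_given_x[:, y]       # joint P(x,y) at the observed y
    ln_P_y = np.log(P_xy.sum())          # log model evidence

    def neg_F(Q):
        # -F[Q,y] = E_Q[ln P(x,y) - ln Q(x)]
        return np.sum(Q * (np.log(P_xy) - np.log(Q)))

    Q_arbitrary = np.array([0.6, 0.4])
    Q_posterior = P_xy / P_xy.sum()      # the true posterior P(x|y)

    print(neg_F(Q_arbitrary) <= ln_P_y)            # True, by Jensen's inequality
    print(np.isclose(neg_F(Q_posterior), ln_P_y))  # True, the bound is tight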

KL Divergence

  • KL Divergence is also known as (Shannon) Relative Entropy.
  • KL Divergence characterizes information gain when comparing statistical models of inference.
  • KL Divergence expresses the expected excess surprisal from using {$Q$} as a model instead of {$P$} when the actual distribution is {$P$}.
  • KL Divergence is the excess entropy. It is the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution {$Q$} is used, compared to using a code based on the true distribution {$P$}.
  • KL Divergence is a measure of how much a model probability distribution {$Q$} is different from a true probability distribution {$P$}.
  • {$$D_{KL}(P\parallel Q)=\sum_{x\in X}P(x)\textrm{log}\frac{P(x)}{Q(x)}$$}
  • Minimize the Kullback-Leibler divergence ('distance') between {$Q$} and true Posterior {$P$} by changing the parameters {$\phi$}.
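
A minimal Python sketch of the definition, with invented distributions, shows that the divergence is nonnegative, is zero only when the distributions coincide, and is not symmetric (hence 'distance' in scare quotes).

    import numpy as np

    def kl(P, Q):
        # D_KL(P || Q) = sum_x P(x) ln(P(x)/Q(x))
        return np.sum(P * np.log(P / Q))

    P = np.array([0.774, 0.226])   # stand-in for a true distribution
    Q = np.array([0.6, 0.4])       # stand-in for a model distribution

    print(kl(P, Q))   # > 0: expected excess surprisal from using Q for P
    print(kl(Q, P))   # differs from kl(P, Q): not symmetric
    print(kl(P, P))   # 0.0: no divergence of a distribution from itself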

Notation: {$u=y$} is the evidence and {$v=x$} is the cause. Bayes's theorem lets us calculate the posterior belief: {$P(v|u)=\frac{P(u|v)P(v)}{P(u)}$}

Given an observation {$y$} and a cause {$x$}, we want to minimize the divergence of the conceptual model {$Q(x|y)$} from the true posterior {$P(x|y)$} given the sensory data.

{$D_{KL}[Q(x|y)\parallel P(x|y)]=\sum_x Q(x|y)\textrm{ln} \left [\frac{Q(x|y)}{P(x|y)} \right ]$} (definition of KL divergence)
{$=\mathbb{E}_{Q(x|y)}[\textrm{ln}Q(x|y)-\textrm{ln}P(x|y)]$} (logarithm product rule, definition of expected value)
{$=\mathbb{E}_{Q(x|y)}[\textrm{ln}Q(x|y)-\textrm{ln}P(y|x)-\textrm{ln}P(x)+\textrm{ln}P(y)]$} (Bayes's rule)
{$=\mathbb{E}_{Q(x|y)}[\textrm{ln}Q(x|y)-\textrm{ln}P(x)]+\mathbb{E}_{Q(x|y)}[-\textrm{ln}P(y|x)]+\textrm{ln}P(y)$} (reorganize: deviation plus prediction error)
{$=D_{KL}[Q(x|y)\parallel P(x)]+\mathbb{E}_{Q(x|y)}[-\textrm{ln}P(y|x)]+\textrm{ln}P(y)$} (definition of KL divergence; {$\textrm{ln}P(y)$} does not depend on {$x$})

Note that {$\textrm{ln}P(y)$} is constant with respect to {$x$}.

The free energy is {$D_{KL}[Q(x|y)\parallel P(x)]+\mathbb{E}_Q[-\textrm{ln}P(y|x)]$}. It combines the conceptual discrepancy and the sensory discrepancy.

Conceptual discrepancy

  • {$D_{KL}[Q(v|u)\parallel P(v)]$} {$KL$}-divergence of prior and posterior ... the internal discrepancy in the questioning mind

Sensory discrepancy

  • {$\mathbb{E}_Q[-\textrm{ln}P(u|v)]$} how surprising is the sensory data ... the external discrepancy in the answering mind
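
As a sketch of minimizing free energy by adjusting parameters {$\phi$}, the Python below parameterizes {$Q$} by a single number {$\phi=Q(\textrm{frog})$} and minimizes the sum of the conceptual and sensory discrepancies by a simple grid search (a stand-in for gradient descent); the numbers continue the invented example. The minimizer lands on the true posterior, as Bayes's theorem demands.

    import numpy as np

    P_x = np.array([0.3, 0.7])                        # prior P(x)
    P_y_given_x = np.array([[0.8, 0.2], [0.1, 0.9]])  # likelihood P(y|x)
    y = 0                                             # observe "jumping"

    def free_energy(phi):
        Q = np.array([phi, 1 - phi])                  # Q parameterized by phi = Q(frog)
        conceptual = np.sum(Q * np.log(Q / P_x))      # D_KL[Q || P(x)]
        sensory = np.sum(Q * -np.log(P_y_given_x[:, y]))  # E_Q[-ln P(y|x)]
        return conceptual + sensory

    phis = np.linspace(0.01, 0.99, 99)
    best_phi = phis[np.argmin([free_energy(p) for p in phis])]
    print(best_phi)                                   # ~0.77

    P_xy = P_x * P_y_given_x[:, y]
    print(P_xy[0] / P_xy.sum())                       # true posterior ~0.774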

{$p_\nu = \frac{e^{-\beta E_\nu}}{Z}$} probability of being in the energy state {$E_\nu$} (Boltzmann distribution)

{$-\textrm{ln}P(x,y)$} energy of the explanation

{$\sum_x Q(x|y)[-\textrm{ln}P(x,y)]$} average energy

{$\sum_x -Q(x|y)\textrm{ln}Q(x|y)$} entropy
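
A short numeric sketch, continuing the invented numbers, ties these together: it computes the energy of each explanation, the average energy and entropy under {$Q$}, and shows that the Boltzmann distribution with {$\beta=1$} over these energies is the true posterior.

    import numpy as np

    P_x = np.array([0.3, 0.7])
    P_y_given_x = np.array([[0.8, 0.2], [0.1, 0.9]])
    y = 0
    P_xy = P_x * P_y_given_x[:, y]        # joint P(x,y) at the observed y
    Q = np.array([0.6, 0.4])              # approximate posterior Q(x|y)

    E = -np.log(P_xy)                     # energy of each explanation x
    avg_energy = np.sum(Q * E)            # average energy under Q
    entropy = -np.sum(Q * np.log(Q))      # entropy of Q
    print(avg_energy - entropy)           # the free energy F[Q,y]

    # Boltzmann distribution with beta = 1: Z = sum_x e^{-E} = P(y),
    # so e^{-E}/Z recovers the true posterior P(x|y).
    Z = np.sum(np.exp(-E))
    print(np.exp(-E) / Z)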

Free energy

Andrius: I am studying the different kinds of energy so that I could understand free energy, which is basic for Active Inference.

Internal energy {$U$} has to do with energy on the microscopic scale, the kinetic energy and potential energy of particles.

Heat {$Q$} has to do with energy transfer on a macroscopic scale across a boundary.

See also: Active Inference at Math 4 Wisdom