EM algorithm is, intentionally or unintentionally, widely used in various areas. For example, the common method for fitting GMM models for clustering is exactly EM. Baum-Welch algorithm for HMM is another example. However, there is not yet an article that points out that the famous policy iteration algorithm in reinforcement learning is also a kind of EM.

The basic form of EM algorithm is given some observed data $X$, a set of unobserved random variable $Z$, and a likelihood function $L(θ; X, Z) = P(X, Z|θ)$, the $θ$ and $Z$ that maximizing the marginal likelihood of the observed data $L(θ; X)$ and be found by iteratively applying the two steps:

**E step: **$Q(θ|θ^{(t)}) = \operatorname E_{Z|X,θ^{(t)}}(\log L(θ; X,Z))$

**M step: **$θ^{(t+1)} = \mathop{argmax}\limits_θ Q(θ|θ^{(t)})$

EM algorithm can be seen of as a special case of coordinate ascent. It is guaranteed to converge in a local maximum.

TODO.

To show this, let $V_{s;Ω_θ}$ be the random variable of the discounting reward when sample actions using $Ω_θ(s)$ from state $s$. The problem is to infer $V_{s;Ω_θ}$ and $θ$, so that the expected reward $\operatorname E(V_{s_0;Ω_θ})$ is maximized. $s_0$ is the initial state. Don't worry if you do not have one, just add the pseudo state which have only one action and may transfer to any other state with no immediate reward.

First we write the process in math:

$\begin{aligned} a &\sim Ω_θ(s) \\ s' &\sim P(s'| s, a) \\ V_{s;Ω_θ} &= R(s'| s, a) + γV_{s';Ω_θ} \end{aligned}$Directly apply the EM algorithm, we get:

**E step: **$Q(θ|θ^{(t)}) = \operatorname E(V_{s_0;Ω_θ})$

**M step: **$θ^{(t+1)} = \mathop{argmax}\limits_θ Q(θ|θ^{(t)})$

To solve the E step, we can expand it using the Law of the unconscious statistician:

$\newcommand\E{\operatorname E} \begin{aligned} \E(V_{s;Ω_θ}) &= \E_{a,s'|s;Ω_θ}[R(s'| s, a) + γV_{s';Ω_θ}] \\ &= \sum_{s', a}P(a | s; Ω_θ)P(s'| s, a)[R(s'| s, a) + γV_{s';Ω_θ}] \end{aligned}$This is exactly the policy evaluation, which is a linear equation that usually be solved by iteration. And the M step, of course, corresponds to the policy improvement. The exact $θ$ we choose to parameterize $Ω$ is not important, since we know after the M step, $Ω_θ(s) = \mathop{argmax}_a \operatorname E(V_{s;Ω_θ} | a)$, i.e., the greedy policy.