An EM view of policy iteration

EM algorithm is, intentionally or unintentionally, widely used in various areas. For example, the common method for fitting GMM models for clustering is exactly EM. Baum-Welch algorithm for HMM is another example. However, there is not yet an article that points out that the famous policy iteration algorithm in reinforcement learning is also a kind of EM.

Expectation-Maximization (EM) algorithm

The basic form of EM algorithm is given some observed data XX, a set of unobserved random variable ZZ, and a likelihood function L(θ;X,Z)=P(X,Zθ)L(θ; X, Z) = P(X, Z|θ), the θθ and ZZ that maximizing the marginal likelihood of the observed data L(θ;X)L(θ; X) and be found by iteratively applying the two steps:

E step: Q(θθ(t))=EZX,θ(t)(logL(θ;X,Z))Q(θ|θ^{(t)}) = \operatorname E_{Z|X,θ^{(t)}}(\log L(θ; X,Z))

M step: θ(t+1)=argmaxθQ(θθ(t))θ^{(t+1)} = \mathop{argmax}\limits_θ Q(θ|θ^{(t)})

EM algorithm can be seen of as a special case of coordinate ascent. It is guaranteed to converge in a local maximum.

Markov decision process (MDP) and policy iteration (PI)


PI is a kind of EM

To show this, let Vs;ΩθV_{s;Ω_θ} be the random variable of the discounting reward when sample actions using Ωθ(s)Ω_θ(s) from state ss. The problem is to infer Vs;ΩθV_{s;Ω_θ} and θθ, so that the expected reward E(Vs0;Ωθ)\operatorname E(V_{s_0;Ω_θ}) is maximized. s0s_0 is the initial state. Don't worry if you do not have one, just add the pseudo state which have only one action and may transfer to any other state with no immediate reward.

First we write the process in math:

aΩθ(s)sP(ss,a)Vs;Ωθ=R(ss,a)+γVs;Ωθ\begin{aligned} a &\sim Ω_θ(s) \\ s' &\sim P(s'| s, a) \\ V_{s;Ω_θ} &= R(s'| s, a) + γV_{s';Ω_θ} \end{aligned}

Directly apply the EM algorithm, we get:

E step: Q(θθ(t))=E(Vs0;Ωθ)Q(θ|θ^{(t)}) = \operatorname E(V_{s_0;Ω_θ})

M step: θ(t+1)=argmaxθQ(θθ(t))θ^{(t+1)} = \mathop{argmax}\limits_θ Q(θ|θ^{(t)})

To solve the E step, we can expand it using the Law of the unconscious statistician:

E(Vs;Ωθ)=Ea,ss;Ωθ[R(ss,a)+γVs;Ωθ]=s,aP(as;Ωθ)P(ss,a)[R(ss,a)+γVs;Ωθ]\newcommand\E{\operatorname E} \begin{aligned} \E(V_{s;Ω_θ}) &= \E_{a,s'|s;Ω_θ}[R(s'| s, a) + γV_{s';Ω_θ}] \\ &= \sum_{s', a}P(a | s; Ω_θ)P(s'| s, a)[R(s'| s, a) + γV_{s';Ω_θ}] \end{aligned}

This is exactly the policy evaluation, which is a linear equation that usually be solved by iteration. And the M step, of course, corresponds to the policy improvement. The exact θθ we choose to parameterize ΩΩ is not important, since we know after the M step, Ωθ(s)=argmaxaE(Vs;Ωθa)Ω_θ(s) = \mathop{argmax}_a \operatorname E(V_{s;Ω_θ} | a), i.e., the greedy policy.