EM algorithm is, intentionally or unintentionally, widely used in various areas. For example, the common method for fitting GMM models for clustering is exactly EM. Baum-Welch algorithm for HMM is another example. However, there is not yet an article that points out that the famous policy iteration algorithm in reinforcement learning is also a kind of EM.
The basic form of EM algorithm is given some observed data , a set of unobserved random variable , and a likelihood function , the and that maximizing the marginal likelihood of the observed data and be found by iteratively applying the two steps:
EM algorithm can be seen of as a special case of coordinate ascent. It is guaranteed to converge in a local maximum.
To show this, let be the random variable of the discounting reward when sample actions using from state . The problem is to infer and , so that the expected reward is maximized. is the initial state. Don't worry if you do not have one, just add the pseudo state which have only one action and may transfer to any other state with no immediate reward.
First we write the process in math:
Directly apply the EM algorithm, we get:
To solve the E step, we can expand it using the Law of the unconscious statistician:
This is exactly the policy evaluation, which is a linear equation that usually be solved by iteration. And the M step, of course, corresponds to the policy improvement. The exact we choose to parameterize is not important, since we know after the M step, , i.e., the greedy policy.