9.5.4 Policy Iteration
Policy iteration starts with a policy and iteratively improves it. It starts with an arbitrary policy π0 (an approximation to the optimal policy works best) and carries out the following steps starting from i=0.
- Policy evaluation: determine Vπi(S). The definition of Vπ is a set of |S| linear equations in |S| unknowns. The unknowns are the values of Vπi(S). There is one equation for each state. These equations can be solved by a linear-equation solution method (such as Gaussian elimination) or they can be solved iteratively (see the sketch after this list).
- Policy improvement: choose πi+1(s) = argmax_a Qπi(s,a), where the Q-value can be obtained from V using Equation (9.5.1). To detect when the algorithm has converged, it should only change the policy if the new action for some state improves the expected value; that is, it should set πi+1(s) to πi(s) whenever πi(s) is one of the actions that maximizes Qπi(s,a).
- Stop if there is no change in the policy, that is, if πi+1 = πi; otherwise increment i and repeat.
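For the policy evaluation step, the linear equations can be written in matrix form as (I − γPπ)Vπ = rπ, where Pπ(s,s') = P(s'|s,π(s)) and rπ(s) is the expected immediate reward under π. The following is a minimal NumPy sketch of solving this system directly; it is not the book's code, and it assumes (for illustration) that the transition function is stored as an array P[s,a,s'], the rewards as R[s,a,s'], and the policy as an array pi of action indices.

    import numpy as np

    def evaluate_policy(P, R, pi, gamma):
        # Policy evaluation by solving the |S| linear equations exactly.
        # Assumed layout: P[s, a, s1] = P(s1|s,a), R[s, a, s1] = R(s,a,s1),
        # pi[s] = index of the action the policy selects in state s.
        n_states = P.shape[0]
        states = np.arange(n_states)
        P_pi = P[states, pi, :]                         # P_pi[s, s1] = P(s1 | s, pi(s))
        r_pi = np.sum(P_pi * R[states, pi, :], axis=1)  # expected one-step reward under pi
        # Solve (I - gamma * P_pi) V = r_pi, e.g., by Gaussian elimination.
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)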
procedure Policy_iteration(S, A, P, R)
    Inputs
        S is the set of all states
        A is the set of all actions
        P is the state transition function specifying P(s'|s,a)
        R is a reward function R(s,a,s')
    Output
        optimal policy π
    Local
        action array π[S]
        Boolean variable noChange
        real array V[S]
    set π arbitrarily
    repeat
        noChange ← true
        solve V[s] = ∑s'∈S P(s'|s,π[s]) (R(s,π[s],s') + γV[s'])
        for each s ∈ S do
            let QBest = V[s]
            for each a ∈ A do
                let Qsa = ∑s'∈S P(s'|s,a) (R(s,a,s') + γV[s'])
                if Qsa > QBest then
                    π[s] ← a
                    QBest ← Qsa
                    noChange ← false
    until noChange
    return π

Figure 9.16: Policy iteration
The algorithm is shown in Figure 9.16. Note that it only keeps the latest policy and notices if it has changed. This algorithm always halts, usually in a small number of iterations. Unfortunately, solving the set of linear equations is often time consuming.
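As a companion to the pseudocode, here is a sketch of the whole loop in Python, reusing the evaluate_policy sketch above; the array layout and names are assumptions for illustration, not the book's implementation.

    def policy_iteration(P, R, gamma):
        # Policy iteration in the style of Figure 9.16, with P and R laid out
        # as in evaluate_policy above. Returns a policy as an array of action indices.
        n_states, n_actions, _ = P.shape
        pi = np.zeros(n_states, dtype=int)          # set pi arbitrarily
        while True:
            no_change = True
            V = evaluate_policy(P, R, pi, gamma)    # policy evaluation
            # Q[s, a] = sum over s1 of P(s1|s,a) * (R(s,a,s1) + gamma * V(s1))
            Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
            for s in range(n_states):
                q_best = V[s]
                for a in range(n_actions):
                    if Q[s, a] > q_best:            # change only on strict improvement
                        pi[s] = a
                        q_best = Q[s, a]
                        no_change = False
            if no_change:
                return pi

The strict-improvement test mirrors the convergence check described above: the policy is left unchanged in a state when its current action is already one of the maximizers.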
A variant of policy iteration, called modified policy iteration, is obtained by noticing that the agent does not need to evaluate the policy exactly in order to improve it; it can carry out a number of backup steps (using Equation (9.5.1)) and then do an improvement.
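A minimal sketch of this variant, under the same assumed array layout: the exact solve is replaced by a small, fixed number of backups of a value estimate that is carried over between improvement steps.

    def modified_policy_evaluation(P, R, pi, gamma, V, k=5):
        # Approximate policy evaluation: k backup sweeps instead of an exact solve.
        # V is the estimate carried over from the previous iteration; k = 5 is an
        # illustrative choice, not a value from the text.
        n_states = P.shape[0]
        states = np.arange(n_states)
        P_pi = P[states, pi, :]
        r_pi = np.sum(P_pi * R[states, pi, :], axis=1)
        for _ in range(k):
            V = r_pi + gamma * P_pi @ V             # one backup under the current policy
        return V

In the policy_iteration sketch above, the call to evaluate_policy would be replaced by this function, passing in the V from the previous iteration rather than recomputing it from scratch.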
The idea behind policy iteration is also useful for systems that are too big to be represented directly as MDPs. Suppose a controller has some parameters that can be varied. An estimate of the derivative of the cumulative discounted reward with respect to a parameter a in some context s, which corresponds to the derivative of Q(s,a), can be used to improve the parameter. Such an iteratively improving controller can get stuck in a local maximum that is not a global maximum. Policy iteration for MDPs does not result in non-optimal local maxima, because it is possible to improve an action for one state without affecting other states, whereas updating parameters can affect many states at once.