9.5.2 Value of an Optimal Policy
Let Q*(s,a), where s is a state and a is an action, be the expected value of doing a in state s and then following an optimal policy. Let V*(s) be the expected value of following an optimal policy from state s.
Q* can be defined analogously to Q^π:

Q*(s,a) = ∑_{s'} P(s'|s,a) (R(s,a,s') + γV*(s')).
V*(s) is obtained by performing the action that gives the best value in each state:
V*(s) = max_a Q*(s,a).
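Substituting the second equation into the first gives the Bellman optimality equation, V*(s) = max_a ∑_{s'} P(s'|s,a) (R(s,a,s') + γV*(s')), which can be solved by repeatedly applying it as an update. The Python sketch below does exactly that for a tiny two-state MDP; the states, actions, transition probabilities, rewards, and discount are made-up illustrative values, not an example from the text.

```python
# Value-iteration sketch: repeatedly apply the Bellman optimality update
#   V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
# on a hypothetical two-state MDP.

states = ["healthy", "sick"]
actions = ["relax", "party"]

# P[(s, a)][s'] = P(s'|s,a); probabilities are made up for illustration.
P = {
    ("healthy", "relax"): {"healthy": 0.95, "sick": 0.05},
    ("healthy", "party"): {"healthy": 0.7,  "sick": 0.3},
    ("sick",    "relax"): {"healthy": 0.5,  "sick": 0.5},
    ("sick",    "party"): {"healthy": 0.1,  "sick": 0.9},
}

# R[(s, a, s')] = reward for that transition; values are made up.
R = {
    ("healthy", "relax", "healthy"): 7,  ("healthy", "relax", "sick"): 7,
    ("healthy", "party", "healthy"): 10, ("healthy", "party", "sick"): 10,
    ("sick",    "relax", "healthy"): 0,  ("sick",    "relax", "sick"): 0,
    ("sick",    "party", "healthy"): 2,  ("sick",    "party", "sick"): 2,
}

gamma = 0.8  # discount factor

def q_value(V, s, a):
    """Q(s,a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2])
               for s2, p in P[(s, a)].items())

V = {s: 0.0 for s in states}
for _ in range(100):  # enough iterations to approach the fixed point
    V = {s: max(q_value(V, s, a) for a in actions) for s in states}

print(V)  # approximation of V* for each state
```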
An optimal policy π* is one of the policies that gives the best value for each state:
π*(s) = argmax_a Q*(s,a).
Note that argmax_a Q*(s,a) is a function of the state s, and its value is one of the actions a that maximizes Q*(s,a).
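As a small self-contained illustration of this argmax, the sketch below extracts a greedy policy from a table of Q* values; the numbers in Q are hypothetical (e.g., the output of a computation like the one above).

```python
# Policy extraction: pi*(s) = argmax_a Q*(s,a).
# Q holds hypothetical, already-computed Q* values keyed by (state, action).
Q = {
    ("healthy", "relax"): 35.7, ("healthy", "party"): 36.1,
    ("sick",    "relax"): 20.3, ("sick",    "party"): 18.9,
}

def optimal_action(s):
    """Return one of the actions maximizing Q*(s,a); ties are broken arbitrarily."""
    candidates = [a for (s2, a) in Q if s2 == s]
    return max(candidates, key=lambda a: Q[(s, a)])

pi_star = {s: optimal_action(s) for s in {s for (s, _) in Q}}
print(pi_star)  # e.g. {'healthy': 'party', 'sick': 'relax'}
```

Because any action achieving the maximum is optimal, an MDP can have more than one optimal policy, even though V* and Q* are unique.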