Question 3 [25 marks]
Suppose that the agent steps through the state space in the order given in the diagram below (i.e., going from s1 to s2 to s3 to s4 to s5), each time doing a "right" action.
Note that in this figure, the numbers represent the order in which the robot visited the states. You can assume that this is the first time the robot has visited any of these states.
- [5 marks] Suppose a monster did not appear at any time during any of these experiences. Which Q-values are updated by Q-learning based on this experience? Explain what values they are assigned. You should assume that alpha_k = 1/k.
- [10 marks] Suppose that, at some later time, the robot revisits the same states, s1 to s2 to s3 to s4 to s5, and has not visited any of these states in between (i.e., this is the second time visiting each of these states). Suppose that this time the monster appears, so that the robot gets a penalty. Which Q-values change? What are their new values?
- [5 marks] Give a qualitative description of how SARSA(lambda) would differ given the experiences of part (a). That is, which Q-values get changed, and when? You don't have to say what values they are changed to.
- [5 marks] In Assignment 3, you investigated using alpha_k = 10.0/(9.0 + k). Explain why it has this form (e.g., why 9 + k in the denominator?). Why does it work so well?
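To make the updates in the parts above concrete, here is a minimal Python sketch of tabular Q-learning that keeps a separate step-size counter k for each state-action pair, so alpha_k depends on how many times that particular pair has been updated. Everything specific in it is an assumption for illustration, not taken from the question: the hypothetical class TabularQLearner and helper replay, the action set, the discount gamma = 0.9, the penalty of -10, Q-values initialized to 0, and modelling the walk as four transitions (s1 -> s2, ..., s4 -> s5) with the penalty arriving on the last one.

```python
from collections import defaultdict


def alpha_inverse(k):
    """alpha_k = 1/k: after k updates, Q(s, a) is the plain average of the k targets."""
    return 1.0 / k


def alpha_ten_over_nine_plus_k(k):
    """alpha_k = 10/(9 + k): also 1 at k = 1, but larger than 1/k for every k > 1."""
    return 10.0 / (9.0 + k)


class TabularQLearner:
    """Hypothetical helper: tabular Q-learning with a visit count k per (state, action)."""

    def __init__(self, actions, gamma, alpha_fn):
        self.actions = list(actions)
        self.gamma = gamma
        self.alpha_fn = alpha_fn
        self.q = defaultdict(float)     # Q[(s, a)]; assumption: every entry starts at 0
        self.visits = defaultdict(int)  # k = number of updates made to (s, a) so far

    def update(self, s, a, r, s_next):
        """Q[s,a] <- Q[s,a] + alpha_k * (r + gamma * max_b Q[s_next,b] - Q[s,a])."""
        self.visits[(s, a)] += 1
        alpha = self.alpha_fn(self.visits[(s, a)])
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        target = r + self.gamma * best_next
        self.q[(s, a)] += alpha * (target - self.q[(s, a)])
        return self.q[(s, a)]


def replay(learner, transitions):
    """Feed (s, a, r, s_next) tuples to the learner in order; return each updated Q-value."""
    return [(s, a, learner.update(s, a, r, s_next)) for s, a, r, s_next in transitions]


if __name__ == "__main__":
    # Placeholder values, not from the question: gamma = 0.9 and a penalty of -10.
    learner = TabularQLearner(["up", "down", "left", "right"], gamma=0.9,
                              alpha_fn=alpha_inverse)
    path = ["s1", "s2", "s3", "s4", "s5"]
    # First pass: no monster, so every reward is assumed to be 0.
    first = [(s, "right", 0.0, s_next) for s, s_next in zip(path, path[1:])]
    print(replay(learner, first))
    # Second pass: assume the monster's penalty arrives on the step s4 -> s5.
    second = [(s, "right", -10.0 if s == "s4" else 0.0, s_next)
              for s, s_next in zip(path, path[1:])]
    print(replay(learner, second))
```

Passing alpha_fn=alpha_ten_over_nine_plus_k to the same learner lets the two step-size schedules be compared on identical experience; note that both give alpha_1 = 1, so the first visit to a pair simply overwrites its initial value.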