Question 3 [25 marks]

Suppose that the agent steps through the state space in the order given in the diagram below (i.e., going from s1 to s2 to s3 to s4 to s5), each time doing a "right" action.

[Figure: Robot movements]

Note that in this figure, the numbers represent the order in which the robot visited the states. You can assume that this is the first time the robot has visited any of these states.

(a) [5 marks] Suppose a monster did not appear at any time during any of these experiences. What Q-values are updated during Q-learning based on these experiences? Explain what values they are assigned. You should assume that alpha_k=1/k.
(b) [10 marks] Suppose that, at some later time, the robot revisits the same states, s1 to s2 to s3 to s4 to s5, and has not visited any of these states in between (i.e., this is the second time it has visited each of these states). Suppose that this time the monster appears, so the robot receives a penalty. Which Q-values have their values changed? What are their new values?
(c) [5 marks] Give a qualitative description of how SARSA(lambda) would differ, given the experiences of part (a). That is, which Q-values get changed, and when? You do not have to say what values they are changed to.
(d) [5 marks] In assignment 3, you investigated using alpha_k=10.0/(9.0+k). Explain why it has this form (e.g., why 9+k in the denominator?). Why does it work so well?
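
A minimal sketch of the tabular Q-learning update that parts (a) and (b) refer to, with alpha_k = 1/k counted separately for each state-action pair. Several details are assumptions for illustration only and are not given in the question: Q-values start at 0, an ordinary step has reward 0, the monster penalty is a hypothetical -10, the discount factor is a hypothetical gamma = 0.9, the action set is up/down/left/right, only the four transitions s1 -> s2 through s4 -> s5 are modeled, and in the second pass the monster is assumed to appear on the last step.

```python
from collections import defaultdict

# Hypothetical action set; the question itself only uses "right".
ACTIONS = ("up", "down", "left", "right")


def q_learning_pass(Q, counts, transitions, gamma):
    """Apply one tabular Q-learning update per (s, a, r, s') transition.

    Uses alpha_k = 1/k, where k is the number of times the pair (s, a)
    has been updated, including this update. With k = 1, alpha = 1, so
    the old value is overwritten by the target.
    """
    for s, a, r, s_next in transitions:
        counts[(s, a)] += 1
        alpha = 1.0 / counts[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])


def path(rewards):
    """Transitions s1 -> s2 -> ... -> s5, doing "right" each time."""
    states = ["s1", "s2", "s3", "s4", "s5"]
    return [(states[i], "right", rewards[i], states[i + 1]) for i in range(4)]


Q = defaultdict(float)     # unseen Q-values start at 0 (assumption)
counts = defaultdict(int)  # visit count k for each (s, a) pair
GAMMA = 0.9                # hypothetical discount factor
PENALTY = -10.0            # hypothetical monster penalty

# First pass: no monster, so every reward is 0.
q_learning_pass(Q, counts, path([0.0, 0.0, 0.0, 0.0]), GAMMA)
after_first = dict(Q)

# Second pass: assume the monster appears on the last step only.
q_learning_pass(Q, counts, path([0.0, 0.0, 0.0, PENALTY]), GAMMA)
```

Under these assumptions, the first pass updates Q[s1, right] through Q[s4, right] but leaves each at 0. In the second pass only Q[s4, right] changes, to 0 + (1/2)(-10 + 0.9 * 0 - 0) = -5, because each earlier update looks ahead to a Q-value that is still 0 at the moment it is computed. Different reward values or a different monster step would change these numbers.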

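A small sketch comparing the two step-size schedules mentioned in the question, alpha_k = 1/k and alpha_k = 10/(9+k). For an estimate updated by v <- v + alpha_k (x_k - v), it computes the weight each of n samples carries in the final estimate. The function names and the choice n = 20 are illustrative, not taken from the assignment.

```python
def alpha_inverse(k):
    """alpha_k = 1/k: the estimate is the plain running average."""
    return 1.0 / k


def alpha_slow(k):
    """alpha_k = 10/(9+k): equals 1 at k = 1, then decays more slowly than 1/k."""
    return 10.0 / (9.0 + k)


def sample_weights(alpha, n):
    """Weight each of the n samples receives in the estimate after n updates.

    With v <- v + alpha_k * (x_k - v), sample i ends up with weight
    alpha_i * prod_{j > i} (1 - alpha_j).
    """
    weights = []
    for i in range(1, n + 1):
        w = alpha(i)
        for j in range(i + 1, n + 1):
            w *= 1.0 - alpha(j)
        weights.append(w)
    return weights


n = 20
w_avg = sample_weights(alpha_inverse, n)
w_slow = sample_weights(alpha_slow, n)
```

Both schedules give alpha_1 = 1, so the first sample completely replaces the arbitrary initial value. With 1/k every sample ends up with weight 1/n; with 10/(9+k) the weights grow with i, so later samples count for more. One common explanation of why this helps in Q-learning is that later targets are computed from more accurate estimates of the next state's value, while the step size still decays, so the sum of alpha_k diverges and the sum of alpha_k squared converges, as with 1/k.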