
Question 2 [10 marks]

Consider the game domain of Assignment 2.

[Figure: Grid Game World, showing the 5x5 grid with circles at the corners and monster squares marked with X]

The robot can be at one of the 25 locations on the grid. There can be a treasure on one of the circles at the corners. When the robot reaches the corner where the treasure is, it collects a reward of 10, and the treasure disappears. When there is no treasure, at each time step there is a probability p1=0.2 that a treasure appears, and it appears with equal probability at each corner. The robot knows its position and where the treasure is.
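Since the treasure appears at each corner with equal probability, the per-corner appearance probability follows directly. A quick check in plain Python (values taken from the text above):

```python
p1 = 0.2          # probability a treasure appears somewhere this step
n_corners = 4     # the treasure can appear at any of the 4 corners

# Probability the treasure appears at one particular corner this step
p_per_corner = p1 / n_corners  # 0.05
```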

There are monsters at the squares marked with an X. Each monster, randomly and independently at each time step, checks whether the robot is on its square. If the robot is on the square when the monster checks, the robot receives a reward of -10 (i.e., it loses 10 points). At the centre square the monster checks with probability p2=0.4 at each time step; at each of the other 4 squares marked with an X, the monster checks with probability p3=0.2.
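The expected one-step penalty for standing on a monster square is just the check probability times the penalty. A small check in plain Python (probabilities from the text above):

```python
penalty = -10.0      # reward when a monster's check catches the robot
p2, p3 = 0.4, 0.2    # check probabilities: centre square, other X squares

# Expected per-step reward from the monster on each kind of square
exp_centre = p2 * penalty  # centre square
exp_other = p3 * penalty   # each of the other 4 X squares
```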

The robot has 8 actions corresponding to the 8 neighbouring squares. The diagonal moves are noisy: there is a p4=0.6 probability of going in the chosen direction and an equal chance of going to each of the 4 neighbouring squares that are closest to the desired direction. The vertical and horizontal moves are also noisy: there is a p5=0.8 probability of going in the requested direction and an equal chance of going to each of the two adjacent diagonal squares. For example, the actions up-left and up have the following results:

[Figure: resulting move distributions for the actions up-left and up]

If the action would result in crashing into a wall, the robot receives a reward of -2 (i.e., it loses 2 points) and does not move.
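The straight-move noise and the wall rule together determine a distribution over (next square, movement reward) pairs. A minimal sketch for the action "up" only, in plain Python; the 5x5 grid indexing and the function name are assumptions, not the assignment's code:

```python
GRID = 5  # assumed 5x5 grid, squares indexed (row, col) from (0, 0)

def move_outcomes_up(r, c):
    """Distribution over ((row, col), movement_reward) for action "up".

    Per the noise model above: probability 0.8 of moving up, and 0.1
    for each of the two adjacent diagonals (up-left, up-right). A move
    that would leave the grid keeps the robot in place with reward -2.
    Treasure and monster rewards would be added separately.
    """
    out = {}
    for (dr, dc), p in {(-1, 0): 0.8, (-1, -1): 0.1, (-1, 1): 0.1}.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < GRID and 0 <= nc < GRID:
            key = ((nr, nc), 0.0)       # move succeeds, no wall penalty
        else:
            key = ((r, c), -2.0)        # crash into wall: stay, lose 2
        out[key] = out.get(key, 0.0) + p
    return out
```

From an interior square the three outcomes are distinct; from the top row all three crash, so the probabilities collapse onto staying in place with reward -2.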

There is a discount factor of p6=0.9.

Assume that the rewards are immediate on entering a state (i.e., if the robot enters a state where there is a monster, it gets the (negative) reward on entering the state, and if the robot enters the state where there is a treasure it gets the reward on entering the state, even if the treasure arrives at the same time). [Note that there are other possible ways to model the game, but for this exam you are to assume these dynamics and reward model.]
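With the dynamics and reward model fixed, the value-iteration backup has the standard form Q[s,a] = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V[s']). A generic sketch in plain Python; the transition-table encoding is an assumption for illustration, not the assignment's code:

```python
GAMMA = 0.9  # discount factor p6

def q_backup(s, a, V, transitions, gamma=GAMMA):
    """One value-iteration backup:
    Q[s, a] = sum over s' of P(s' | s, a) * (R(s, a, s') + gamma * V[s'])

    `transitions[(s, a)]` is assumed to be a list of
    (prob, next_state, reward) triples, with the reward received on
    entering the next state, matching the model above.
    """
    return sum(p * (r + gamma * V[s2])
               for p, s2, r in transitions[(s, a)])
```

For a state such as s13 with the action "up", the triples would enumerate the possible next squares under the movement noise, each paired with any monster or treasure reward for the square entered.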

  1. [10 marks]

    Suppose we are using the inefficient state-space representation with 125 states (25 robot locations times 5 treasure configurations: the treasure at one of the 4 corners, or no treasure).

    Suppose we are running value iteration, and have the following values for each state [the numbers in the squares represent the values of those states; empty squares have a value of zero; how these values were obtained is irrelevant to this question]:

    [Figure: grid of current state values, with state s13 marked by *]

    where the treasure is at the circled corner. There are also states for the treasure at each of the other four corners.

    Consider the next step of value iteration. For state s13, which is marked by * in the above figure, and the action a2, which is "up", what value is assigned to Q[s13,a2] on the next iteration of value iteration? You need to show all your working, but you don't need to do any arithmetic (i.e., you may leave the answer as an expression). Explain each term in your expression.

