SARSA(lambda) on a grid
This applet shows how SARSA(lambda) works for a simple 10x10 grid world. The numbers in the squares show the Q-values of the square for each action. The blue arrows show the optimal action based on the current value function (when the arrows form a star, all actions are optimal). To start, press one of the four action buttons.
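The heart of the applet is the SARSA(lambda) update with eligibility traces, applied after every action. The following is a minimal Java sketch of that update; the class, field and method names are illustrative only and are not taken from SarsaCocontroller.java, and the parameter values shown are just example settings.

// A minimal sketch of SARSA(lambda) with accumulating eligibility traces.
// Names and parameter values are illustrative, not the applet's own code.
public class SarsaLambdaSketch {
    final int numStates, numActions;
    final double[][] q;   // Q-values, one per state-action pair
    final double[][] e;   // eligibility traces, same shape as q
    double alpha = 0.5;   // fixed step size (see the note on alpha below)
    double gamma = 0.9;   // discount rate
    double lambda = 0.8;  // trace-decay parameter

    SarsaLambdaSketch(int numStates, int numActions) {
        this.numStates = numStates;
        this.numActions = numActions;
        q = new double[numStates][numActions];
        e = new double[numStates][numActions];
    }

    /** One SARSA(lambda) backup for the transition (s,a) -> reward r, then (s',a'). */
    void update(int s, int a, double r, int sNext, int aNext) {
        double delta = r + gamma * q[sNext][aNext] - q[s][a];  // TD error
        e[s][a] += 1.0;                                        // accumulate trace for (s,a)
        for (int i = 0; i < numStates; i++) {
            for (int j = 0; j < numActions; j++) {
                q[i][j] += alpha * delta * e[i][j];  // credit all pairs in proportion to their trace
                e[i][j] *= gamma * lambda;           // decay every trace
            }
        }
    }
}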
In this example, there are four rewarding states (apart from the walls): one worth +10 (at position (9,8); 9 across and 8 down), one worth +3 (at position (8,3)), one worth -5 (at position (4,5)), and one worth -10 (at position (4,8)). In each of these states the agent gets the reward when it carries out an action in that state (i.e., when it leaves the state, not when it enters it). (These are the same rewards as in the value iteration applet.)
There are four actions available: up, down, left and right. If the agent carries out one of these actions, it has a 0.7 chance of going one step in the desired direction and a 0.1 chance of going one step in each of the other three directions. If it bumps into the outside wall (i.e., the square computed as above is outside the grid), there is a penalty of 1 (i.e., a reward of -1) and the agent doesn't actually move. When the agent acts in one of the states with positive reward, it is flung, at random, to one of the four corners of the grid world, no matter what action it carries out.
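The Java sketch below illustrates these dynamics under some assumptions: positions are taken as (x across, y down) with the coordinates given in the text, the reward for leaving a state and the wall-bump penalty are simply summed when both occur on the same step, and all names (GridEnvSketch, step, rewardAt) are invented for illustration rather than taken from Q_Env.java.

// A sketch of the stochastic grid dynamics described above; coordinate
// convention, reward/penalty combination, and all names are assumptions.
import java.util.Random;

public class GridEnvSketch {
    static final int SIZE = 10;
    final Random rng = new Random();
    int x = 0, y = 0;  // current agent position

    /** Carry out action 0..3 (up, down, left, right); returns the reward received. */
    double step(int action) {
        // Reward is received for acting in a rewarding state (on leaving it, not entering).
        double reward = rewardAt(x, y);
        if (reward > 0) {
            // Positive-reward states fling the agent to a random corner, whatever the action.
            x = rng.nextBoolean() ? 0 : SIZE - 1;
            y = rng.nextBoolean() ? 0 : SIZE - 1;
            return reward;
        }
        // Probability 0.7 of the intended direction, 0.1 for each of the other three.
        int actual = action;
        if (rng.nextDouble() >= 0.7) {
            do { actual = rng.nextInt(4); } while (actual == action);
        }
        int nx = x + (actual == 2 ? -1 : actual == 3 ? 1 : 0);
        int ny = y + (actual == 0 ? -1 : actual == 1 ? 1 : 0);
        if (nx < 0 || nx >= SIZE || ny < 0 || ny >= SIZE) {
            return reward - 1.0;   // bumped the outside wall: penalty of 1, no move
        }
        x = nx;
        y = ny;
        return reward;
    }

    /** Rewards for the four special states, at the positions given above. */
    static double rewardAt(int x, int y) {
        if (x == 9 && y == 8) return 10.0;
        if (x == 8 && y == 3) return 3.0;
        if (x == 4 && y == 5) return -5.0;
        if (x == 4 && y == 8) return -10.0;
        return 0.0;
    }
}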
The initial discount rate is 0.9. It is interesting to try the learning at different discount rates (using the "Increment Discount" and "Decrement Discount" buttons, or just typing in the value).
You can control the agent yourself (using the up, left, right, down buttons) or you can step the agent a number of times. The agent acts greedily the percentage of the time specified and acts randomly the rest of the time.
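A minimal sketch of that action-selection rule is given below, assuming a single "greedy probability" parameter. The name GreedyPolicySketch and the 0.8 setting are illustrative, not the applet's actual code or default.

// Acts greedily with the given probability, otherwise picks an action at random.
import java.util.Random;

public class GreedyPolicySketch {
    final Random rng = new Random();
    double greedyProb = 0.8;  // e.g. an "80% greedy" setting; an example value only

    /** Choose an action given the Q-values for the current state. */
    int chooseAction(double[] qForState) {
        if (rng.nextDouble() >= greedyProb) {
            return rng.nextInt(qForState.length);   // explore: uniformly random action
        }
        int best = 0;                               // exploit: an action with maximal Q-value
        for (int a = 1; a < qForState.length; a++) {
            if (qForState[a] > qForState[best]) best = a;
        }
        return best;
    }
}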
Alpha (the step size) is always a fixed value; it is not computed from the visit counts.
Reset initializes all Q-values to the given "Initial Value".
The commands "Brighter" and "Dimmer" change the contrast (the mapping between non-extreme values and colour). "Grow" and "Shrink" change the size of the grid.
You can get the code: SarsaGUI.java (the GUI), SarsaCocontroller.java (the SARSA(lambda) learning core), Q_Env.java (the environment), and the javadoc. This applet comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions; see the code for more details. Copyright © David Poole, 2003, 2004. All rights reserved.