Grid-world after Russell

An agent is situated in the environment shown in Figure 17.1. It must execute a sequence of actions; the environment terminates when the agent reaches one of the states marked +1 or -1. In each location, the available actions are called North, East, South and West. The agent knows which state it is in initially and the effects of all of its actions on the state of the world.
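The layout of Figure 17.1 is a 4x3 grid with an obstacle at (2,2), the +1 terminal at (4,3) and the -1 terminal at (4,2). A minimal Python sketch of that layout follows; the coordinate convention (column, row counted from the bottom-left, as in the book) and the helper names are mine, not part of the demo:

<code python>
# Layout of the 4x3 grid-world from AIMA Figure 17.1.
# Squares are (column, row) pairs counted from the bottom-left, as in the book.
WIDTH, HEIGHT = 4, 3
START = (1, 1)              # the agent's initial square
WALL = {(2, 2)}             # the single inaccessible square
TERMINALS = {(4, 3): +1.0,  # the +1 terminal state
             (4, 2): -1.0}  # the -1 terminal state

def is_valid(square):
    """A square is valid if it lies on the grid and is not the wall."""
    x, y = square
    return 1 <= x <= WIDTH and 1 <= y <= HEIGHT and square not in WALL
</code>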

The actions are unreliable. Each action achieves its intended effect with probability 0.8, but the rest of the time the action moves the agent at right angles to the intended direction. For example, from the start square (1,1), the action North moves the agent to (1,2) with probability 0.8; with probability 0.1 it moves East to (2,1), and with probability 0.1 it moves West, bumps into the wall, and stays in (1,1).
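Continuing the sketch above (it reuses is_valid), the 0.8/0.1/0.1 transition model for a single action can be written as plain illustrative Python; this is not the MDP Toolbox API:

<code python>
MOVES = {"North": (0, 1), "East": (1, 0), "South": (0, -1), "West": (-1, 0)}
SIDEWAYS = {"North": ("East", "West"), "South": ("East", "West"),
            "East": ("North", "South"), "West": ("North", "South")}

def successors(square, action):
    """Return {successor square: probability} for one action in one square."""
    dist = {}
    for direction, p in [(action, 0.8)] + [(d, 0.1) for d in SIDEWAYS[action]]:
        dx, dy = MOVES[direction]
        target = (square[0] + dx, square[1] + dy)
        if not is_valid(target):     # bumping into a wall leaves the agent in place
            target = square
        dist[target] = dist.get(target, 0.0) + p
    return dist

# From the start square (1,1), North gives {(1,2): 0.8, (2,1): 0.1, (1,1): 0.1}.
print(successors((1, 1), "North"))
</code>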

The utility function is as follows. Other than the terminal states, there is no indication of a state's utility, and the utility function is based on a sequence of states (an environment history) rather than on a single state. The utility of a sequence is the value of the terminal state minus 1/25 of the length of the sequence.
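Equivalently, the agent collects a reward of -1/25 = -0.04 on every step and +1 or -1 on reaching a terminal state. A tiny illustrative sketch (the helper name is mine), assuming the length of a sequence is counted as the number of moves:

<code python>
def history_utility(terminal_value, n_moves):
    """Utility of an environment history: terminal value minus 1/25 per move."""
    return terminal_value - n_moves / 25.0

# The shortest route from (1,1) to the +1 state takes 5 moves, so if every
# move succeeded its utility would be 1 - 5/25 = 0.8.
print(history_utility(+1.0, 5))
</code>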

The environment is accessible: the agent's percept at each step identifies the state it is in.

The implementation in the MDP Toolbox: the grid-world contains 12 states (plus an absorbing state). The states are numbered from the top-left corner to the bottom-right corner (thus, the square (1,1) is encoded as 3). The agent can move North, East, South or West; the actions are numbered 1, 2, 3 and 4, respectively.
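Assuming the numbering runs column by column (top to bottom, then left to right), which matches the two indices given on this page (state 3 for the start square and state 10 for the +1 field), the state numbers lie on the grid as printed by this sketch:

<code python>
N_ROWS, N_COLS = 3, 4
ACTIONS = {"North": 1, "East": 2, "South": 3, "West": 4}

# Print the state numbers as they lie on the grid, top row first.
for row in range(1, N_ROWS + 1):
    print([(col - 1) * N_ROWS + row for col in range(1, N_COLS + 1)])
# -> [1, 4, 7, 10]   top row; the +1 terminal is state 10
#    [2, 5, 8, 11]   middle row; the obstacle is state 5, the -1 terminal state 11
#    [3, 6, 9, 12]   bottom row; the start square (1,1) is state 3
</code>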

See Russell and Norvig, Artificial Intelligence: A Modern Approach, the chapter on Making Complex Decisions.

Note: the MDP Toolbox in demo_russell uses a slightly different grid indexing than the AIMA book. The field (1,1), i.e. 1 in the one-dimensional index, is the upper-left field, and the field with the +1 reward is indexed (1,4), i.e. 10.
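To translate between the two conventions, a small sketch (the function name is mine) that maps an AIMA-style square (column from the left, row from the bottom) to the demo's (row, column) field and its one-dimensional index could look like this:

<code python>
N_ROWS = 3   # the grid has 3 rows and 4 columns

def aima_to_demo(x, y):
    """Map an AIMA (column, row-from-bottom) square to the demo's
    (row-from-top, column) field and its 1-based linear index."""
    row, col = N_ROWS - y + 1, x
    return (row, col), (col - 1) * N_ROWS + row

print(aima_to_demo(1, 1))   # start square -> ((3, 1), 3)
print(aima_to_demo(4, 3))   # +1 field     -> ((1, 4), 10)
</code>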
