Grid world after Sutton

Our 5×5 grid world contains 25 states, numbered 1 to 25. The agent can move North, East, South or West; these actions are numbered 1, 2, 3 and 4, respectively. All transitions are deterministic. Most transitions give the agent zero reward, with the following exceptions. Every action taken in state 6 moves the agent directly to state 10 with a reward of 10, and every action taken in state 16 moves the agent directly to state 18 with a reward of 5. Bumping into a wall leaves the agent where it is and gives a reward of -1.
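The transition rules above can be sketched as a single function. This is a minimal illustration, not the original implementation: the names `step`, `MOVES` and `JUMPS` are hypothetical, and the layout is an assumption. The text does not say how state numbers map to cells, so the sketch assumes column-major numbering (state 1 is the top-left cell, state 5 the bottom-left cell), which places states 6, 10, 16 and 18 where the special cells sit in Sutton & Barto's example.

```python
N = 5  # grid side length

# Action codes as in the text: 1 = North, 2 = East, 3 = South, 4 = West.
# Each entry is (d_row, d_col); row 0 is assumed to be the northern edge.
MOVES = {1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1)}

# Special states: source -> (destination, reward).
JUMPS = {6: (10, 10.0), 16: (18, 5.0)}


def step(state, action):
    """Return (next_state, reward) for a 1-based state and an action code."""
    if state in JUMPS:
        return JUMPS[state]  # every action from a special state jumps
    # Assumption: column-major numbering (states 1..5 form the first column).
    col, row = divmod(state - 1, N)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < N and 0 <= new_col < N):
        return state, -1.0  # wall bump: position unchanged, reward -1
    return new_col * N + new_row + 1, 0.0
```

For example, under the assumed layout `step(1, 3)` moves South from state 1 to state 2 with zero reward, while `step(1, 1)` bumps into the northern wall.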

The task is continuing: there are no terminal states and no episodes, so the process runs forever. We use a discount factor of γ = 0.9.
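Because the process never terminates, discounting is what keeps the return finite. As a worked bound, using the standard definition of discounted return (the symbols G_t, R_t and γ are notation assumed here, not taken from the text) and the fact that every reward in this world lies between -1 and 10:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad
-\frac{1}{1-\gamma} \;\le\; G_t \;\le\; \frac{10}{1-\gamma},
\qquad
\text{so for } \gamma = 0.9:\quad -10 \le G_t \le 100 .
```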

The left figure shows the world; the right figure shows the optimal policy.
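The text does not say how the optimal policy was computed. One standard option is value iteration, sketched below under stated assumptions: column-major state numbering (state 1 top-left, state 5 bottom-left), row 0 as the northern edge, and a stopping tolerance of 1e-10. All function and variable names are hypothetical.

```python
GAMMA = 0.9  # discount factor from the text
N = 5        # grid side length

# (d_row, d_col) for action codes 1..4 = North, East, South, West;
# row 0 is assumed to be the northern edge.
MOVES = {1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1)}
JUMPS = {6: (10, 10.0), 16: (18, 5.0)}  # source -> (destination, reward)


def step(state, action):
    """Deterministic model: (next_state, reward) for 1-based state and action."""
    if state in JUMPS:
        return JUMPS[state]
    col, row = divmod(state - 1, N)  # assumed column-major numbering
    d_row, d_col = MOVES[action]
    row, col = row + d_row, col + d_col
    if 0 <= row < N and 0 <= col < N:
        return col * N + row + 1, 0.0
    return state, -1.0  # wall bump: stay in place, reward -1


def backup(values, state, action):
    """One-step lookahead: reward plus discounted value of the next state."""
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup until the largest change is below tol."""
    values = {s: 0.0 for s in range(1, N * N + 1)}
    while True:
        new_values = {s: max(backup(values, s, a) for a in MOVES) for s in values}
        delta = max(abs(new_values[s] - values[s]) for s in values)
        values = new_values
        if delta < tol:
            return values


def greedy_actions(values, state, eps=1e-6):
    """All action codes whose backup is within eps of the best one (ties kept)."""
    scores = {a: backup(values, state, a) for a in MOVES}
    best = max(scores.values())
    return [a for a in MOVES if best - scores[a] < eps]


values = value_iteration()
policy = {s: greedy_actions(values, s) for s in values}
```

Keeping every tied action in `policy[s]` mirrors how an optimal-policy figure can show several arrows in one cell. In state 6, for instance, all four actions are equally good because each triggers the same jump.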

Sutton & Barto: Reinforcement Learning, p. 79.