MDPProblem

You will use the kuimaze2.MDPProblem environment in tasks whose goal is to find the optimal policy (strategy) for a Markov decision process (MDP). It is used in the third compulsory task, 08-MDPs.

Public interface

After creating an instance of the MDPProblem class (see Usage), you can use the methods demonstrated in the examples below.

Usage

The environment is typically used as follows:

Import the required classes:

>>> from kuimaze2 import Map, MDPProblem, State

Create a map to initialize the environment:

>>> MAP = """
S.D
..G
"""
>>> map = Map.from_string(MAP)

Create an environment, first a deterministic one:

>>> env1 = MDPProblem(map)

If you want to turn on the graphical display of the environment:

>>> env1 = MDPProblem(map, graphics=True)

If we want to create a non-deterministic environment (and in the case of MDPs we usually do), we need to specify the probability with which the environment performs the agent's requested action (forward) and the probabilities with which it instead "slips" to the left, to the right, or backward relative to that action.

>>> env2 = MDPProblem(map, action_probs=dict(forward=0.8, left=0.1, right=0.1, backward=0.0))

List of all valid states in the environment:

>>> env2.get_states()
[State(r=0, c=0), State(r=0, c=1), State(r=0, c=2), State(r=1, c=0), State(r=1, c=1), State(r=1, c=2)]

Find out whether a state is terminal:

>>> env2.is_terminal(State(0, 0)), env2.is_terminal(State(0, 2))
(False, True)
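
The two previous calls can be combined, for example, to collect all terminal states at once (an illustrative snippet built only from the methods shown above; terminal_states is our own variable, not part of the environment):

>>> terminal_states = [s for s in env2.get_states() if env2.is_terminal(s)]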

What rewards are associated with individual states? Rewards are paid out when leaving the state.

>>> env2.get_reward(State(0,0)), env2.get_reward(State(0,2)), env2.get_reward(State(1,2))
(-0.04, -1.0, 1.0)

What actions are available in a state? In our environment, all four actions are always available, but if the agent would hit a wall, it stays in place.

>>> actions = env2.get_actions(State(0, 0))
>>> actions
[<Action.UP: 0>, <Action.RIGHT: 1>, <Action.DOWN: 2>, <Action.LEFT: 3>]
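
Since the goal of the task is a policy, i.e. an assignment of one action to every state, the methods above are already enough to represent one as a plain dictionary. A minimal sketch, with a uniformly random choice standing in for whatever your algorithm will eventually compute (check the assignment for the exact format it expects):

>>> import random
>>> # A policy maps every state to one action; here chosen at random as a placeholder.
>>> policy = {s: random.choice(env2.get_actions(s)) for s in env2.get_states()}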

To which states can I get, and with what probability, if I perform a certain action in a certain state? In a deterministic environment:

>>> env1.get_next_states_and_probs(State(0, 0), actions[0])
[(State(r=0, c=0), 1.0)]

In a non-deterministic environment, the result will be different:

>>> env2.get_next_states_and_probs(State(0, 0), actions[0])
[(State(r=0, c=0), 0.8), (State(r=0, c=1), 0.1), (State(r=1, c=0), 0.0), (State(r=0, c=0), 0.1)]

Note that some resulting states may appear multiple times in the list, because different slip directions of the same requested action can lead to the same resulting state (here, both the forward move and the left slip hit a wall and leave the agent in place).
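
Repeated states are harmless as long as you sum over the whole returned list, e.g. when computing the expected value of an action as in the Bellman update. A minimal sketch, assuming value estimates V stored in a dict and a discount factor gamma (the helper q_value is ours, not part of kuimaze2):

>>> gamma = 0.9
>>> V = {s: 0.0 for s in env2.get_states()}   # value estimates, here initialized to zero
>>> def q_value(env, state, action):
...     # Reward is received on leaving `state`; duplicate successors simply add up in the sum.
...     return env.get_reward(state) + gamma * sum(
...         prob * V[next_state]
...         for next_state, prob in env.get_next_states_and_probs(state, action)
...     )
...
>>> q = q_value(env2, State(0, 0), actions[0])   # equals -0.04 while all values in V are zero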