RLProblem

You will use the kuimaze2.RLProblem environment when learning the optimal strategy for an unknown MDP using reinforcement learning methods. It is used in the fourth compulsory task, 11-RL.

Public Interface

After creating an instance of the RLProblem class (see Usage), you can use the following methods, all of which are demonstrated below:

- get_states(): returns the list of all valid states in the environment
- get_action_space(): returns the list of all actions the agent can choose from
- sample_action(): returns a randomly selected action
- reset(): starts a new episode and returns the agent's initial state
- step(action): attempts to perform the action and returns the new state, the immediate reward, and a flag indicating whether the episode has finished

Usage

The RLProblem environment is created the same way as MDPProblem, but the usage is different.

Environment import:

>>> from kuimaze2 import Map, RLProblem

Creating a map to initialize the environment:

>>> MAP = "SG"
>>> map = Map.from_string(MAP)

Creating a deterministic environment with graphical display:

>>> env1 = RLProblem(map, graphics=True)

Creating a non-deterministic environment (action_probs specifies the probabilities of where the agent actually moves relative to the chosen action: straight ahead, to its left, to its right, or backward):

>>> env2 = RLProblem(map, action_probs=dict(forward=0.8, left=0.1, right=0.1, backward=0.0))

List of all valid states in the environment:

>>> env2.get_states()
[State(r=0, c=0), State(r=0, c=1)]

List of all actions that can be performed in the environment (the action space is the same in every state):

>>> env2.get_action_space()
[<Action.UP: 0>, <Action.RIGHT: 1>, <Action.DOWN: 2>, <Action.LEFT: 3>]

A randomly selected action may also be useful:

>>> env2.sample_action()  # The result can be any of the possible actions.
<Action.UP: 0>
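
During learning, sample_action() is typically used for the exploration half of an epsilon-greedy policy. A sketch (assuming the q_table dictionary from above; the function name and the epsilon value are illustrative, not part of the kuimaze2 API):

import random

def epsilon_greedy(env, q_table, state, epsilon=0.2):
    # With probability epsilon explore with a random action,
    # otherwise exploit the current Q-value estimates.
    if random.random() < epsilon:
        return env.sample_action()
    return max(env.get_action_space(), key=lambda a: q_table[(state, a)])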

The step method attempts to perform the selected action in the environment:

>>> env2.step(env2.sample_action())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\kuimaze2\rl.py", line 60, in step
    raise NeedsResetError(
kuimaze2.exceptions.NeedsResetError: RLProblem: Episode terminated. You must call reset() first.

As you can see, before the first use of the step() method, you need to reset the environment.

Calling the reset() method will return the initial state of the agent for the given episode:

>>> state = env2.reset()
>>> state
State(r=0, c=0)

Now we can call the step() method:

>>> action = env2.sample_action()
>>> action
<Action.DOWN: 2>
>>> new_state, reward, episode_finished = env2.step(action)
>>> new_state
State(r=0, c=0)
>>> reward
-0.04
>>> episode_finished
False

We tried to perform the Action.DOWN action, but we apparently hit a wall, because new_state is identical to the original state. We received an immediate reward of -0.04 for the action, and since the episode has not finished, we can continue.
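
As an aside (standalone code, not part of the session above, because every trial starts a new episode with reset()): since the transition model is hidden, the only way to see how often an action slips is to sample it repeatedly. A sketch:

# Estimate where Action.DOWN actually leads from the start state.
down = env2.get_action_space()[2]   # Action.DOWN, per the listing above
counts = {}
for _ in range(1000):
    env2.reset()
    new_state, reward, episode_finished = env2.step(down)
    counts[new_state] = counts.get(new_state, 0) + 1
# With the action_probs above, roughly 9 in 10 trials should stay in
# State(r=0, c=0) (walls block the forward and right slips), and about
# 1 in 10 should slip east into State(r=0, c=1).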

So let's take random steps until the episode ends:

>>> while not episode_finished:
...     action = env2.sample_action()
...     new_state, reward, episode_finished = env2.step(action)
...     print(f"{state=} {action=} {reward=} {new_state=} {episode_finished=}")
...     state = new_state
...
state=State(r=0, c=0) action=<Action.DOWN: 2> reward=-0.04 new_state=State(r=0, c=0) episode_finished=False
state=State(r=0, c=0) action=<Action.RIGHT: 1> reward=-0.04 new_state=State(r=0, c=1) episode_finished=False
state=State(r=0, c=1) action=<Action.UP: 0> reward=1.0 new_state=None episode_finished=True

Note that in the last step new_state is None: once the episode has terminated, there is no successor state. Another call to the step() method would again throw an exception; the episode has ended, and to start a new one we need to call reset() again.
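
Putting it all together, the interface above is all a tabular Q-learning agent needs. A minimal sketch (alpha, gamma, the episode count, and the epsilon_greedy helper from above are illustrative choices, not part of the kuimaze2 API):

alpha, gamma = 0.1, 0.9                 # learning rate and discount factor

for _ in range(500):
    state = env2.reset()                # each episode must start with reset()
    episode_finished = False
    while not episode_finished:
        action = epsilon_greedy(env2, q_table, state)
        new_state, reward, episode_finished = env2.step(action)
        # Q-learning update; in the terminal step new_state is None,
        # so there is no future value to bootstrap from.
        if new_state is None:
            target = reward
        else:
            target = reward + gamma * max(
                q_table[(new_state, a)] for a in env2.get_action_space()
            )
        q_table[(state, action)] += alpha * (target - q_table[(state, action)])
        state = new_state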