4. Reinforcement Learning

Your task is to implement the Q-learning algorithm to find the best strategy in an environment about which you have only incomplete information. You can only perform available actions and observe their effects (reinforcement learning).
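
For reference, the standard Q-learning update after the agent takes action a in state s, receives reward r and ends up in state s' is:

Q(s, a) ← Q(s, a) + α · (r + γ · max_{a'} Q(s', a') − Q(s, a))

where α (alpha) is the learning rate and γ (gamma) is the discount factor; both are passed to your agent at initialization.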

Specifications

In the rl_agent.py module, implement the RLAgent class. The class must implement the following interface:

method        | input parameters                            | output parameters | explanation
__init__      | env: RLProblem, gamma: float, alpha: float  | none              | Agent initialization.
learn_policy  | none                                        | Policy            | Returns the best strategy, i.e., a dictionary of (state, action) pairs.

Note: The environment (env) is an instance of the RLProblem class. The initialization of the environment and the visualization methods are the same as for MDPProblem, but working with the environment differs: to learn anything about it, you must execute actions. We do not have a map; the environment can only be explored through its main method env.step(action). The environment simulator keeps track of the current state.
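
A minimal sketch of how such an agent might look. This is one possible approach, not a required implementation: the epsilon-greedy exploration strategy and the ACTIONS placeholder are assumptions here (substitute the framework's real action set), while env.reset(), env.step() and env.get_states() are the methods described in this text.

import random
import time

ACTIONS = [0, 1, 2, 3]  # hypothetical placeholder; use the environment's real action set

class RLAgent:
    def __init__(self, env, gamma, alpha):
        self.env = env
        self.gamma = gamma  # discount factor
        self.alpha = alpha  # learning rate
        # Q-table as a dictionary of dictionaries, all values initialized to 0
        self.q = {s: {a: 0.0 for a in ACTIONS} for s in env.get_states()}

    def learn_policy(self):
        epsilon = 0.2                    # exploration rate (tune as needed)
        deadline = time.time() + 18.0    # stay safely under the 20-second limit
        while time.time() < deadline:
            state = self.env.reset()
            episode_finished = False
            while not episode_finished and time.time() < deadline:
                # Epsilon-greedy action selection
                if random.random() < epsilon:
                    action = random.choice(ACTIONS)
                else:
                    action = max(self.q[state], key=self.q[state].get)
                next_state, reward, episode_finished = self.env.step(action)
                # Q-learning update
                best_next = 0.0 if episode_finished else max(self.q[next_state].values())
                td_error = reward + self.gamma * best_next - self.q[state][action]
                self.q[state][action] += self.alpha * td_error
                state = next_state
        # Return the greedy policy: the best action for every state
        return {s: max(qs, key=qs.get) for s, qs in self.q.items()}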

How to

How will the evaluation script call your code?

Your code will be called by the evaluation script approximately like this:

import kuimaze2
import rl_agent
 
env = kuimaze2.RLProblem(...)  # here the script creates the environment
 
# Calls of your code
agent = rl_agent.RLAgent(env, gamma, alpha)
policy = agent.learn_policy()  # 20 second limit
 
# Evaluation of one episode using your policy
state = env.reset()
total_reward = 0
episode_finished = False
while not episode_finished:
    action = policy[state]
    next_state, reward, episode_finished = env.step(action)
    total_reward += reward
    state = next_state

Q-function Representation

During the implementation, you will probably need to work with the Q-function. In our discrete world, it takes the form of a table. This table can be represented in various ways, for example as a dictionary of dictionaries (the representation used in example_rl.py) or as a flat dictionary keyed by (state, action) pairs.
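
A sketch of both options; the string states and actions here are purely illustrative (in kuimaze2 they are framework objects):

# Option 1: dictionary of dictionaries, as in example_rl.py
q_dd = {"s0": {"up": 0.0, "down": 0.0},
        "s1": {"up": 0.0, "down": 0.0}}
print(q_dd["s0"]["up"])   # read a Q-value
q_dd["s0"]["up"] = 1.5    # update a Q-value

# Option 2: flat dictionary keyed by (state, action) pairs
q_flat = {("s0", "up"): 0.0, ("s0", "down"): 0.0}
q_flat[("s0", "up")] = 1.5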

Q-function Initialization

In real RL tasks, we do not have a “map” of the environment and often do not even know the set of all states. Learning then never really ends, because we can never be sure that we have reached all reachable states (and learned all their Q-values well enough). In our task, however, learning must end and you must return a complete strategy, i.e., the best action for each state. There must therefore be a way to obtain the set of all states: the list of all valid states is returned by the get_states() method.

The example_rl.py script already contains an initialization of q_table as a dictionary of dictionaries. If you choose a different representation of the Q-value table, adjust the initialization accordingly.
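
One possible initialization over all states; env.get_states() is the method described above, while the actions variable is a placeholder for the environment's action set:

actions = [...]  # placeholder: fill in the environment's action set
q_table = {state: {action: 0.0 for action in actions}
           for state in env.get_states()}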

Submission

Evaluation

Familiarize yourself with the evaluation and scoring of the task.