Reinforcement Learning (4th assignment)

Reinforcement Learning

The task will again use the kuimaze environment: kuimaze.zip package (updated 2023-04-24).

Implement the learn_policy(env) function in an rl_agent.py file that you will upload to Brute. You will probably want to implement the Q-learning algorithm (though, as an extra, you can try other approaches such as Direct Evaluation). The goal is to find the best possible strategy (policy) for reaching the goal, i.e. the strategy with the highest expected sum of discounted rewards.
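As a reminder of what a single Q-learning step does, here is a minimal sketch. It assumes the Q-table is a dictionary of lists indexed as q[state][action]; the function name and the default values of alpha (learning rate) and gamma (discount factor) are illustrative and not prescribed by the assignment.

```python
def q_learning_update(q, state, action, reward, next_state, is_done,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if is_done else max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical two-state table, only to show the call.
q = {(0, 0): [0.0, 0.0, 0.0, 0.0], (1, 0): [0.0, 2.0, 0.0, 0.0]}
new_value = q_learning_update(q, (0, 0), 1, 1.0, (1, 0), False,
                              alpha=0.5, gamma=0.5)
```

The update moves the stored value a fraction alpha of the way toward the target reward + gamma * max over next actions, so a smaller alpha averages over more samples in a stochastic environment.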

The output of the function should be a strategy (policy). Its representation is identical to the previous task, i.e. a dictionary that assigns an action to each state. The actions are marked with the values [0,1,2,3], which correspond to the actions up, right, down, left (N,E,S,W).
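For illustration, a policy for a tiny made-up corridor might look as follows. The three states and the chosen actions are hypothetical, and states are assumed to be plain (x, y) tuples.

```python
# Action codes as given in the assignment text: 0=up, 1=right, 2=down, 3=left.
ACTION_NAMES = {0: "N", 1: "E", 2: "S", 3: "W"}

# Hypothetical 3x1 corridor; every state gets exactly one action.
policy = {
    (0, 0): 1,  # go right
    (1, 0): 1,  # go right
    (2, 0): 0,  # some action is stored for this state as well
}

for state, action in sorted(policy.items()):
    print(state, ACTION_NAMES[action])
```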

The input (env) is an instance of the HardMaze class. Environment initialization and visualization methods are the same as in MDPMaze, but working with the environment is different. Recall, from the lecture, that action is necessary in order to learn anything about the environment at all. We do not have a map and we can explore the environment using the main method env.step(action). The environment-simulator knows what the current state is.
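The interaction pattern can be sketched as below. Because kuimaze is not imported here, the sketch uses ToyCorridor, a hypothetical stand-in that only mimics the reset()/step() call shape used by the evaluation script on this page; its layout, rewards, and three-item observation are assumptions, not the behavior of HardMaze.

```python
import random


class ToyCorridor:
    """Hypothetical stand-in for HardMaze: a 4x1 corridor with the goal at x=3."""

    def reset(self):
        self.x = 0
        return (self.x, 0, 0)  # the first two items are treated as the state

    def step(self, action):
        if action == 1:      # right
            self.x = min(self.x + 1, 3)
        elif action == 3:    # left
            self.x = max(self.x - 1, 0)
        is_done = self.x == 3
        reward = 1.0 if is_done else -0.04  # made-up reward values
        return (self.x, 0, 0), reward, is_done, None


def explore_randomly(env, max_steps=100, seed=0):
    """Walk randomly for one episode and return the observed transitions."""
    rng = random.Random(seed)
    transitions = []
    state = env.reset()[0:2]
    for _ in range(max_steps):
        action = rng.randrange(4)
        observation, reward, is_done, _ = env.step(action)
        next_state = observation[0:2]
        transitions.append((state, action, reward, next_state, is_done))
        state = next_state
        if is_done:
            break
    return transitions


transitions = explore_randomly(ToyCorridor())
```

Each recorded tuple (state, action, reward, next_state, is_done) is exactly the information a tabular learner gets from one call to env.step(action).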

The time limit for learning on one environment is 20 seconds. Be sure to turn off visualizations before submitting, see VERBOSITY in rl_sandbox.py.
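One way to respect the time limit is to check the clock between episodes. The following is a minimal sketch; run_within_budget, run_episode, and safety_margin are hypothetical names, and the margin value is a guess that you would tune yourself.

```python
import time


def run_within_budget(run_episode, time_limit=20.0, safety_margin=1.0):
    """Call run_episode() repeatedly until the time budget is nearly spent."""
    deadline = time.monotonic() + time_limit - safety_margin
    episodes = 0
    while time.monotonic() < deadline:
        run_episode()
        episodes += 1
    return episodes


# Tiny budget so the demonstration finishes immediately.
episodes = run_within_budget(lambda: None, time_limit=0.05, safety_margin=0.01)
```

The clock is only checked between episodes, so a single episode should also be capped (for example by a maximum number of steps) to keep it from overrunning the deadline.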

The package includes rl_sandbox.py, which shows basic random exploration, a possible initialization of the Q-value table, visualization, and so on.

More examples can be found on AI-Gym. Recall from the lecture that action is necessary to learn about the environment.

Guidelines

How will your code be called by the evaluation script?

Your code will be called by the evaluation script approximately as follows:

import kuimaze
import rl_agent

env = kuimaze.HardMaze(...)  # here the script creates the environment

# Calling your function!
policy = rl_agent.learn_policy(env)  # with 20 seconds limit

# Rating of one episode using your policy
total_reward = 0  # sum of rewards collected during the episode
observation = env.reset()
state = observation[0:2]
is_done = False
while not is_done:
    action = int(policy[state])
    observation, reward, is_done, _ = env.step(action)
    next_state = observation[0:2]
    total_reward += reward
    state = next_state

How to start the implementation?

Start by getting familiar with the HardMaze environment. Then study the code in rl_sandbox.py, which shows basic random exploration, a possible initialization of the Q-value table, visualization, etc.

Q-function representation

You will probably need to work with the Q-function in the implementation. In our discrete world, it will take the form of a table. This table can be represented in different ways, e.g.

as a 3D numpy array (as indicated in the file rl_sandbox.py) that is indexed by three “coordinates”: x, y, action;
as a dictionary q indexed by the pair state, action, in which you will access individual elements as q[state, action];
as a dictionary of lists q, where the dictionary is indexed by the state and the inner list by the action number, so you will access the individual elements as q[state][action];
etc.
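The two dictionary-based layouts listed above can be sketched as follows, together with one possible way of turning a dictionary of lists into a policy. The states are hypothetical (x, y) tuples and greedy_policy is an illustrative helper, not part of kuimaze.

```python
states = [(0, 0), (1, 0)]  # hypothetical states as (x, y) tuples
actions = [0, 1, 2, 3]

# Dictionary keyed by (state, action) pairs: q_pairs[state, action]
q_pairs = {(state, action): 0.0 for state in states for action in actions}

# Dictionary of lists: q_lists[state][action]
q_lists = {state: [0.0] * len(actions) for state in states}

# Both layouts store the same information.
q_pairs[(0, 0), 1] = 0.5
q_lists[(0, 0)][1] = 0.5


def greedy_policy(q):
    """Pick the highest-valued action in each state of a dictionary of lists."""
    return {state: values.index(max(values)) for state, values in q.items()}


policy = greedy_policy(q_lists)
```

The dictionary of lists makes the per-state maximum a one-liner, which is convenient both for the learning update and for extracting the final policy. Ties are resolved in favor of the lowest action number.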

Q-function initialization

Whatever representation of the Q-function you choose, you will need to initialize it somehow, for which it would be useful to know the set of states. In real-world RL tasks, we do not have a “map” of the environment and often we do not even know the set of all states. RL then never ends because we can never be sure that we have already reached all attainable states. But in our task, RL must end and you must return the complete strategy, i.e. the best possible action for each state. Therefore, here, a list of all valid states in the environment can be obtained using the get_all_states() method:

>>> env.get_all_states()
[(x=0, y=0),
 ...
 (x=4, y=2)]
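Given such a list, the table can be initialized in one pass. This sketch assumes that each returned state can be sliced like a tuple, so that state[0:2] yields the same (x, y) key as observation[0:2] in the evaluation loop; the State namedtuple below is a hypothetical stand-in for whatever env.get_all_states() actually returns.

```python
from collections import namedtuple


def init_q_table(states, n_actions=4, initial_value=0.0):
    """Build a dictionary-of-lists Q-table with one row per state."""
    # Assumption: the first two items of a state are its x and y coordinates.
    return {tuple(state[0:2]): [initial_value] * n_actions for state in states}


# Hypothetical stand-in for the objects returned by env.get_all_states().
State = namedtuple("State", ["x", "y"])
q = init_q_table([State(0, 0), State(4, 2)])
```

Using plain tuples as keys keeps lookups consistent with the way the evaluation script indexes the policy.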

Alternatively, all states can be generated if you know the dimensions of the map:

>>> x_dims = env.observation_space.spaces[0].n
>>> y_dims = env.observation_space.spaces[1].n
>>> maze_size = (x_dims, y_dims)
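From maze_size, every grid coordinate can then be enumerated, for example as sketched below. Note that this lists every cell of the grid, so it may include cells that are not valid states (such as walls); grid_states is an illustrative helper name.

```python
from itertools import product


def grid_states(maze_size):
    """Return every (x, y) coordinate pair of a grid with the given size."""
    x_dims, y_dims = maze_size
    return list(product(range(x_dims), range(y_dims)))


# A 5x3 grid, matching the corner states (0, 0) and (4, 2) shown earlier.
states = grid_states((5, 3))
```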

Grading and deadlines

The task submission deadline can be seen in the Upload system.

The grading is divided as follows:

Automatic evaluation tests your agent's performance on 5 environments. In each environment, we run the policy you supplied n times and calculate the average sum of the rewards it earns. This is then compared to the teacher's solution (an optimizing agent). You earn one point for each of the 5 environments in which your average sum of rewards reaches 80% or more of the teacher's value.
Manual evaluation is based on code quality (clean code).

Evaluation	min	max	note
RL algorithm quality	0	5	Evaluation of the algorithm by the automatic evaluation system.
Code quality	0	1	Comments, structure, elegance, code cleanliness, appropriate naming of variables, …

Code Quality (1 point):

Appropriate comments, or the code is understandable enough to not need comments
Reasonably long or short methods / functions
Variable names (nouns) and function names (verbs) help readability and comprehensibility
Code pieces do not repeat (no copy-paste)
Reasonable use of memory and processor time
Consistent names and code layout throughout the file (separating words in the same way, etc.)
Clear code structure (e.g. avoid assigning many variables in one line)

You can follow PEP8, although we do not check all PEP8 requirements. Most IDEs (certainly PyCharm) point out PEP8 violations. You can also read some other sources for inspiration about clean code (e.g., here) or about idiomatic Python (e.g., medium, python.net).
