The task will again use the kuimaze environment: kuimaze.zip package (updated 2023-04-24).
Implement the learn_policy(env) function in a file rl_agent.py that you will upload to Brute. You will want to implement the Q-learning algorithm (though you can try other approaches as an extra, such as Direct Evaluation). The goal is to find the best possible strategy (policy) that leads to the goal, i.e. the strategy with the highest expected sum of discounted rewards.
The output of the function should be a strategy (policy). Its representation is identical to the previous task: a dictionary that assigns an action to each state. The actions are encoded by the values 0, 1, 2, 3, which correspond to the actions up, right, down, left (N, E, S, W).
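For illustration, a returned policy might look like the following sketch (state keys are (x, y) tuples as in the previous task; the particular actions shown here are made up):

policy = {
    (0, 0): 1,   # in state (0, 0) go right (E)
    (1, 0): 2,   # go down (S)
    (1, 1): 1,   # go right (E)
}   # ... and so on, one entry for every state of the maze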
The input env is an instance of the HardMaze class. Environment initialization and the visualization methods are the same as in MDPMaze, but working with the environment is different. Recall from the lecture that acting is necessary in order to learn anything about the environment at all. We do not have a map; we can only explore the environment through the main method env.step(action). The environment simulator keeps track of the current state.
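A minimal interaction sketch (it mirrors the evaluation snippet further below and assumes that the first two components of an observation are the agent's (x, y) position):

import random

obs = env.reset()                       # start a new episode
state = obs[0:2]                        # (x, y) position of the agent
action = random.choice([0, 1, 2, 3])    # pick a random action: N, E, S, W
obs, reward, is_done, _ = env.step(action)
next_state = obs[0:2]                   # where the action actually took us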
The learning time limit for one environment is 20 seconds. Be sure to turn off visualizations before submitting; see VERBOSITY in rl_sandbox.py.
The package includes rl_sandbox.py, where you can see basic random exploration of the environment, a possible initialization of the table of Q-values, visualization, and so on.
More examples can be found on AI-Gym.
Your code will be called by the evaluation script approximately as follows:
import kuimaze
import rl_agent

env = kuimaze.HardMaze(...)              # here the script creates the environment

# Calling your function! (with the 20-second limit)
policy = rl_agent.learn_policy(env)

# Evaluation of one episode using your policy
total_reward = 0
observation = env.reset()
state = observation[0:2]
is_done = False
while not is_done:
    action = int(policy[state])
    observation, reward, is_done, _ = env.step(action)
    next_state = observation[0:2]
    total_reward += reward
    state = next_state
Start by getting acquainted with the HardMaze environment, then try to understand the code in rl_sandbox.py.
You will probably need to work with the Q-function in your implementation. In our discrete world it takes the form of a table. This table can be represented in different ways, e.g. as a numpy array q indexed by state and action and accessed as q[state, action], or as a nested dictionary accessed as q[state][action].
Whatever representation of the Q-function you choose, you will need to initialize it somehow, for which it is useful to know the set of states. In real-world RL tasks we do not have a “map” of the environment and often do not even know the set of all states; learning then never really ends, because we can never be sure that we have already reached all attainable states. In our task, however, learning must end and you must return a complete strategy, i.e. the best possible action for each state. A list of all valid states in the environment can therefore be obtained with the get_all_states() method:
>>> env.get_all_states()
[(x=0, y=0), ... (x=4, y=2)]
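For example, a dictionary-based Q-table could be initialized from this list roughly as follows (a sketch; the exact type of the returned states may differ, the printout above suggests named tuples with x and y fields):

q = {}
for state in env.get_all_states():
    s = tuple(state)[0:2]      # keep only the (x, y) coordinates as the key
    q[s] = [0.0] * 4           # one initial value per action 0..3 (N, E, S, W)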
The dimensions of the maze can also be read from the observation space, e.g. if you prefer a numpy-array representation of the Q-table:

>>> x_dims = env.observation_space.spaces[0].n
>>> y_dims = env.observation_space.spaces[1].n
>>> maze_size = (x_dims, y_dims)
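Putting the pieces together, a possible skeleton of learn_policy might look like the sketch below. It only assumes the interface shown above (env.reset(), env.step(), env.get_all_states()); the hyperparameters alpha, gamma and epsilon and the 18-second budget are illustrative choices, not prescribed values.

import random
import time


def learn_policy(env):
    alpha, gamma, epsilon = 0.1, 0.9, 0.2     # illustrative hyperparameters
    budget = 18.0                             # stay safely under the 20 s limit
    start = time.time()

    # Q-table: one list of four action values per (x, y) state
    q = {tuple(state)[0:2]: [0.0] * 4 for state in env.get_all_states()}

    while time.time() - start < budget:
        obs = env.reset()
        state = obs[0:2]
        is_done = False
        while not is_done and time.time() - start < budget:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randint(0, 3)
            else:
                action = max(range(4), key=lambda a: q[state][a])

            obs, reward, is_done, _ = env.step(action)
            next_state = obs[0:2]

            # Q-learning update:
            #   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            future = 0.0 if is_done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state

    # greedy policy extraction: the best known action for every state
    return {s: max(range(4), key=lambda a: q[s][a]) for s in q}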
The task submission deadline can be seen in the Upload system.
The grading is divided as follows:
Code Quality (1 point):
You can follow PEP8, although we do not check full PEP8 compliance. Most IDEs (certainly PyCharm) point out PEP8 violations. You can also look at other sources for inspiration on clean code (e.g., here) or on idiomatic Python (e.g., medium, python.net).