Implement the learn_policy(env) method in a rl_agent.py file that you will upload to Brute. This time, env is of type HardMaze. The expected output is policy, a dictionary keyed by states; the values can be [0,1,2,3], which correspond to up, right, down, left (N, E, S, W). The learning limit on one tile is 20 seconds. Be sure to turn off visualizations before submitting, see VERBOSITY in rl_sandbox.py.
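The shape of the expected return value can be illustrated with a small sketch. It assumes, as one possible design, that the agent keeps four action values per state; the helper name `q_table_to_policy` and the table layout are illustrative and not part of kuimaze.

```python
def q_table_to_policy(q_table):
    """Turn {state: [q_up, q_right, q_down, q_left]} into {state: action}.

    The table layout is an assumption made for this sketch; the assignment
    only fixes the shape of the returned dictionary.
    """
    policy = {}
    for state, values in q_table.items():
        # Index of the largest value: 0 = up, 1 = right, 2 = down, 3 = left.
        policy[state] = max(range(4), key=lambda a: values[a])
    return policy


example_q = {(0, 0): [0.1, 0.5, 0.2, 0.0], (1, 0): [0.0, 0.0, 0.3, 0.1]}
print(q_table_to_policy(example_q))  # {(0, 0): 1, (1, 0): 2}
```

States are plain (x, y) tuples here so that they can serve as dictionary keys.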
Again, we will use the cubic environment. Download the updated kuimaze.zip package. Visualization and initialization work the same way as before, but the basic idea of working with the environment is different. We do not have a map; we can only explore the environment through its main method, env.step(action). The environment simulator knows what the current state is. We are looking for the best path from start to finish, that is, the path with the highest expected sum of discounted rewards.
obv, reward, done, _ = env.step(action)
state = obv[0:2]
You can get the action by random selection:
action = env.action_space.sample()
The package includes rl_sandbox.py, where you can see basic random exploration, a possible initialization of the Q-value table, visualization, and so on.
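As a rough illustration of a Q-value table and of mixing random with greedy action selection, the sketch below creates table rows lazily and picks actions epsilon-greedily. The names `make_q_table`, `choose_action` and `N_ACTIONS` are hypothetical, and rl_sandbox.py may well initialize its table differently.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # up, right, down, left


def make_q_table():
    # There is no map, so the set of states is unknown in advance.
    # Rows of zeros are created the first time a state is looked up.
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)  # random exploratory action
    values = q_table[state]
    return values.index(max(values))  # best action seen so far


q = make_q_table()
q[(1, 2)][3] = 1.0
print(choose_action(q, (1, 2), epsilon=0.0))  # 3
```

With epsilon set to 0 the choice is purely greedy; with epsilon set to 1 it is purely random, much like env.action_space.sample().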
More examples can be found on AI-Gym. Let us recall from the lecture that taking actions is necessary to learn about the environment.
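One way to learn from actions is tabular Q-learning, sketched below under several assumptions. The assignment does not prescribe this algorithm. `ToyEnv` is a hypothetical 3x3 stand-in that only mimics the call pattern shown above (it is not HardMaze), `env.reset()` is assumed to behave Gym-style and return an observation, and the parameters `time_limit`, `alpha`, `gamma`, `epsilon` and `max_steps` are illustrative choices.

```python
import random
import time
from collections import defaultdict


class _ToyActionSpace:
    def sample(self):
        return random.randrange(4)


class ToyEnv:
    """Hypothetical 3x3 grid used only to make this sketch runnable."""
    MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # N, E, S, W

    def __init__(self, size=3):
        self.size = size
        self.action_space = _ToyActionSpace()
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return (0, 0, 0)

    def step(self, action):
        dx, dy = self.MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)  # walls block moves
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == (self.size - 1, self.size - 1)
        return (x, y, 0), (1.0 if done else -0.04), done, {}


def learn_policy(env, time_limit=1.0, alpha=0.1, gamma=0.95,
                 epsilon=0.2, max_steps=200):
    q = defaultdict(lambda: [0.0] * 4)
    deadline = time.monotonic() + time_limit  # stop before the time budget ends
    while time.monotonic() < deadline:
        state = tuple(env.reset()[0:2])
        for _ in range(max_steps):
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = q[state].index(max(q[state]))
            obv, reward, done, _ = env.step(action)
            next_state = tuple(obv[0:2])
            # Q-learning target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    # Greedy policy over every state that was actually visited.
    return {s: v.index(max(v)) for s, v in q.items()}
```

A call such as `learn_policy(ToyEnv(), time_limit=0.5)` returns a dictionary from (x, y) tuples to actions. States that were never visited, and terminal states, do not appear in it, so a real submission may need to handle those separately and should keep `time_limit` safely under the allowed learning time.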
The task submission deadline can be seen in the Upload system.
The grading is divided as follows:
Evaluation | min | max | note |
---|---|---|---|
RL algorithm quality | 0 | 5 | Evaluation of the algorithm by the automatic evaluation system. |
Code quality | 0 | 1 | Comments, structure, elegance, code cleanliness, appropriate naming of variables, … |
Code quality (1 point):
You can follow PEP8, although we do not check all PEP8 requirements. Most IDEs (certainly PyCharm) point out violations of PEP8. You can also read other sources for inspiration about clean code (e.g., here) or about idiomatic Python (e.g., medium, python.net).