Search
Implement the method learn_policy(env) in a file rl_agent.py and upload it to Brute. This time, env is of type HardMaze. The expected output is policy, a dictionary keyed by states, whose values are from [0, 1, 2, 3], corresponding to up, right, down, left (N, E, S, W). The time limit for learning in one maze is 20 seconds. Be sure to turn off visualizations before submitting; see VERBOSITY in rl_sandbox.py.
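To make the required interface concrete, here is a minimal sketch of what rl_agent.py could look like. Only the signature and the format of the returned policy follow the assignment; the q dictionary and the learning loop are placeholders for your own solution.

# rl_agent.py -- sketch of the interface only; the actual learning is up to you
def learn_policy(env):
    """Return a dictionary mapping each state (tuple) to an action 0-3 (N, E, S, W)."""
    q = {}  # q[state] -> list of 4 estimated action values, filled in during learning
    # ... interact with env via env.step(action) and update q here ...
    policy = {state: max(range(4), key=lambda a: values[a])
              for state, values in q.items()}
    return policy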
Again, we will work with the grid (maze) environment. Download the updated kuimaze.zip package. The visualization methods are the same, as is the initialization, but the basic idea of working with the environment is different: we do not have a map of the environment, and we explore it using the main method env.step(action). The environment simulator keeps track of the current state. We are looking for the best path from start to goal, i.e., a trip with the highest expected sum of discounted rewards.
obv, reward, done, _ = env.step(action)
state = obv[0:2]
You can get an action by random selection:
action = env.action_space.sample()
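Put together, one learning episode with epsilon-greedy Q-learning could look roughly like the sketch below. This is only an illustration: env.reset() and the state extraction via obv[0:2] are assumed to behave as in rl_sandbox.py, and the parameters alpha, gamma and epsilon are arbitrary choices.

import random

def run_episode(env, q, alpha=0.1, gamma=0.9, epsilon=0.2):
    obv = env.reset()  # assumed Gym-style reset, as used in rl_sandbox.py
    state = tuple(obv[0:2])
    done = False
    while not done:
        q.setdefault(state, [0.0, 0.0, 0.0, 0.0])
        # explore with probability epsilon, otherwise exploit the current Q estimate
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = max(range(4), key=lambda a: q[state][a])
        obv, reward, done, _ = env.step(action)
        next_state = tuple(obv[0:2])
        q.setdefault(next_state, [0.0, 0.0, 0.0, 0.0])
        # standard Q-learning update
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state
    return q

Repeating such episodes until the time limit is near, and then taking the argmax action in every visited state, gives the policy dictionary that learn_policy(env) should return.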
The package includes rl_sandbox.py, where you can see a basic random walk through the environment, a possible initialization of the table of Q-values, visualization, and so on.
More examples can be found on AI-Gym. Let us recall from the lecture that taking actions is necessary to learn about the environment.
The task submission deadline can be seen in the Upload system.
The grading is divided as follows:
Code Quality (1 point):
You can follow PEP8, although we do not check all PEP8 requirements. Most IDEs (certainly PyCharm) point out PEP8 violations. You can also read other sources for inspiration about clean code (e.g., here) or about idiomatic Python (e.g., medium, python.net).