Search
You sure remember our poor fella Henwas. Since last time, he grew only older and the memory does not serve him as before. Unfortunately, this translates into him making mistakes during moving the crates.
Once again, he turns to you for help. He wants you to prescribe for him what to do in any situation, should he suddenly forget the plan. In another words, he wants you to give him a policy.
Download task2_qlearning.zip. Once again, you are given a BlockWorld problem. Only now, there is a chance that the moved block is put somewhere else. For example:
$ python blockworld.py state = [[1], [3, 4], [2]] actions = [(1, 3), (1, 2), (3, 0), (3, 1), (3, 2), (2, 1), (2, 3)] <from> <to>: 3 2 state = [[3, 1], [4], [2]]
3
2
1
In student.py, you'll find the following:
student.py
from blockworld import BlockWorldEnv import random class QLearning(): # don't modify the methods' signatures! def __init__(self, env: BlockWorldEnv): self.env = env def train(self): # Use BlockWorldEnv to simulate the environment with reset() and step() methods. # s = self.env.reset() # s_, r, done = self.env.step(a) pass def act(self, s): random_action = random.choice( s[0].get_actions() ) return random_action
Implement Q-learning algorithm in train(). The act(s) method should return a single action to perform in the given state. To simulate the environment, use reset() and step(a) methods of the given BlockWorldEnv.
train()
act(s)
reset()
step(a)
BlockWorldEnv
The reset() method randomly chooses a new state and goal and returns them as a tuple (state, goal). The step(a) method performs the action and returns an updated state in form of (new_state, goal), reward and a termination flag. To get available actions, call state.get_actions().
state
goal
(state, goal)
(new_state, goal)
state.get_actions()
Note that your algorithm has to react to all possible combination of states and goals. You can assume that number of blocks will be less or equal to 4.
Hint: You can use Python's dictionary get(key, default_value) method to get a stored value, or the default_value if the key is not present.
get(key, default_value)
default_value
key
Upload your solution (only student.py) into Brute, where it will be evaluated automatically.
First, train() method of your class is called and given 30 seconds to perform Q-learning. Second, it is evaluated on 1000 instances, each with a limit of 50 steps and a time limit of 5 seconds for all of them together.
You will gain 10 points, if you perform comparatively to the reference solution, and 0 points if you perform close or worse than random. For your information, the reference solutions solves 95% of problems with 6 steps on average. Random sampling solves about 35% of problems with about 20 steps on average.
Note: You don't have to measure the time of your train() method, the evaluation script will automatically terminate it when 15 seconds elapse. All what you compute in the alloted time and save within your class is then used in the evaluation. However, the evaluation is considered unsuccessful if you exceed 5 seconds, and zero points are awarded in that case.
Note: You are supposed to implement Q-learning and we will randomly check your solutions.
The BlockWorld environment we have provided, follows a syntax of popular (deep) reinforcement learning library gymnasium. The Q-Learning (or any other algorithm) you implement with BlockWorld environment can be easily modified to use within environments defined in gymnasium.
The only problem with adapting the algorithm would be that the gymnasium environments expect that each action can be played in each state which is different to BlockWorld. So you only need to remove the part, where you ask for available actions and instead use an available actions from the environment.
If you would like to try your algorithm with some different environment you can use this code snippet, which loads one of the simplest environments called CartPole, where you are balancing pole on the moving cart by pushing the cart either left or right. In the snippet, we also discretize the state, which is continuous by default. You should be able to use your Q-Learning with this discretization
import gymnasium as gym import numpy as np def test_cartpole(): env = gym.make('CartPole-v1') s, info = env.reset() bins = 8 # We generate one more element, just to keep symmetry digitization_bins = [ np.linspace(-2, 2, num=bins + 1), # cart position np.linspace(-2, 2, num=bins + 1), # cart velocity np.linspace(-0.3, 0.3, num=bins + 1), # pole angle np.linspace(-2, 2, num=bins + 1), # pole angle velocity ] def digitize(s): digitized_s = np.zeros_like(s) for i in range(len(s)): # We do this because digitize can return bin with index bin, which would be out of range of the list. digit = np.minimum(np.digitize(s[i], digitization_bins[i]), len(digitization_bins) - 1) digitized_s[i] = digitization_bins[i][digit] return digitized_s s = digitize(s) done = False while not done: a = env.action_space.sample() s_, r, terminated, truncated, info = env.step(a) done = terminated or truncated s = digitize(s_) print(s) if __name__ == "__main__": test_cartpole()