Task 2: BlockWorld and Q-learning

Henwas' troubles grow even greater

You surely remember our poor fellow Henwas. Since last time, he has only grown older, and his memory does not serve him as well as before. Unfortunately, this means he sometimes makes mistakes while moving the crates.

Once again, he turns to you for help. He wants you to prescribe what to do in any situation, should he suddenly forget the plan. In other words, he wants you to give him a policy.

Your Job

Download task2_qlearning.zip. Once again, you are given a BlockWorld problem. Only now, there is a chance that a moved block ends up somewhere other than intended. For example:

$ python blockworld.py

state = [[1], [3, 4], [2]]
actions = [(1, 3), (1, 2), (3, 0), (3, 1), (3, 2), (2, 1), (2, 3)]

<from> <to>: 3 2
state = [[3, 1], [4], [2]]
Here, block 3 should have been moved onto block 2, but instead it ended up on block 1.

In student.py, you'll find the following:

from blockworld import BlockWorldEnv
import random
 
class QLearning():
	# don't modify the methods' signatures!
	def __init__(self, env: BlockWorldEnv):
		self.env = env
 
	def train(self):
		# Use BlockWorldEnv to simulate the environment with reset() and step() methods.
 
		# s = self.env.reset()
		# s_, r, done = self.env.step(a)
 
		pass
 
	def act(self, s):
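		# s is a (state, goal) tuple; this placeholder just returns a random valid action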
		random_action = random.choice( s[0].get_actions() )
		return random_action

Implement the Q-learning algorithm in train(). The act(s) method should return a single action to perform in the given state. To simulate the environment, use the reset() and step(a) methods of the given BlockWorldEnv.

The reset() method randomly chooses a new state and goal and returns them as a tuple (state, goal). The step(a) method performs the action and returns the updated state in the form (new_state, goal), a reward, and a termination flag. To get the available actions, call state.get_actions().
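
For illustration, a bare interaction loop inside train() could look like this. It is only a sketch that picks random actions; it relies on the reset()/step()/get_actions() interface described above:

s = self.env.reset()                        # s is a (state, goal) tuple
done = False
while not done:
	a = random.choice(s[0].get_actions())   # actions available in the current state
	s_, r, done = self.env.step(a)          # s_ is again (state, goal)
	s = s_                                  # continue from the new state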

Note that your algorithm has to react to all possible combinations of states and goals. You can assume that the number of blocks will be less than or equal to 4.

Hint: You can use Python's dictionary get(key, default_value) method to get a stored value, or the default_value if the key is not present.
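
For example, a single tabular Q-learning update could be sketched as follows. The names Q, alpha and gamma are illustrative (not part of the template), and the sketch assumes the (state, goal) pairs and actions can be used as dictionary keys:

# One Q-learning update with a dictionary-backed Q-table (illustrative sketch).
# s and s_ are (state, goal) tuples, a is the action taken, r the received reward.
key = (s, a)
best_next = max((Q.get((s_, a_), 0.0) for a_ in s_[0].get_actions()), default=0.0)
target = r + gamma * (0.0 if done else best_next)
Q[key] = Q.get(key, 0.0) + alpha * (target - Q.get(key, 0.0))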

Evaluation

Upload your solution (only student.py) into Brute, where it will be evaluated automatically.

First, the train() method of your class is called and given 30 seconds to perform Q-learning. Second, your solution is evaluated on 1000 instances, each with a limit of 50 steps and a time limit of 5 seconds for all of them together.

You will gain 10 points if you perform comparably to the reference solution, and 0 points if you perform close to or worse than random. For your information, the reference solution solves 95% of problems in 6 steps on average. Random sampling solves about 35% of problems in about 20 steps on average.

Note: You don't have to measure the time of your train() method yourself; the evaluation script will automatically terminate it when 15 seconds elapse. Everything you compute in the allotted time and save within your class is then used in the evaluation. However, the evaluation is considered unsuccessful if you exceed the 5-second evaluation limit, and zero points are awarded in that case.

Note: You are supposed to implement Q-learning and we will randomly check your solutions.

BONUS: Gymnasium

The BlockWorld environment we have provided follows the interface of the popular (deep) reinforcement learning library gymnasium. The Q-learning (or any other algorithm) you implement with the BlockWorld environment can therefore be easily adapted to environments defined in gymnasium.

The only issue when adapting the algorithm is that gymnasium environments expect every action to be playable in every state, which differs from BlockWorld. You only need to remove the part where you ask the state for its available actions and instead use the actions provided by the environment, as sketched below.
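
For instance, with a discrete gymnasium action space the set of available actions is the same in every state and can be enumerated once (a sketch; env stands for any gymnasium environment with a Discrete action space):

# The available actions are simply 0 .. n-1 in every state,
# instead of state.get_actions() as in BlockWorld.
actions = list(range(env.action_space.n))   # e.g. [0, 1] for CartPole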

If you would like to try your algorithm in a different environment, you can use the code snippet below. It loads one of the simplest environments, CartPole, where you balance a pole on a moving cart by pushing the cart either left or right. In the snippet, we also discretize the state, which is continuous by default. You should be able to use your Q-learning with this discretization.

import gymnasium as gym
import numpy as np
 
 
def test_cartpole():
 
  env = gym.make('CartPole-v1')
 
  s, info = env.reset()
  bins = 8
 
  # We generate one more element, just to keep symmetry
  digitization_bins = [
      np.linspace(-2, 2, num=bins + 1),  # cart position
      np.linspace(-2, 2, num=bins + 1),      # cart velocity
      np.linspace(-0.3, 0.3, num=bins + 1),  # pole angle
      np.linspace(-2, 2, num=bins + 1),      # pole angle velocity
  ]
 
 
  def digitize(s):
    # Map each continuous component of the state to the value of its nearest bin edge.
    digitized_s = np.zeros_like(s)
    for i in range(len(s)):
      # np.digitize may return an index equal to the number of bin edges,
      # which would be out of range, so clamp it to the last edge.
      digit = np.minimum(np.digitize(s[i], digitization_bins[i]), len(digitization_bins[i]) - 1)
      digitized_s[i] = digitization_bins[i][digit]
    return digitized_s
 
  s = digitize(s)
  done = False
 
  while not done:
    a = env.action_space.sample()
    s_, r, terminated, truncated, info = env.step(a)
 
    done = terminated or truncated
 
    s = digitize(s_)
 
    print(s)
 
 
if __name__ == "__main__":
  test_cartpole()
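
If you plug your Q-learning into this snippet, keep in mind that the digitized state is a NumPy array, which cannot be used as a dictionary key directly; convert it to a tuple first, e.g. tuple(digitize(s)).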
