====== Task 2: Q-learning and DQN in Stochastic BlockWorld ======

Note: Brute automatic evaluation currently uses **Python 3.13** and **PyTorch 2.8.0**, CPU only.

===== Henwas' troubles grow even greater =====

You surely remember our poor fella Henwas. Since last time, he has only grown older, and his memory does not serve him as it used to. Unfortunately, this means he now makes mistakes while moving the crates. Once again, he turns to you for help. He wants you to prescribe what he should do in any situation, should he suddenly forget the plan. In other words, he wants you to give him a //policy//.

===== Your Job =====

Download {{ :courses:zui:tasks:task_2_qlearning_dqn.zip |}}.

Once again, you are given a BlockWorld problem. Only now, there is a chance that the moved block ends up somewhere else. For example:

<code>
$ python blockworld.py
state = [[1], [3, 4], [2]]
actions = [(1, 3), (1, 2), (3, 0), (3, 1), (3, 2), (2, 1), (2, 3)]
: 3 2
state = [[3, 1], [4], [2]]
</code>

Here, block ''3'' should have been moved onto ''2'', but instead ended up on ''1''.

Your tasks are twofold. First of all, you will implement tabular Q-learning.

=== Task 1 - Tabular Q-learning ===

In ''student.py'', you'll find the following:

<code python>
from blockworld import BlockWorldEnv
import random

class QLearning():
    # don't modify the methods' signatures!
    def __init__(self, env: BlockWorldEnv):
        self.env = env

    def train(self):
        # Use BlockWorldEnv to simulate the environment with reset() and step() methods.
        # s = self.env.reset()
        # s_, r, done = self.env.step(a)
        pass

    def act(self, s):
        random_action = random.choice( s[0].get_actions() )
        return random_action
</code>

Implement the Q-learning algorithm in ''train()''. The ''act(s)'' method should return a single action to perform in the given state. To simulate the environment, use the ''reset()'' and ''step(a)'' methods of the given ''BlockWorldEnv''. The ''reset()'' method randomly chooses a new ''state'' and ''goal'' and returns them as a tuple ''(state, goal)''.
The ''step(a)'' method performs the action and returns the updated state in the form ''(new_state, goal)'', the reward, and a termination flag. To get the available actions, call ''state.get_actions()''.

Note that your algorithm has to react to all possible combinations of states and goals. You can assume that the number of blocks will be at most 4.

**Hint:** You can use Python's dictionary method ''get(key, default_value)'' to get a stored value, or ''default_value'' if the ''key'' is not present.

=== Task 2 - DQN ===

Only 4 blocks is not much though, so we will want to scale up. Here is where the second part of your task comes into play. The complexity of our problem grows quickly with an increasing number of blocks, however, so we will still limit ourselves to at most 5 blocks.

You will implement Deep Q-learning (DQN) (see lab8 for more detail), where we replace storing exact values in a dictionary with a neural network approximation. For this task you will need PyTorch. We only need the CPU installation, which can be installed easily through pip:

<code>
pip install torch
</code>

For alternative installation options, see https://pytorch.org/get-started/locally/. Be warned that the evaluation does not support GPU execution though.

You are allowed and encouraged to use LLMs to solve this part of the assignment.

To implement DQN we need:
  - A Q-value network.
  - A replay buffer, to sample past transitions.
  - The overall training loop, handling training and updating our target network.

Luckily, the first two are already done in ''dqn_networks.py'' and ''replay_buffer.py'' respectively. Both were designed to be sufficient for your task and you do not need to modify them.

In ''student_dqn.py'' you will find a skeleton of the overall training loop. There are three main methods you have to fill in. First of all, we need to somehow pass our state to the neural network. You should implement this in the prepared ''state_to_tensor'' method.
<code python>
def state_to_tensor(self, state) -> torch.Tensor:
    """Convert a state of the BlockWorld environment into a flat tensor.

    Args:
        state (_type_): The state tuple provided by the BlockWorld

    Returns:
        torch.Tensor: A torch suitable flat representation of the state.
    """
    # TODO: Create some suitable representation of the state
    state_tensor = torch.zeros(self.state_dim)
    return state_tensor
</code>

Keep in mind that the network needs to be able to distinguish states. For example, a naive flattening into a vector would make the states %%[[1], [2], [3]]%% and %%[[1, 2, 3]]%% indistinguishable.

**Hint:** A meaningful representation is to store, for each block, what it stands on. As this is an integer input, you may also wish to one-hot encode it.

In our environment we receive a reward of ''-0.1'' for each move and ''10'' for reaching the goal configuration. Such a reward is very sparse and only rarely provides information about which actions are good. To help with convergence, we will want to design some reward shaping that drives the learning towards more promising actions. This will be your second task.

<code python>
def compute_reward_shaping(self, state, next_state) -> float:
    """Computes the reward shaping when moving from state to next_state

    This serves to help the DQN drive the search to promising space,
    as just the standard sparse reward of -0.1 or 10 makes the task much harder.

    Args:
        state: The current environment state
        next_state: The environment state we transitioned to.

    Returns:
        float: The reward shaping component, which will be added to the original reward.
    """
    # TODO: Come up with a suitable reward shaping that will drive
    # the learning towards the goal.
    return 0
</code>

Be careful not to steer the learning towards undesirable behavior!

With these two components, you will be able to implement the training loop in the ''train'' method and the action extraction in the ''act'' method, similarly to tabular Q-learning.

DQN has more hyperparameters than tabular Q-learning.
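As an aside on the reward-shaping step above: one classic scheme that provably does not change which policies are optimal is //potential-based// shaping, where the shaping term is ''gamma * phi(s') - phi(s)'' for some potential ''phi''. The sketch below uses "number of blocks standing on their goal support" as the potential. It is only an illustration under assumptions: the states are taken to be plain lists of stacks (as printed by ''blockworld.py''), and the helper names are ours; in the actual environment you would extract the stacks from the state objects.

```python
def block_supports(stacks):
    """Map each block to the block it stands on (0 means the table)."""
    supports = {}
    for stack in stacks:
        below = 0  # the bottom block of each stack rests on the table
        for block in stack:
            supports[block] = below
            below = block
    return supports

def well_placed(stacks, goal_stacks):
    """Potential phi: how many blocks already stand on their goal support."""
    cur, goal = block_supports(stacks), block_supports(goal_stacks)
    return sum(1 for block in cur if cur[block] == goal[block])

def compute_reward_shaping(state, next_state, gamma=0.99):
    """Potential-based shaping term: gamma * phi(s') - phi(s)."""
    (stacks, goal), (next_stacks, _) = state, next_state
    return gamma * well_placed(next_stacks, goal) - well_placed(stacks, goal)
```

For instance, moving a block onto its goal support raises the potential by one and yields a positive shaping bonus, while undoing progress is penalized symmetrically, so the agent cannot farm reward by cycling.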
We therefore provide a config dataclass with a baseline parameter setting. You will likely want to experiment with the number of episodes to run the algorithm for (and how often to save), but the rest of the parameters should not require tuning. It is recommended to add any additional parameters you may need to this config.

<code python>
@dataclass
class DQNConfig:
    # Do not delete these! But you can play with the values.

    # PRNG seed, to get consistent results.
    seed: int = 42
    # Minibatch size
    batch_size: int = 128
    # Network parameters
    hidden_features: int = 256
    # Update rate of the target network
    target_net_update: float = 0.005
    # Overall learning rate
    lr: float = 3e-4
    # Replay buffer parameters
    replay_capacity: int = 10000
    # A prioritized replay trick. Since our environment normally has sparse
    # rewards (only a positive reward for the correct solution), it is quite
    # a hard task for the network to learn. We maintain a separate buffer
    # for the terminal steps, which actually give us our reward.
    # This ratio defines the fraction of the size of the
    # terminal-only buffer w.r.t. the normal buffer.
    # For details, look at the ReplayBuffer implementation.
    replay_terminal_ratio: float = 0.1
    # How many environment episodes to run
    num_episodes: int = 1000
    # How often to checkpoint the model. It will be saved at the interval of episodes defined by this.
    save_interval: int = 100

    # TODO: Your other parameters go here.
</code>

Every ''save_interval'' episodes, the train method saves everything you need to restore from that checkpoint (already implemented). Out of the saved data, the files important for the submission are ''config.json'', ''q_net_ep{episode_number}.pt'' and ''target_net_ep{episode_number}.pt''. These contain your configuration and the learned weights of your Q-value and target networks. Since the training takes too long for automatic evaluation, you will be submitting the checkpoint of this form that you wish to have evaluated.
**Importantly, rename the saved network weights to just ''q_net.pt'' and ''target_net.pt'' and move them to the same folder as ''student_dqn.py'' before submitting.**

The ''student_dqn.py'' script provides multiple parameters that can be specified via named arguments. Example usage:

To train on a 3-block environment with 1000 evaluation instances:

<code>
python student_dqn.py --num_blocks 3 --test_problems 1000
</code>

To restore a checkpoint from episode 5000 for a 5-block environment, stored in "checkpoints/" (the default setting), skipping further training and just evaluating:

<code>
python student_dqn.py --num_blocks 5 --test_problems 1000 --no_train --restore_episode 5000 --model_restore_path "checkpoints"
</code>

**Additional hints:**
  * Start on small instances. On a 3-block environment, the network should learn rather quickly and solve 100% of instances within 1000 episodes.
  * Stick with the provided DQN network architecture, optimizer and default parameters (except the number of episodes) to start with. You may get better results with tuning, but it should not be necessary.
  * Do not be discouraged if the episodes take very long at the start of training. If your algorithm is implemented well, the network will learn to act better with increasing training time and greatly shorten the episodes, speeding up the training.
  * On a 5-block environment, it takes approximately an hour and 5000 episodes to learn the reference solution.

===== Evaluation =====

Upload an archive containing ''student.py'', ''student_dqn.py'', ''config.json'', ''q_net.pt'' and ''target_net.pt'' to [[https://cw.felk.cvut.cz/brute/teacher/course/1706|Brute]], where it will be evaluated automatically. If you wish to get points only for the tabular part of the assignment, you may submit only ''student.py''. **Make sure that all these files are contained in the root folder of your submission!**

For the tabular evaluation: First, the ''train()'' method of your class is called and given 30 seconds to perform Q-learning.
Second, it is evaluated on 1000 instances, each with a limit of 50 steps and a time limit of 5 seconds for all of them together. You will gain 7 points if you perform comparably to the reference solution, and 0 points if you perform close to or worse than random. For your information, the reference solution solves 95% of problems with 6 steps on average. Random sampling solves about 35% of problems with about 20 steps on average.

For DQN: **Your solution will be evaluated on an environment with 5 blocks, so upload a checkpoint trained on such an environment**. Your checkpoint will be restored and evaluated on 1000 instances, each with a limit of 50 steps and a time limit of 120 seconds for all of them together. Similarly to the tabular evaluation, you will gain 3 points if you perform comparably to the reference, and 0 if you perform close to or worse than a threshold of 20% of instances solved. Here, the reference solution (trained for 5000 episodes) solves ~74% of problems with 7 steps on average. Random solves only ~7.4% of instances with 25 steps on average.

**Note:** In tabular Q-learning, you don't have to measure the time of your ''train()'' method yourself; the evaluation script terminates it automatically once its time limit elapses. Everything you compute in the allotted time and save within your class is then used in the evaluation. However, the evaluation is considered unsuccessful if you exceed the 5-second (tabular) or 120-second (DQN) evaluation time limit, and zero points are awarded in that case.

**Note:** You are supposed to implement Q-learning/DQN yourself, and we will randomly check your solutions.

===== BONUS: Gymnasium =====

The BlockWorld environment we have provided follows the interface of the popular (deep) reinforcement learning library [[https://gymnasium.farama.org/index.html|gymnasium]]. The Q-learning (or any other algorithm) you implement with the BlockWorld environment can easily be modified to work with environments defined in gymnasium.
The only catch when adapting the algorithm is that gymnasium environments expect that every action can be played in every state, unlike BlockWorld. So you only need to remove the part where you ask for the available actions, and instead use the actions from the environment's action space.

If you would like to try your algorithm in a different environment, you can use the code snippet below, which loads one of the simplest environments, called CartPole, where you balance a pole on a moving cart by pushing the cart either left or right. In the snippet, we also discretize the state, which is continuous by default. You should be able to use your Q-learning with this discretization.

<code python>
import gymnasium as gym
import numpy as np

def test_cartpole():
    env = gym.make('CartPole-v1')
    s, info = env.reset()

    bins = 8
    # We generate one more element, just to keep symmetry
    digitization_bins = [
        np.linspace(-2, 2, num=bins + 1),      # cart position
        np.linspace(-2, 2, num=bins + 1),      # cart velocity
        np.linspace(-0.3, 0.3, num=bins + 1),  # pole angle
        np.linspace(-2, 2, num=bins + 1),      # pole angle velocity
    ]

    def digitize(s):
        digitized_s = np.zeros_like(s)
        for i in range(len(s)):
            # Clamp the index, because np.digitize can return an index one
            # past the last bin, which would be out of range of the array.
            digit = np.minimum(np.digitize(s[i], digitization_bins[i]), len(digitization_bins[i]) - 1)
            digitized_s[i] = digitization_bins[i][digit]
        return digitized_s

    s = digitize(s)

    done = False
    while not done:
        a = env.action_space.sample()
        s_, r, terminated, truncated, info = env.step(a)
        done = terminated or truncated

        s = digitize(s_)
        print(s)

if __name__ == "__main__":
    test_cartpole()
</code>
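To connect the snippet back to Task 1: replacing ''env.action_space.sample()'' with an epsilon-greedy choice over a dictionary-backed Q-table, and adding the Q-learning update after each step, gives you the tabular algorithm. Below is a minimal self-contained sketch; the toy chain environment, helper names and hyperparameter values are ours, purely for illustration.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore; otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def q_update(Q, s, a, r, s_, next_actions, done, alpha=0.1, gamma=0.99):
    """One tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = 0.0 if done else max(Q.get((s_, a_), 0.0) for a_ in next_actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

def train_chain(num_episodes=500, seed=0):
    """Toy 5-state chain: start at 0, reward 10 at state 4, -0.1 per step."""
    random.seed(seed)
    Q, actions = {}, [-1, +1]
    for _ in range(num_episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon=0.2)
            s_ = max(0, s + a)  # a wall keeps the state in range
            r, done = (10.0, True) if s_ == 4 else (-0.1, False)
            q_update(Q, s, a, r, s_, actions, done)
            s = s_
    return Q
```

After training, extracting ''max(actions, key=lambda a: Q.get((s, a), 0.0))'' per state is the analogue of your ''act(s)'' method; on this chain it learns to always move right.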