Warning

# Task 5 - EN - Grid-world

The description of the task might significantly change before its publication on Tuesday.

The goal of this assignment is to find optimal decision policy of an agent maximizing its reward in a GridWorld MDP problem (see slides or Wikipedia for details). A GridWorld consists of a rectangular grid of $m \times n$ cells. The cells are described using coordinate system with $[0,0]$ in top left corner. An agent can make for different actions — it can go one cell north, east, south, and west from the cell it is occupying. The actions are not deterministic and the intended action will be executed only with probability $p$ (action_proba) and some other actions will be executed with probability $1-p$. More specifically, only the actions that are neighbouring the intended cardinal direction might be executed instead and each of them with uniform probability, i.e. if the agent's intended action is north, it will be executed with probability $p$ and action east or west will be executed instead with probability $\frac{1-p}{2}$. Each action has an associated cost $c$ (action_cost), which, for the purposes of this assignment, is identical for each action. This cost is applied only when the cell where the action would have ended has no associated reward — if the cell $[i,j]$ has an associated reward $r_{i,j}$, the reward overrides the cost.

In all the predefined worlds, the cells with defined reward are also set be terminal states where the agent ends when he reaches them. It is recommend for your experiments to keep the set of reward states the same as the set of the terminal state. A reward (or a cost if it is negative) is obtained if the associated cell is reached by the agent.

### Algorithms

While the MDPs can be solved by both linear programming and dynamic programming, this task focuses on the latter — namely, only two variants of dynamic programming for solving the MDP are needed: value iteration and policy iteration. The dynamic programming approach consists of iterative estimation of the values of individual states $V(s)$ and choosing the best policy $\pi(s)$ for deciding for action $a$ from the set of actions $\mathbb{A}$ for each state 0 $s$ from the set of states $\mathbb{S}$.

First, the action are evaluated using known valuation of states $V_{n-1}(s)$ and the transitional probabilities $P(s' | a, s)$ where $s'$ is the target state, $s$ is the current state, and $a$ is the executed action:

$$Q_{n}(s,a) =R(s) + \sum_{s' \in \mathbb{S}} P(s' | a, s) \gamma V_{n-1}(s')$$

where $R(s)$ is the reward for going from state $s$ and $\gamma \in [0,1]$ is the discount factor.

The optimal policy $\pi_{n}(s)$ is chosen as $$\pi_{n}(s) = argmax_{\substack{ a \in \mathbb{A} } } Q_{n}(s,a)$$

and finally, the valuation of states $V_n$ is recomputed:

$$V_n =R(s) + \sum_{s' \in \mathbb{S}} P(s' | \pi_n(s), s) \gamma V_{n-1}(s')$$

#### Value iteration

The value iteration algorithm skips explicitly computing the optimal policy and consists of a single step repeated until convergence criterion is met: $$V_n = \max_{a \in \mathbb{A}} \left(R(s) + \sum_{s' \in \mathbb{S}} P(s' | a, s)\gamma V_{n-1}(s')\right)$$

#### Policy iteration

The policy iteration consists of two steps — computation the optimal policy $\pi_n(s)$ and evaluation of states $V_n(s)$ given the policy $\pi_n(s)$. The evaluation of states for given policy might be either computed solving set of equations or iteratively similarly as in the value iteration. The assignment uses the iterative version which consists of repeated computation of $Q_{n}(s,a)$ and $V_n$ until the $V_n$ meets the convergence criterion.

### Implementation

There are several implementation details that are worth. First, while, the functions are defined as multidimensional arrays — $R(s)$ is replaced by an array $R$ of shape $|\mathbb{S}|$ where $R[s] = R(s)$, $P(s' | a, s)$ is replaced by $P$ of shape $|\mathbb{S}| \times |\mathbb{A}| \times |\mathbb{S}|$ where $P[s,a,s'] = P(s' | a, s)$. Also $V_n(s)$ is saved as an array $V_n$ of shape $|\mathbb{S}|$ and $Q_n(s,a)$ as $Q_n$ of shape $|\mathbb{S}| \times |\mathbb{A}|$.

The set of states $\mathbb{S}$ consists of all cells from the grid and a terminal sink state which is an added state that cannot be left (all action with lead back to it with probability 1) and which has no reward for reaching it. All action from the states on the grid that are considered to be terminal leads to the sink state to prevent repeated application of rewards that are associated with such states.

#### Environment

The task is to be implemented and the provided codes are in python 3.6. The other necessary packages are NumPy, matplotlib, seaborn.

It is recommended to use Anaconda distribution which also contains a package manager which allows to install many pre-compiled packages (which is especially beneficial when using Windows as it is often problematic to compile packages there).

The task consists of two parts, an implementative one and experimental one. The goal of the implementative part is to implement missing parts of several functions while the experimental parts consists of several small experiments with the GridWorld. The output of the assignment are the implemented codes, script (or Jupyter notebook) for launching the experiments, and report.

#### Implementative part [4p]

Your goal is to implement several helper functions (Q_from_V, Q2V, Q2Vbypolicy, and Q2policy), evaluation of the MDP for given policy (evaluate_policy), value iteration (value_iteration), and policy iteration (policy_iteration) in the file ZUI_MDP.py. You are also provided a set of test cases (test_ZUI_MDP.py) that can be used for testing your implementation using e.g. unittest or nosetests modules. It is not necessary to use the tests but it is recommended as the correctness of your implementation will be tested similarly (mostly same tests with just different parameterization).

#### Experimental part [4.5p]

Once you have correctly implemented all the necessary functions, your goal is to experiment with the GridWorld MDPs. You are required to do at least 3 different experiments from which two are assigned (and are the same for all of you) and you are required to come up with your own ideas for the remaining one. All the experiments have to be thoroughly described in the report — the settings, used GridWorld parametrization, goals of the experiments, results (including visualization) and (if necessary) a conclusion.

##### Experiment 1: Policy switching based on action probability

Your goal is to analyze how the optimal policy changes with the changes of action probability $p$ for the predefined GridWorld 3×4. The output is to be a list of values of $p$ for which there occurs a change in the optimal policy and also a plot showing the valuation of states 0, 3, 6, 8, 9, 10, and 11 together with thresholds where the policy change occurs. The plot should be similar to plot experiment_1_3x3.pdf which shows the Experiment 1 for the GridWorld 3×3 and states 0, 1, 3, 6, and 8. The probability $p$ should be on the x axis and the valuation on the y axis.

##### Experiment 2: Policy switching based on action cost

Your goal is to analyze how the optimal policy changes with the changes of action costs $c \in [0,\infty]$ for the predefined GridWorld 5×5. The output is to be a list of values of $c$ for which there occurs a change in the optimal policy. You also should show plots of the several most interesting policies (you can use method GridWorld.plot) — the exact policies you show are up to you.

##### Experiments 3 (or more)

You are required to come up with your own experiment that should be at least as complex as the previous two experiments. The possibilities are almost endless, e.g. measuring the runtime based on the size of the grid, influence of both action cost $c$ and action probability $p$ on certain states (you would plot a heatmap where on the $x$ axis would be action probability $p$, on the $y$ axis would be action cost $c$ and the color would encode the value of the given state), or simple evolutionary algorithms for finding good policies (using function evaluate_policy for evaluation the objective or writing a custom evaluation function that solves equations for evaluation the policy instead of iterating). Your experiment should be different from the experiments described above (i.e. it is not sufficient to just take the experiments described above and change the GridWorld instance to another predefined GridWorld instance).

You are required to include stand-alone scripts named <username>_experiment_<n>.py (e.g. kuncvlad_experiment_2.py) that are launchable in the described environment. For experiments running longer than 5 minutes, write the approximate running time in the report. Your experiment can be also in Jupyter notebooks (.ipynb).

#### Report [1.5p]

You are required to describe all the steps in a short report. The report has no fixed length limit but it is evaluated on the basis of completeness and correctness of presented information. You can get up to 1.5 points for the neatness and formal requirements of the report — e.g. your figures should have captions summarizing what is there, you should reference figures from the text (usage of \cref{} is recommended), your report should have a logical structure, etc. You are required to hand in the report in the PDF format (any PDF format that graders can open is acceptable but it is guaranteed that format PDF/A is fine). You are not required to use LaTeX but it is highly recommended, furthermore a LaTeX template ZUI_template.tex is provided for your use (and it is also recommended to use the template).

##### LaTeX editor

Unless you have installed LaTeX locally, it is recommended to use Overleaf v2 which provides an online editor and also has many predefined templates and also allows online collaboration (which might be useful for your other projects). The Overleaf v2 is the result of a merge of Overleaf v1 and ShareLatex several years ago.

The whole assignment is worth 10 points in total. The implementative part is for 4 points in total — implementation of the helper functions is worth 1 point and the implementation of evaluate_policy, value_iteration, and policy_iteration is worth 1 point for each function. The experimental part is worth 4.5 points in total where each experiment is worth 1.5 point (Note that the points are transferable between experiments — a great experiment might offset poorer experiment). The final 1.5 points are for the report itself. Summary of the grading is shown below:

helper functions 1
evaluate_policy 1
value_iteration 1
policy_iteration 1
Experiment 1 1.5
Experiment 2 1.5
Experiment 3 1.5
report 1.5
total 10