====== Task 5 - EN - Grid-world ======

The goal of this assignment is to find the optimal decision policy of an agent maximizing its reward in a GridWorld MDP problem (see [[http://cw.fel.cvut.cz/wiki/_media/courses/b4b36zui/mdps_show_2019.pdf|slides]] or [[https://en.wikipedia.org/wiki/Markov_decision_process|Wikipedia]] for details).

A GridWorld consists of a rectangular grid of $m \times n$ cells. The cells are described using a coordinate system with $[0,0]$ in the top left corner. An agent can take four different actions --- it can go one cell north, east, south, or west from the cell it is occupying. The actions are not deterministic: the intended action is executed only with probability $p$ (''action_proba'') and some other action is executed with probability $1-p$. More specifically, only the actions neighbouring the intended cardinal direction may be executed instead, each with uniform probability, i.e. if the agent's intended action is //north//, it will be executed with probability $p$ and the action //east// or //west// will be executed instead, each with probability $\frac{1-p}{2}$.

Each action has an associated cost $c$ (''action_cost''), which, for the purposes of this assignment, is identical for each action. This cost is applied only when the cell in which the action ends has no associated reward --- if the cell $[i,j]$ has an associated reward $r_{i,j}$, the reward overrides the cost. In all the predefined worlds, the cells with a defined reward are also set to be terminal states in which the agent ends when it reaches them. It is recommended to keep the set of reward states the same as the set of terminal states in your experiments. A reward (or a cost if it is negative) is obtained when the agent reaches the associated cell.

==== Algorithms ====

While MDPs can be solved by both linear programming and dynamic programming, this task focuses on the latter --- namely, only two variants of dynamic programming for solving the MDP are needed: //value iteration// and //policy iteration//. The dynamic programming approach consists of iteratively estimating the values of individual states $V(s)$ and choosing the best policy $\pi(s)$ for deciding for an action $a$ from the set of actions $\mathbb{A}$ for each state $s$ from the set of states $\mathbb{S}$. First, the actions are evaluated using the known valuation of states $V_{n-1}(s)$ and the transition probabilities $P(s' | a, s)$, where $s'$ is the target state, $s$ is the current state, and $a$ is the executed action:
\begin{equation}
Q_{n}(s,a) = R(s) + \sum_{s' \in \mathbb{S}} P(s' | a, s) \gamma V_{n-1}(s')
\end{equation}
where $R(s)$ is the reward for leaving state $s$ and $\gamma \in [0,1]$ is the discount factor.
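To make this update concrete, the following is a minimal NumPy sketch of the $Q_n$ computation, assuming the array representation described in the Implementation section below ($P$ of shape $|\mathbb{S}| \times |\mathbb{A}| \times |\mathbb{S}|$ with $P[s,a,s'] = P(s' | a, s)$, and $R$ and $V_{n-1}$ of shape $|\mathbb{S}|$). The function name and signature are illustrative only and are not the ones required by ''ZUI_MDP.py''.

<code python>
import numpy as np

def q_from_v_sketch(P, R, V_prev, gamma):
    """Illustrative Q-update: Q[s, a] = R[s] + gamma * sum_s' P[s, a, s'] * V_prev[s'].

    P       -- transition probabilities, shape (n_states, n_actions, n_states)
    R       -- rewards per state, shape (n_states,)
    V_prev  -- previous valuation V_{n-1}, shape (n_states,)
    gamma   -- discount factor in [0, 1]
    """
    # Expected discounted value of the successor state for every (s, a) pair.
    expected_next = gamma * np.einsum('sat,t->sa', P, V_prev)
    # The reward R(s) does not depend on the action, so it is broadcast over a.
    return R[:, None] + expected_next
</code>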
The optimal policy $\pi_{n}(s)$ is chosen as
\begin{equation}
\pi_{n}(s) = \operatorname*{argmax}_{a \in \mathbb{A}} Q_{n}(s,a)
\end{equation}
and finally, the valuation of states $V_n$ is recomputed:
\begin{equation}
V_n(s) = R(s) + \sum_{s' \in \mathbb{S}} P(s' | \pi_n(s), s) \gamma V_{n-1}(s')
\end{equation}

===Value iteration===

The //value iteration// algorithm skips explicitly computing the optimal policy and consists of a single step repeated until a convergence criterion is met:
\begin{equation}
V_n(s) = \max_{a \in \mathbb{A}} \left(R(s) + \sum_{s' \in \mathbb{S}} P(s' | a, s)\gamma V_{n-1}(s')\right)
\end{equation}

===Policy iteration===

The //policy iteration// consists of two steps --- computation of the optimal policy $\pi_n(s)$ and evaluation of the states $V_n(s)$ given the policy $\pi_n(s)$. The evaluation of the states for a given policy can be computed either by solving a set of linear equations or iteratively, similarly to the //value iteration//. The assignment uses the iterative version, which consists of the repeated computation of $Q_{n}(s,a)$ and $V_n$ until $V_n$ meets the convergence criterion.

====Implementation====

There are several implementation details worth mentioning. First, the functions are represented as multidimensional arrays --- $R(s)$ is replaced by an array $R$ of shape $|\mathbb{S}|$ where $R[s] = R(s)$, and $P(s' | a, s)$ is replaced by $P$ of shape $|\mathbb{S}| \times |\mathbb{A}| \times |\mathbb{S}|$ where $P[s,a,s'] = P(s' | a, s)$. Similarly, $V_n(s)$ is stored as an array $V_n$ of shape $|\mathbb{S}|$ and $Q_n(s,a)$ as $Q_n$ of shape $|\mathbb{S}| \times |\mathbb{A}|$. The set of states $\mathbb{S}$ consists of all cells of the grid and a **terminal sink state**, which is an added state that cannot be left (all actions lead back to it with probability 1) and which has no reward for reaching it. All actions from the states **on** the grid that are considered terminal lead to the sink state, which prevents the repeated application of the rewards associated with such states.\\ \\

===Environment===

The task is to be implemented in **Python 3.8**, in which the provided code is also written. The other necessary packages are //NumPy//, //matplotlib//, and //seaborn//. It is recommended to use the [[https://www.anaconda.com/download/|Anaconda distribution]], which also contains a package manager that allows installing many pre-compiled packages (especially beneficial on Windows, where compiling packages is often problematic).

====Task====

The task consists of two parts, an implementative one and an experimental one. The goal of the implementative part is to implement the missing parts of several functions, while the experimental part consists of several small experiments with the GridWorld. The outputs of the assignment are the implemented code, the scripts (or Jupyter notebooks) for launching the experiments, and the report.\\ \\

===Implementative part [4p]===

Your goal is to implement several helper functions (''Q_from_V'', ''Q2V'', ''Q2Vbypolicy'', and ''Q2policy''), the evaluation of the MDP for a given policy (''evaluate_policy''), value iteration (''value_iteration''), and policy iteration (''policy_iteration'') in the file ''ZUI_MDP.py''. You are also provided with a set of test cases (''test_ZUI_MDP.py'') that can be used for testing your implementation using e.g. the //unittest// or //nosetests// modules.
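To illustrate how these pieces fit together, below is a minimal, self-contained sketch of the value-iteration loop built on the array representation above. It is not the ''value_iteration'' required by ''ZUI_MDP.py'' (whose signature is fixed by the template); the stopping criterion based on the maximum change of $V_n$ and the tolerance ''tol'' are assumptions made for the sketch only.

<code python>
import numpy as np

def value_iteration_sketch(P, R, gamma, tol=1e-9, max_iter=10000):
    """Iterate V_n(s) = max_a [ R(s) + gamma * sum_s' P(s'|a,s) V_{n-1}(s') ].

    Returns the value array V of shape (n_states,) and the greedy policy
    (an array of action indices, one per state).
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = R[s] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R[:, None] + gamma * np.einsum('sat,t->sa', P, V)
        V_new = Q.max(axis=1)
        # Assumed convergence criterion: largest change of any state value.
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1)
</code>

Policy iteration differs only in that the maximization over actions is replaced by evaluating a fixed policy $\pi_n$ (iteratively, as in the assignment, or by solving the linear system) before the policy is improved again.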
It is not necessary to use the tests, but it is recommended, as the correctness of your implementation will be tested similarly (mostly the same tests with just different parameterization).\\ \\

===Experimental part [4.5p]===

Once you have correctly implemented all the necessary functions, your goal is to experiment with the GridWorld MDPs. You are required to do at least 3 different experiments, of which two are assigned (and are the same for all of you); you are required to come up with your own idea for the remaining one. All the experiments have to be thoroughly described in the report --- the settings, the GridWorld parametrization used, the goals of the experiments, the results (including visualizations), and (if necessary) a conclusion.\\ \\

==Experiment 1: Policy switching based on action probability==

Your goal is to analyze how the optimal policy changes with the changes of the action probability $p$ for the predefined GridWorld ''3x4''. The output should be a list of values of $p$ at which a change in the optimal policy occurs, and also a plot showing the valuation of states 0, 3, 6, 8, 9, 10, and 11 together with the thresholds where the policy changes occur (a sketch of such a parameter sweep is shown after the experiment descriptions). The plot should be similar to the plot {{experiment_1_3x3.pdf?linkonly}}, which shows Experiment 1 for the GridWorld ''3x3'' and states 0, 1, 3, 6, and 8. The probability $p$ should be on the x axis and the valuation on the y axis.\\ \\

==Experiment 2: Policy switching based on action cost==

Your goal is to analyze how the optimal policy changes with the changes of the action cost $c \in [0,\infty)$ for the predefined GridWorld ''5x5''. The output should be a list of values of $c$ at which a change in the optimal policy occurs. You should also show plots of the several most interesting policies (you can use the method ''GridWorld.plot'') --- the exact policies you show are up to you.\\ \\

==Experiment 3 (or more)==

You are required to come up with your own experiment that should be at least as complex as the previous two experiments. The possibilities are almost endless, e.g. measuring the runtime based on the size of the grid, the influence of both the action cost $c$ and the action probability $p$ on certain states (you would plot a heatmap with the action probability $p$ on the $x$ axis, the action cost $c$ on the $y$ axis, and the color encoding the value of the given state), or simple evolutionary algorithms for finding good policies (using the function ''evaluate_policy'' for evaluating the objective, or writing a custom evaluation function that solves the equations instead of iterating). **Your** experiment **should be different** from the experiments described above (i.e. it is not sufficient to just take the experiments described above and change the GridWorld instance to another predefined GridWorld instance). You are required to include stand-alone scripts named ''<username>_experiment_<number>.py'' (e.g. ''kuncvlad_experiment_2.py'') that are launchable in the described environment. For experiments running longer than 5 minutes, write the approximate running time in the report. Your experiments can also be in [[http://jupyter.org/|Jupyter notebooks]] (''.ipynb'').\\ \\
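For Experiments 1 and 2, the core of the work is a one-dimensional parameter sweep in which the optimal policies for consecutive parameter values are compared. The sketch below shows one possible structure for such a sweep; ''make_world'' and ''solve'' are hypothetical placeholders for however you instantiate the provided GridWorld for a given parameter value and for your own solver (e.g. ''value_iteration''), so adapt the names and signatures to the actual template.

<code python>
import numpy as np

def policy_change_points(p_values, make_world, solve):
    """Sweep a parameter (e.g. the action probability p) and report where
    the optimal policy changes.

    p_values   -- 1-D array of parameter values, e.g. np.linspace(0.01, 1.0, 500)
    make_world -- hypothetical factory returning the MDP arrays (P, R, gamma) for a value
    solve      -- your solver, e.g. value_iteration, returning (V, policy)
    """
    change_points, values = [], []
    prev_policy = None
    for p in p_values:
        P, R, gamma = make_world(p)
        V, policy = solve(P, R, gamma)
        values.append(V)
        # Record the parameter value whenever the optimal policy differs
        # from the one found in the previous step.
        if prev_policy is not None and not np.array_equal(policy, prev_policy):
            change_points.append(p)
        prev_policy = policy
    # np.vstack(values)[:, state] can be plotted against p_values for selected states.
    return change_points, np.vstack(values)
</code>

The same structure works for Experiment 2 with the action cost $c$ as the swept parameter; note that the detected change points are only as precise as the sweep resolution, so a finer grid (or bisection around each detected change) may be needed for accurate thresholds.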
===Report [1.5p]===

You are required to describe all the steps in a short report. The report has no fixed length limit, but it is evaluated on the basis of the completeness and correctness of the presented information. You can get up to **1.5 points** for the neatness and formal requirements of the report --- e.g. your figures should have captions summarizing what is shown, you should reference the figures from the text (the use of ''\cref{}'' is recommended), your report should have a logical structure, etc. You are required to hand in the report in PDF format (any PDF format that the graders can open is acceptable, but the ''PDF/A'' format is guaranteed to be fine). You are **not required** to use LaTeX, but it is **highly recommended**; furthermore, a LaTeX template ''ZUI_template.tex'' is provided for your use (and using the template is also **recommended**).\\ \\

==LaTeX editor==

Unless you have LaTeX installed locally, it is recommended to use [[https://www.overleaf.com/?r=8e97ffd0&rm=d&rs=b|Overleaf v2]], which provides an online editor, many predefined templates, and online collaboration (which might be useful for your other projects). Overleaf v2 is the result of a merge of Overleaf v1 and ShareLaTeX several years ago.

====Grading====

The whole assignment is worth **10 points in total**. The implementative part is worth **4 points in total** --- the implementation of the helper functions is worth **1 point** and the implementation of ''evaluate_policy'', ''value_iteration'', and ''policy_iteration'' is worth **1 point per function**. The experimental part is worth **4.5 points in total**, where each experiment is worth **1.5 points** (note that the points are transferable between experiments --- a great experiment might offset a poorer one). The final **1.5 points** are for the report itself. A summary of the grading is shown below:

^ subtask ^ points ^
| helper functions | 1 |
| ''evaluate_policy'' | 1 |
| ''value_iteration'' | 1 |
| ''policy_iteration'' | 1 |
| Experiment 1 | 1.5 |
| Experiment 2 | 1.5 |
| Experiment 3 | 1.5 |
| report | 1.5 |
^ total ^ 10 ^

====Deadline====

The deadline for submission into the [[https://cw.felk.cvut.cz/brute/student/|upload system]] is <del>**12.5.2020 23:59:59 CEST**</del> **17.5.2020 23:59:59 CEST**.

====Files====

All necessary files are in {{assignment_5_mdp_ver_4.zip?linkonly}}. The archive includes the Python template, the test code, and also the LaTeX template. It also includes a document describing the assignment (including potentially helpful tips) and its LaTeX source code.