====== T4c-vi - Value-iteration policy in pursuit-evasion ======

The main task is to implement a value-iteration policy for a robotic pursuit-evasion game.

|**Deadline** | 4. January 2020, 23:59 PST |
|**Points** | 6 |
|**Label in BRUTE** | t4c-vi |
|**Files to submit** | ''Player.py'' |
| | Do not submit the generated policy file. |
|**Resources** | {{ :courses:b4m36uir:hw:uir-t4c-vi.zip |T4c-vi resource files}} |

\\
===Assignment===

The number of players and robots is fixed for this task: there are two players in the game, one player with a single evading robot and one player with two pursuers.

In the file ''player/Player.py'', implement the value-iteration decision making for the pursuit-evasion game in the function ''value_iteration_policy''. The ''value_iteration_policy'' function has the following prescription:

<code python>
def value_iteration_policy(self, gridmap, evaders, pursuers):
    """
    Method to calculate the value-iteration policy action

    Parameters
    ----------
    gridmap: GridMap
        Map of the environment
    evaders: list((int,int))
        List of coordinates of evaders in the game (except the player's own robots, if the player is the evader)
    pursuers: list((int,int))
        List of coordinates of pursuers in the game (except the player's own robots, if the player is the pursuer)
    """
</code>

The purpose of the function is to compute a value-iteration policy and save the resulting policy into a file. An example of policy creation and saving is included in the resource pack of the task. The file is created using the ''pickle'' library and has the filename ''./policies/'name_of_the_game'number.policy'', where the number is 1 for the pursuer and 2 for the evader. Hence, the policy for the evader on the map ''grid'' should be saved in the file ''./policies/grid2.policy''.

The file should contain a tuple ''(p, i2c, c2i)''. ''i2c'' is a mapping from integer indexes to the passable coordinates of the map and ''c2i'' is the inverse mapping from coordinates to indexes; both are dictionaries. Let ''n'' be the number of passable coordinates. Then ''p'' is an array of size (n,n,n) for the evader and (n,n,n,2) for the pursuers that maps the indexes of the positions (evader, pursuer1, pursuer2) to the index of the next position of the evader, or to the pair of indexes [pursuer1, pursuer2], respectively.

During the gameplay, our player will load the policy produced by your player and will play according to it. The game ends after a predefined number of steps or when all the evaders are captured.

\\
===Computing the policy===

The algorithm is an adaptation of value iteration for MDPs. It iteratively updates the value of each state and converges to the optimum if each state is updated infinitely often. Let the set of states ''S'' be all passable coordinates, and let ''A'' be a function of a state that returns all pairs of actions applicable in the given state. ''r'' is a function of a state and a pair of actions that returns the immediate payoff, and ''T'' is the transition function that returns the next state given a state and the actions. The algorithm operates as follows:

\begin{align}
&\forall s \in S \quad \text{initialize} \quad v(s) = 0, \quad \text{and until } v \text{ converges:} \\
&\forall s \in S \\
& \quad \quad \forall (a_1,a_2) \in A(s) \\
& \quad \quad \quad \quad Q(a_1,a_2) = r(s,a_1,a_2) + \gamma v(T(s,a_1,a_2)) \\
& \quad \quad v(s) = \max_x \min_y xQy
\end{align}

where $Q$ is the accumulated payoff matrix corresponding to a matrix game between the two players, and $x$ and $y$ are strategies of player 1 and player 2 in the matrix game $Q$ (i.e., probability distributions over all moves available to the respective players).
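The inner step $v(s) = \max_x \min_y xQy$ is the value of a zero-sum matrix game and can be computed by a small linear program. The following is only a minimal sketch of the update loop, not the required implementation of ''value_iteration_policy'': the helpers ''actions'', ''reward'' and ''transition'' are hypothetical stand-ins for $A$, $r$ and $T$ above, and the loop runs for a fixed number of sweeps instead of testing convergence explicitly.

<code python>
import numpy as np
from scipy.optimize import linprog


def solve_matrix_game(Q):
    """Value and maximin strategy of the row player of the zero-sum matrix game Q."""
    m, k = Q.shape
    # variables: x_1..x_m (row strategy) and v (game value); minimize -v, i.e. maximize v
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # for every column j: v - sum_i x_i * Q[i, j] <= 0
    A_ub = np.hstack([-Q.T, np.ones((k, 1))])
    b_ub = np.zeros(k)
    # x must be a probability distribution: sum_i x_i = 1
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]


def value_iteration(states, actions, reward, transition, gamma=0.95, n_sweeps=100):
    """Sketch of value iteration for a zero-sum game on the given state space.

    states     -- iterable of hashable states s
    actions    -- actions(s) -> (rows, cols): actions of player 1 and player 2 in s
    reward     -- reward(s, a1, a2) -> immediate payoff r(s, a1, a2)
    transition -- transition(s, a1, a2) -> next state T(s, a1, a2)
    """
    v = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        v_new = {}
        for s in states:
            rows, cols = actions(s)
            # accumulated matrix Q(a1, a2) = r(s, a1, a2) + gamma * v(T(s, a1, a2))
            Q = np.array([[reward(s, a1, a2) + gamma * v[transition(s, a1, a2)]
                           for a2 in cols] for a1 in rows])
            v_new[s], _ = solve_matrix_game(Q)
        v = v_new
    return v
</code>

Whether the evader or the pursuer is the maximizing (row) player depends on the sign convention chosen for ''r''; the sketch assumes the row player maximizes.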
After convergence (or after a fixed number of iterations), you can compute the resulting policy in each state as the action that maximizes the payoff given the computed values; such a policy should already be able to easily defeat the ''greedy'' policy. However, to obtain the optimal strategy, the correct approach is to perform one more iteration and, when computing $v(s) = \max_x \min_y xQy$, extract the maximizing strategy from the same computation, $x^\ast = \text{arg}\max_x \min_y xQy$ (a sketch of storing the result in the required policy file is given at the end of the page). In the small example shown here, the optimal strategy of the evader in the first step is to pick each of the two free spaces with the same probability; choosing a single best action obviously cannot produce such a mixed strategy.

{{:courses:b4m36uir:hw:small0.png?400|Vi example}}

\\
===Evaluation===

Automatic evaluation is prepared for this task. You upload your player, it generates the policy, and the game is then played by our player according to your policy. Your submission will be tested in both roles against the ''greedy'' policy and as a pursuer against our ''value_iteration'' policy.
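For illustration, below is a hedged sketch of writing the evader's policy file in the required ''(p, i2c, c2i)'' format. The names are assumptions: ''passable_cells'' stands for the list of passable coordinates of the map, ''game_name'' for the name of the game, and ''best_move'' for a hypothetical dictionary that maps coordinate triples (evader, pursuer1, pursuer2) to the evader's chosen next coordinate, e.g. selected with the help of the strategy ''x'' returned by the matrix-game solver from the previous sketch.

<code python>
import os
import pickle

import numpy as np


def save_evader_policy(game_name, passable_cells, best_move):
    """Sketch of writing the evader policy to ./policies/<game_name>2.policy."""
    # index <-> coordinate mappings required in the policy tuple
    i2c = {i: c for i, c in enumerate(passable_cells)}
    c2i = {c: i for i, c in enumerate(passable_cells)}
    n = len(passable_cells)
    # p[e, p1, p2] = index of the evader's chosen next position
    p = np.zeros((n, n, n), dtype=int)
    for (e, p1, p2), nxt in best_move.items():
        p[c2i[e], c2i[p1], c2i[p2]] = c2i[nxt]
    os.makedirs("./policies", exist_ok=True)
    # the number 2 in the file name marks the evader's policy
    with open("./policies/" + game_name + "2.policy", "wb") as f:
        pickle.dump((p, i2c, c2i), f)
</code>

For the pursuer, the array would have shape (n, n, n, 2) and store the pair of next-position indexes of the two pursuers, and the file name would end with ''1.policy''.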