====== Task13 - Value-iteration policy in pursuit-evasion ======

The main task is to implement the Value-iteration policy for a robotic pursuit-evasion game.

|**Deadline** | 13. January 2019, 23:59 PST |
|**Points** | 6 |
|**Label in BRUTE** | Task13 |
|**Files to submit** | archive with ''player'' |
| | Minimal content of the archive: ''player/Player.py'' |
| | Do not submit the ''.policy'' files with the stored precalculated policy! |
|**Resources** | {{ :courses:b4m36uir:hw:task11.zip |Task11 resource files}} |

===Assignment===

In the file ''player/Player.py'', implement the Value-iteration policy decision making for the pursuit-evasion game in the function ''value_iteration_policy''. The Value-iteration policy is an asymptotically optimal decision-making approach: in each discrete step of the game, the next-best state is selected based on its value.

The ''value_iteration_policy'' function has the following prescription, which follows the prescription of ''greedy_policy'' from [[courses:b4m36uir:hw:task11|Task11 - Greedy policy in pursuit-evasion]]:

<code python>
def value_iteration_policy(self, gridmap, evaders, pursuers):
    """
    Method to calculate the value-iteration policy action

    Parameters
    ----------
    gridmap: GridMap
        Map of the environment
    evaders: list((int,int))
        list of coordinates of evaders in the game (except the player's robots, if the player is an evader)
    pursuers: list((int,int))
        list of coordinates of pursuers in the game (except the player's robots, if the player is a pursuer)
    """
</code>

The purpose of the function is to internally update the ''self.next_robots'' variable, which is a list of ''(int, int)'' robot coordinates, based on the current state of the game, given the ''gridmap'' grid map of the environment and the player's role ''self.role''. The player is given the list ''evaders'' of all evading robots in the game other than the player's own robots, and the list ''pursuers'' of all pursuing robots in the game other than the player's own robots. I.e., the complete set of robots in the game is given as the union of ''evaders'', ''pursuers'', and ''self.robots''.\\
During the gameplay, each player is asked to update their intention for the next move, coded in the ''self.next_robots'' variable, by calling the ''calculate_step'' function. Afterward, the step is performed by calling the ''take_step'' function, and in each step the game checks whether the move complies with the rules of the game. The game ends after a predefined number of steps or when all the evaders are captured.

The number of players and robots is fixed for this task. There will be two players in the game: one player with a single evading robot and one player with two pursuing robots.

In value iteration, the values of the individual game configurations may be stored in the ''self.values'' variable, which is either calculated from scratch or loaded from a file if the policy already exists. The provided code for loading the value-iteration policy may be modified; however, the code shall use the ''pickle'' library for saving and loading the data to and from the ''.policy'' files (see the caching sketch at the end of this page).

===Evaluation===

The code can be evaluated using the following set of game scenarios.\\
{{ :courses:b4m36uir:hw:games.zip | Additional Game Scenarios}}\\
The evaluation code is extended with the following games:

<code python>
games = [("grid", "games/grid_6.game"),
         ("grid", "games/grid_7.game"),
         ("grid", "games/grid_8.game"),
         ("pacman_small", "games/pacman_small_5.game"),
         ("pacman_small", "games/pacman_small_6.game")]
</code>

Note, you can easily generate new game setups by modifying the ''.game'' files accordingly.
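For illustration, the following is a minimal sketch of how the values stored in ''self.values'' might be computed by minimax value iteration over the joint configurations ''(evader, pursuer_1, pursuer_2)''. The helper name ''compute_values'', the accessors ''free_cells'' and ''neighbors'', and the discount factor are assumptions, not part of the provided template; adapt them to the actual ''GridMap'' interface from the resource files. The alternating max-min backup is also a common simplification of the simultaneous-move game (an exact solution would require mixed strategies).

<code python>
import itertools

def compute_values(gridmap, gamma=0.95, eps=1e-4):
    """Hypothetical helper -- not part of the provided template.

    Computes a value for every joint configuration
    (evader, pursuer_1, pursuer_2): the evader maximizes the
    discounted time to capture, the pursuers minimize it.
    """
    cells = list(gridmap.free_cells())    # assumed accessor for traversable cells
    # staying put is assumed to be a legal move here
    moves = {c: list(gridmap.neighbors(c)) + [c] for c in cells}
    values = {s: 0.0 for s in itertools.product(cells, repeat=3)}

    while True:
        delta = 0.0
        for state in values:
            e, p1, p2 = state
            if e == p1 or e == p2:        # evader captured: terminal, value 0
                continue
            # Evader picks the move maximizing the value; the pursuers
            # respond with the joint move minimizing it.
            new_value = max(
                min(1.0 + gamma * values[(ne, np1, np2)]
                    for np1 in moves[p1] for np2 in moves[p2])
                for ne in moves[e]
            )
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < eps:                   # stop once the backups have converged
            break
    return values
</code>

With such values, an evading player would move to the neighboring cell that maximizes the value of the resulting configuration, while a pursuing player would pick the joint move that minimizes it. Note that the state space grows cubically with the number of free cells, which is why precomputing and storing the values pays off.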
In the upload system, the students' solutions are tested against the teacher's ''RANDOM'' and ''GREEDY'' policy players. Note that the calculation of the ''VALUE_ITERATION'' policy is computationally expensive; therefore, the time for running the evaluation is limited to 10 minutes.
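Because of this time limit, it pays off to compute the values for each map only once and cache them in a ''.policy'' file, as the assignment requires. The following is a minimal sketch of such ''pickle''-based caching; ''map_name'' and ''compute_values'' are illustrative names, not part of the provided template.

<code python>
import os
import pickle

def load_or_compute_values(self, gridmap, map_name):
    """Hypothetical caching wrapper (illustrative names throughout).

    Reuses a stored '.policy' file when one exists; otherwise the
    expensive value iteration runs once and its result is pickled
    for subsequent runs.
    """
    path = map_name + ".policy"
    if os.path.isfile(path):
        with open(path, "rb") as f:              # load the precalculated values
            self.values = pickle.load(f)
    else:
        self.values = compute_values(gridmap)    # see the sketch above
        with open(path, "wb") as f:              # store the values for the next run
            pickle.dump(self.values, f)
    return self.values
</code>

Remember that the ''.policy'' files must not be included in the submitted archive.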