====== T4a-oracle - Adversarial Path Planning ======

An attacker tries to steal an item from a building, while the defender's goal is to detect and catch the attacker using installed sensors. The task is to implement an algorithm that finds the optimal strategies for both the attacker and the defender.

|**Due date** | January 13, 2024, 23:59 PST |
|**Deadline** | January 13, 2024, 23:59 PST |
|**Points** | 7 |
|**Label in BRUTE** | t4a-oracle |
|**Evaluation** | upload a zip containing ''sensor_oracle.py'', ''planning_oracle.py'', and ''double_oracle.py'' to BRUTE |
|**Resources** | {{ :courses:uir:hw:t4a.zip | B4M36UIR_t4a_resource_pack }} |

\\
===Assignment===

The building is represented as a grid. In the grid, there are free cells (white), walls (grey), entrances (green), and goals (yellow). The attacker's objective is to get to a goal. The attacker can use any entrance to enter the building and then leave it through the same or a different entrance. The defender has a list of sensors (colored dots in the grid). The choice of a single sensor represents an action of the defender. The sensors are part of the environment, i.e., the defender cannot change their positions. The defender only selects one sensor to activate, and this sensor remains active throughout the whole game with no possibility of changing this choice.

We model this scenario as a two-player zero-sum game in which the actions of the attacker are all possible paths in the grid and the actions of the defender are all choices of a single sensor from the list. Thus, the attacker's mixed strategy is a probability distribution over a set of paths, and the defender's mixed strategy is a probability distribution over the sensors. The task is to compute a Nash equilibrium of this game.

{{:courses:uir:hw:example_gallery.png?600|}}

\\
===Planning Oracle===

The planning oracle selects the best path given a probability distribution over the sensors. To this end, you will implement two functions in ''planning_oracle.py''. First, ''cost_function'' returns a matrix of costs, one for each cell.

<code python>
def cost_function(gallery, sensor_distribution):
    """
    Method to calculate the cost of the cells in the grid given the current sensor distribution

    Parameters
    ----------
    gallery: Gallery
        Map of the environment with information about sensors of given size
    sensor_distribution: list(list(double), list(int))
        List of two lists. In the second list are the sensors, and the first list contains
        the corresponding sensors' probabilities.

    Returns
    -------
    cost: Matrix of the same size as the map in the gallery
        Matrix filled by the cost of each grid cell in the environment.
        (Costs are set in constants FREE_COST and SENSOR_COST. The cost of accessing a cell is
        FREE_COST + SENSOR_COST * probability of a sensor being active in the cell.)
    """
</code>

The first input, ''gallery'', encodes the environment. You can access ''gallery.map'', a (''y_size, x_size'')-sized matrix where ''0'' represents a free cell, ''1'' a wall, ''2'' an entrance, and ''3'' a goal. You can find this encoding in ''constants.py''. Gallery also has the properties ''gallery.walls'', ''gallery.entrances'', and ''gallery.goals'', where the respective cells are listed. This redundant representation is provided for your convenience. Finally, ''gallery.sensor_coverage'' is a list of lists, where the main list has length ''num_sensors'' and the sublists can be of any size. Each sublist represents one sensor and contains all the cells the sensor covers.

The second input, ''sensor_distribution'', consists of two lists of the same size. The second list contains the indexes of the sensors, and the first list contains the probability of the corresponding sensor in the second list. Your task is to create a matrix containing the cost of each cell, which is given by ''FREE_COST + SENSOR_COST * probability'' of a sensor being active in the cell.
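A minimal sketch of how ''cost_function'' could look is given below. It assumes that ''numpy'' is available, that ''FREE_COST'' and ''SENSOR_COST'' can be imported from ''constants.py'', and that the cells in ''gallery.sensor_coverage'' are ''(y, x)'' tuples; it is meant only as an illustration of the idea, not as a reference solution.

<code python>
import numpy as np

from constants import FREE_COST, SENSOR_COST  # assumed to be provided by the resource pack


def cost_function(gallery, sensor_distribution):
    """Sketch: per-cell cost as FREE_COST plus the expected sensor penalty."""
    probabilities, sensors = sensor_distribution
    # start from the base cost of entering any cell
    cost = np.full(np.shape(gallery.map), FREE_COST, dtype=float)
    for probability, sensor in zip(probabilities, sensors):
        # every cell covered by this sensor becomes more expensive,
        # proportionally to the probability of the sensor being active
        for y, x in gallery.sensor_coverage[sensor]:
            cost[y, x] += SENSOR_COST * probability
    return cost
</code>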
The ''best_plan'' function takes ''gallery'' and the previously generated ''cost_matrix'' as inputs. In this function, compute the cheapest path from an entrance to a goal and back to some entrance, given the costs of the cells. The function should return the path you have found together with its value.

<code python>
def best_plan(gallery, cost_matrix):
    """
    Method to calculate the best path to the goal and back given the current cost function

    Parameters
    ----------
    gallery: Gallery
        Map of the environment with information about sensors of a given size
    cost_matrix: Matrix of the same size as the map in the gallery
        Matrix capturing the cost of each cell

    Returns
    -------
    path: list(tuple(int, int))
        List of coordinates visited on the best path
    value: double
        Value of the best path given the cost matrix
    """
</code>
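One way to implement ''best_plan'' is to run Dijkstra's algorithm on the weighted grid: once from the entrances to the goals, and once from the reached goal back to the entrances, joining the two halves at the goal. The sketch below only illustrates this idea and rests on several assumptions that you should check against the resource pack: 4-connected movement, coordinates stored as ''(y, x)'' tuples, the cost of a step being the cost of the entered cell, and walls encoded as ''1'' in ''gallery.map''.

<code python>
import heapq


def dijkstra(gallery, cost_matrix, sources, targets):
    """Cheapest path from any source cell to any target cell on a 4-connected grid.

    Assumes entering cell (y, x) costs cost_matrix[y][x] and that walls (value 1 in
    gallery.map) are impassable; the cost of the starting cell is not counted.
    """
    y_size, x_size = len(gallery.map), len(gallery.map[0])
    targets = {tuple(t) for t in targets}
    best = {tuple(s): 0.0 for s in sources}
    previous = {tuple(s): None for s in sources}
    queue = [(0.0, tuple(s)) for s in sources]
    heapq.heapify(queue)
    while queue:
        value, cell = heapq.heappop(queue)
        if value > best[cell]:
            continue  # outdated queue entry
        if cell in targets:
            path = []  # reconstruct the path by following the predecessor links
            while cell is not None:
                path.append(cell)
                cell = previous[cell]
            return path[::-1], value
        y, x = cell
        for neighbor in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            ny, nx = neighbor
            if 0 <= ny < y_size and 0 <= nx < x_size and gallery.map[ny][nx] != 1:
                new_value = value + cost_matrix[ny][nx]
                if new_value < best.get(neighbor, float("inf")):
                    best[neighbor] = new_value
                    previous[neighbor] = cell
                    heapq.heappush(queue, (new_value, neighbor))
    return None, float("inf")


def best_plan(gallery, cost_matrix):
    """Sketch: entrance -> goal, then goal -> entrance, joined at the goal."""
    to_goal, value_in = dijkstra(gallery, cost_matrix, gallery.entrances, gallery.goals)
    back, value_out = dijkstra(gallery, cost_matrix, [to_goal[-1]], gallery.entrances)
    # the goal cell is shared by both halves, so drop it once when concatenating
    return to_goal + back[1:], value_in + value_out
</code>

Note that joining the two cheapest halves at the goal of the cheapest inbound path is a simplification; to be safe, you can iterate over all goals (or all entrance-goal combinations) and keep the cheapest complete round trip.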
\\
===Sensor Oracle===

The sensor oracle selects the best sensor against a given distribution over the attacker's paths. In addition, it evaluates the value of each sensor against that distribution. Your task is to implement ''path_evaluation'', which takes ''gallery'' and ''path_distribution'' as inputs. The ''gallery'' is the same as in the planning oracle; ''path_distribution'' is a list of two lists containing the probabilities of the attacker's paths and the respective paths. Compute the value of each sensor against the given paths as if that sensor were selected with probability 1.

<code python>
def path_evaluation(gallery, path_distribution):
    """
    Method to evaluate the cost of the path distribution for each sensor

    Parameters
    ----------
    gallery: Gallery
        Map of the environment with information about sensors of given size
    path_distribution: list(list(double), list(tuple(int, int)))
        List of two lists. The paths are in the second list, and in the first list are
        the probabilities corresponding to those paths.

    Returns
    -------
    sensor_counts: list(double)
        List of the value of each sensor against the given path distribution
    """
</code>

Next, implement ''best_sensor'', which has the same inputs as ''path_evaluation''. The function should return the index of the best sensor and the value this sensor achieves against the given path distribution.

<code python>
def best_sensor(gallery, path_distribution):
    """
    Method to pick the best sensor against the current path distribution

    Parameters
    ----------
    gallery: Gallery
        Map of the environment with information about sensors of given size
    path_distribution: list(list(double), list(tuple(int, int)))
        List of two lists. The paths are in the second list, and in the first list are
        the probabilities corresponding to those paths.

    Returns
    -------
    sensor: int
        Index of the sensor which achieves the best value against the path distribution
    reward: double
        Value achieved by the best sensor
    """
</code>

\\
===Double Oracle Algorithm (DOA)===

To run the double oracle algorithm, you need to fill the whole utility matrix when adding new actions. To do so, implement ''pair_evaluation'', which takes ''gallery'', ''path'', and ''sensor'' as its inputs. The function returns the value of following the path while the given sensor is active.

<code python>
def pair_evaluation(gallery, path, sensor):
    """
    Method to evaluate the value of a path-sensor pair

    Parameters
    ----------
    gallery: Gallery
        Map of the environment with information about sensors of a given size
    path: list(tuple(int, int))
        List of coordinates forming the path.
    sensor: int
        Index of the sensor.

    Returns
    -------
    cost: double
        Cost of the path assuming the given sensor is active.
    """
    return 0
</code>

You will implement the DOA using the above oracles. Please consult paper [1] for details. In a nutshell, the DOA incrementally builds the action spaces of both players and computes Nash equilibria of smaller subgames of the original game.

<code python>
def double_oracle(gallery):
    """
    Method to compute the optimal strategies for the attacker and the defender using the
    double oracle algorithm and the oracles implemented in the previous steps

    Parameters
    ----------
    gallery: Gallery
        Map of the environment with information about sensors of a given size
    epsilon: double
        The distance between both players' best response values required to terminate the algorithm

    Returns
    -------
    sensor_strategy: list(double)
        Optimal strategy as a probability distribution over sensors
    sensors: list(int)
        List of sensors used as actions
    path_strategy: list(double)
        Optimal strategy as a probability distribution over paths
    paths: list(list(tuple(int, int)))
        List of all the paths used as actions
    """
</code>

The only input is the ''gallery'' for which you compute the strategies; ''epsilon'' is a small termination threshold. The output consists of both players' strategies, each split into a distribution part and an action part. The DOA works as follows (a sketch of this loop is shown below, after the linear program):

  - Pick an action for each player and compute the value of the pair of chosen actions, then construct a matrix game using the actions and the payoff.
  - Compute a Nash equilibrium of the matrix game using an LP.
  - Compute the best response (BR) of each player against the current strategy computed by the LP. Note that the BR is computed in the original game!
  - If the gap between the BR values is smaller than some small epsilon, end the algorithm and return the Nash equilibrium of the matrix game.
  - Otherwise, add the BRs to the current matrix game and repeat from step 2.

\\
The equilibrium of the matrix game can be computed by the following linear program.

\begin{align}
&\text{maximize} \quad U \\
&\text{subject to} \quad \sum_{a_1 \in A_1} x(a_1) u_1(a_1,a_2) \geq U \quad &\forall a_2 \in A_2 \\
& \quad \quad \quad \quad \sum_{a_1 \in A_1} x(a_1) = 1 \\
& \quad \quad \quad \quad x(a_1) \geq 0 \quad &\forall a_1 \in A_1
\end{align}

You will need an LP solver for this task. The upload system should work with Gurobi and CPLEX (it is extensively tested with Gurobi and should also work with CPLEX). A good Gurobi guide is available on the Combinatorial Optimization course pages: https://cw.fel.cvut.cz/b192/_media/courses/ko/01_gurobi.pdf. Make sure not to print information about each matrix solve, as BRUTE might not finish when the standard output gets too large. In Gurobi, this is done by ''model.setParam("OutputFlag", 0)''.
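The linear program above translates into a solver call almost one-to-one. The sketch below uses ''gurobipy'' and a hypothetical helper name ''solve_matrix_game''; it computes the maximin strategy of the row player (the maximizer) for a payoff matrix ''utility'' indexed as ''utility[row][column]''. Treat it as an illustration of how the LP can be written, not as a required interface.

<code python>
import gurobipy as gp
from gurobipy import GRB


def solve_matrix_game(utility):
    """Maximin mixed strategy of the row player and the value of the matrix game."""
    num_rows, num_cols = len(utility), len(utility[0])
    model = gp.Model("matrix_game")
    model.setParam("OutputFlag", 0)  # keep the standard output small for BRUTE
    x = model.addVars(num_rows, lb=0.0, ub=1.0, name="x")  # mixed strategy x(a_1)
    game_value = model.addVar(lb=-GRB.INFINITY, name="U")
    model.setObjective(game_value, GRB.MAXIMIZE)
    # the mixed strategy must guarantee at least U against every pure column action a_2
    for col in range(num_cols):
        model.addConstr(
            gp.quicksum(x[row] * utility[row][col] for row in range(num_rows)) >= game_value
        )
    model.addConstr(x.sum() == 1)  # x is a probability distribution
    model.optimize()
    return [x[row].X for row in range(num_rows)], game_value.X
</code>

The column player's equilibrium strategy can be obtained, for example, by solving the same LP on the negated and transposed matrix, or recovered from the duals of the per-column constraints.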
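Putting the pieces together, the double oracle loop may look roughly like the skeleton below. It reuses the oracles from the previous sections and the hypothetical ''solve_matrix_game'' helper above; the choice of the initial actions, the value of ''epsilon'', and the bookkeeping are arbitrary, so take it only as a sketch of the control flow, not as a reference solution.

<code python>
EPSILON = 1e-6  # arbitrary termination threshold for this sketch


def double_oracle(gallery):
    """Sketch of the double oracle loop built on top of the oracles above."""
    # step 1: start with one arbitrary action per player (assumes at least one sensor exists)
    sensors = [0]
    first_path, _ = best_plan(gallery, cost_function(gallery, [[1.0], [0]]))
    paths = [first_path]
    utility = [[pair_evaluation(gallery, paths[0], sensors[0])]]  # rows = sensors, columns = paths

    while True:
        # step 2: Nash equilibrium of the restricted matrix game
        sensor_strategy, _ = solve_matrix_game(utility)
        negated_transpose = [[-utility[r][c] for r in range(len(sensors))] for c in range(len(paths))]
        path_strategy, _ = solve_matrix_game(negated_transpose)  # attacker minimizes the cost

        # step 3: best responses in the original (unrestricted) game
        new_sensor, sensor_value = best_sensor(gallery, [path_strategy, paths])
        new_path, path_value = best_plan(gallery, cost_function(gallery, [sensor_strategy, sensors]))

        # step 4: terminate once both best responses (almost) agree on the game value
        if sensor_value - path_value < EPSILON:
            return sensor_strategy, sensors, path_strategy, paths

        # step 5: otherwise enlarge the subgame with the best responses and refill the matrix
        if new_sensor not in sensors:
            sensors.append(new_sensor)
            utility.append([pair_evaluation(gallery, path, new_sensor) for path in paths])
        if new_path not in paths:
            paths.append(new_path)
            for row, sensor in zip(utility, sensors):
                row.append(pair_evaluation(gallery, new_path, sensor))
</code>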
\\
==Example==

We will show the steps of the DOA on the following gallery.

{{:courses:uir:hw:DOgallery.png?600|}}

In the first step, the defender chooses the blue sensor (B) and the attacker selects the following path, indexed 0. This creates a utility matrix with a single element: 0.21.

{{:courses:uir:hw:Path1.png?600|}}

Trivially, the Nash equilibrium of this game is Sensors: [1], Paths: [1], and the best responses against these strategies are the orange sensor (O) against the path and the same path against the sensor. The value of the path is 0.21 and the value of the sensor is 40.21, so the gap between them is bigger than the small epsilon, and we add the new sensor to the matrix. We will show how these values are computed in the last step, where it is more complex. Now we repeat from step 2.

We compute the Nash equilibrium of the matrix:

^ ^ 0 ^
^ B | 0.21 |
^ O | 40.21 |

The equilibrium is Sensors: [0, 1], Paths: [1], and we compute the BRs. The best sensor against the path is still the blue one, which is already in our action pool, and the best path against the orange sensor is the following path, indexed 1, with a value of 0.21. We add the path to the action pool and compute the values to fill the matrix. Against the blue sensor, this path has a value of 40.21, and we repeat again.

{{:courses:uir:hw:Path2.png?600|}}

We compute the Nash equilibrium of the matrix

^ ^ 0 ^ 1 ^
^ B | 0.21 | 40.21 |
^ O | 40.21 | 0.21 |

which is mixed for the first time: Sensors: [0.5, 0.5], Paths: [0.5, 0.5]. The best sensor against the uniform combination of the paths can be either blue or orange; we choose blue, but it is already in the action pool. It has a value of 20.21 against the path combination. The new path against the combination of sensors is indexed 2, has a value of 0.37, and looks as follows:

{{:courses:uir:hw:Path3.png?600|}}

This creates the utility matrix

^ ^ 0 ^ 1 ^ 2 ^
^ B | 0.21 | 40.21 | 0.37 |
^ O | 40.21 | 0.21 | 0.37 |

Solving this game can produce many equilibria, which differ only in the probabilities of the sensors, so we pick one of them, say Sensors: [0.4, 0.6], Paths: [0, 0, 1]. The best responses against this strategy are the green sensor (G), which will be added to the action pool, and path 2, which is already present. The sensor value is 20.37 and the path value is 0.37, so the gap is still present. We produce a new utility matrix.

^ ^ 0 ^ 1 ^ 2 ^
^ B | 0.21 | 40.21 | 0.37 |
^ O | 40.21 | 0.21 | 0.37 |
^ G | 0.21 | 0.21 | 20.37 |

When we solve the LP, we get the equilibrium Sensors: [0.252, 0.252, 0.496], Paths: [0.25, 0.25, 0.5]. Against these strategies, the best sensor is any of the sensors with a value of 10.29, and the best path is any of the selected paths with a value of 10.29 (we might find a different path, but all best paths have the same value). The gap has disappeared, and we have a Nash equilibrium of the full game (which has a huge action space for the path planner, yet we only needed to generate three actions from this space).

\\
==Computing the Values==

First, we compute the value of a sensor. We pick the blue sensor as in the example. For each path, we count how many cells covered by this sensor the path crosses. For the blue sensor it is **zero** times for path 0, **four** times for path 1, and **zero** times for path 2. This gives values of 0.21, 40.21, and 0.37 for the sensor-path pairs. (Note that if the BR is already in our action space, we can reuse the values present in the utility matrix.) Then we take the dot product of these numbers with the path strategy and get ''0.21 * 0.25 + 40.21 * 0.25 + 0.37 * 0.5 = 10.29''.

Second, we compute the value of a path. We pick path 2, which crosses the blue sensor **zero** times, the orange sensor **zero** times, and the green sensor **twice**. This gives us the values 0.37, 0.37, and 20.37, and using the dot product with the sensor strategy we get ''0.37 * 0.252 + 0.37 * 0.252 + 20.37 * 0.496 = 10.29''.

\\
===Reference===

[1] McMahan, H. Brendan, Geoffrey J. Gordon, and Avrim Blum. "Planning in the presence of cost functions controlled by an adversary." Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003.