====== T4b-mcts - Monte-Carlo Tree Search policy in pursuit-evasion ======

The main task is to implement a Monte-Carlo Tree Search (MCTS) policy for the robotics pursuit-evasion game.

|**Deadline**        | 3. January 2021, 23:59 PST |
|**Points**          | 6 |
|**Label in BRUTE**  | t4b-mc |
|**Files to submit** | ''Player.py'' |
|**Resources**       | resources from the greedy assignment |

\\

===Assignment===

In the file ''player/Player.py'', implement the MCTS policy decision making for the pursuit-evasion game in the function ''monte_carlo_policy''. The MCTS policy is a heuristic search algorithm for decision-making problems. In each discrete step of the game, the next-best state is selected based on simulated playouts.

The ''monte_carlo_policy'' function has the following prescription:

<code python>
def monte_carlo_policy(self, gridmap, evaders, pursuers):
    """
    Method to calculate the monte carlo tree search policy action

    Parameters
    ----------
    gridmap: GridMap
        Map of the environment
    evaders: list((int,int))
        list of coordinates of evaders in the game (except the player's robots, if he is evader)
    pursuers: list((int,int))
        list of coordinates of pursuers in the game (except the player's robots, if he is pursuer)
    """
</code>

The purpose of the function is to internally update the ''self.next_robots'' variable, which is a list of ''(int, int)'' robot coordinates, based on the current state of the game, given the ''gridmap'' grid map of the environment and the player's role ''self.role''. The player is given the list ''evaders'' of all evading robots in the game other than their own robots and the list ''pursuers'' of all pursuing robots in the game other than their own robots. I.e., the complete set of robots in the game is the union of ''evaders'', ''pursuers'', and ''self.robots''.\\

During the gameplay, each player is asked to update their intention for the next move, coded in the ''self.next_robots'' variable, by calling the ''calculate_step'' function. Afterward, the step is performed by calling the ''take_step'' function, and the game then checks each step for compliance with the rules of the game. The game ends after a predefined number of steps or when all the evaders are captured. In MCTS, each player has a predefined time for making the decision for a single robot, given in the ''self.timeout'' variable.

\\

===Parts of MCTS===

In the **selection** phase, you should traverse the tree from the root, selecting nodes using the UCT formula $\text{UCT}(i) = \frac{w_i}{n_i} + c\sqrt{\frac{\log{N}}{n_i}}$, where $w_i$ is the number of wins from the node after the $i$th move, $n_i$ is the number of simulations from the node after the $i$th move, $N$ is the total number of simulations from the node, and $c$ is the exploration constant. The first term is the value of the considered move based on the simulations, and the second term is the exploration factor, which is higher for moves with fewer simulations. You then continue with the move that has the highest UCT value.

Note that since we have simultaneous moves, we have to either pretend that one player moves first and the other one moves second (using the opponent as the player that moves second creates a defensive solution), or select moves for both players in each node using two separate UCT computations over the two sets of moves and then join the moves together. In this case, it is also easier to check for the win in the game.

The selection phase ends when we reach a node that has a child with zero simulations, or when we reach the tree depth limit. If we have reached the tree depth limit, we skip **expansion**; otherwise, we add one of the children without simulations to the tree.
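For illustration only, below is a minimal sketch of how the UCT value can be computed and used to pick a child during selection. The ''Node'' class and its attributes (''state'', ''wins'', ''visits'', ''children'', ''parent'') are assumed names that are not part of the provided resources, and the sketch treats each child as a single (joint) move, i.e., it does not handle the simultaneous-move variants discussed above; adapt it to your own tree representation.

<code python>
import math
import random


class Node:
    """Illustrative tree node -- attribute names are assumptions, adapt to your own code."""
    def __init__(self, state, parent=None):
        self.state = state       # game configuration represented by this node
        self.parent = parent
        self.children = []       # child nodes, one per considered move
        self.wins = 0.0          # accumulated reward of simulations through this node
        self.visits = 0          # number of simulations through this node


def uct_value(child, parent_visits, c=1.4):
    """UCT(i) = w_i / n_i + c * sqrt(log(N) / n_i)."""
    if child.visits == 0:
        return float('inf')      # unvisited children are selected first
    exploitation = child.wins / child.visits
    exploration = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploitation + exploration


def select_child(node, c=1.4):
    """Pick the child with the highest UCT value, breaking ties randomly."""
    return max(node.children, key=lambda ch: (uct_value(ch, node.visits, c), random.random()))
</code>

An unvisited child gets an infinite UCT value, so every child is tried at least once before the exploitation term starts to matter.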
In the **simulation** phase, we perform a simulation from the selected/expanded node using some heuristic strategy. A random strategy is sometimes used; in our case, the epsilon-greedy strategy is a good heuristic for the simulations. When the simulation ends, either by reaching the maximum simulation depth or by a catch, we have to **backpropagate** the values through the visited nodes of the tree, incrementing the simulation count and, if the simulation ended in a win, also the number of wins. It is also possible to use discounting to prioritize moves that end the game faster (see the sketch at the bottom of this page).

A few notes to improve the solution:
  * Detecting identical configurations in the tree, instead of treating them as distinct nodes, can improve the precision of the action valuation. A possible solution is to hash the states in the tree.
  * Limiting the depth of the tree can improve the solution, as it can increase the number of rollouts you can make.
  * For the rollout, use the $\epsilon$-greedy policy you implemented.
  * Setting $c$ is important. Check the number of trials for each move, for example in the root node, and adjust $c$ so that the distribution of trials is neither close to uniform nor heavily biased towards a single move.
  * Start with a configuration where the evader has pursuers right next to it and check whether it works (this verifies that you correctly detect catches in the search); after solving these trivial cases, try to evade or catch from a distance.

\\

===Resources===

MCTS ([[https://en.wikipedia.org/wiki/Monte_Carlo_tree_search|Monte Carlo tree search]])\\
UCB1 ([[http://ggp.stanford.edu/readings/uct.pdf|Bandit based Monte-Carlo Planning]])\\
Dealing with simultaneous moves ([[http://storage.kghost.de/cig_proc/full/paper_61.pdf|Monte Carlo Tree Search Variants for Simultaneous Move Games]])\\
RAVE ([[http://www.cs.utexas.edu/~pstone/Courses/394Rspring13/resources/mcrave.pdf|Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go]])

\\

===Evaluation===

Your agents should be able to catch the ''GREEDY'' evader in all the provided game scenarios, given enough steps, and in the scenarios where your agent is the evader, it should be able to escape. You can easily generate new game setups by modifying the ''.game'' files accordingly. In the upload system, the students' solutions are tested against the teacher's ''GREEDY'' policy players on private game scenarios.

{{:courses:b4m36uir:hw:example_evader_greedy.gif?450|Greedy evader}} {{:courses:b4m36uir:hw:example_evader_mc.gif?450|Monte Carlo evader}}

Comparison of the Monte Carlo evader and the greedy evader against greedy pursuers.

{{:courses:b4m36uir:hw:example_pursuers_greedy.gif?450|Greedy pursuers}} {{:courses:b4m36uir:hw:example_pursuers_mc.gif?450|Monte Carlo pursuers}}

Comparison of the Monte Carlo pursuers and the greedy pursuers against the greedy evader.
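As referenced in the backpropagation paragraph above, the following is a minimal sketch of the backpropagation step with discounting, reusing the illustrative ''Node'' class from the selection sketch. The ''won''/''rollout_steps'' arguments and the ''gamma'' value are assumptions; note that with discounting, ''wins'' becomes an accumulated (discounted) reward rather than a plain win count.

<code python>
def backpropagate(node, won, rollout_steps, gamma=0.95):
    """Propagate one simulation result from the expanded node back to the root.

    won:           True if the playout ended in a win for the searching player
    rollout_steps: number of game steps simulated before the playout ended
    gamma:         discount factor < 1; wins reached in fewer steps get a higher value
    """
    reward = (gamma ** rollout_steps) if won else 0.0
    while node is not None:
        node.visits += 1
        node.wins += reward
        reward *= gamma   # each level closer to the root is one game step further from the end
        node = node.parent
</code>

Setting ''gamma'' to 1 recovers the plain win count; lower values bias the search towards the fastest catches or escapes.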