Task 3: Playing Two-Player Game with MCTS

Five-in-a-row

Five-in-a-row (also called Gomoku) is a simple popular game played on a square grid. Players alternate in placing their mark (x or o) on an empty cell. The winner is the first player to get an unbroken row of exactly five marks horizontally, vertically, or diagonally.

An example terminal state – the winning trace is highlighted by uppercase letters X:

.Xo.....
oXxx....
oXoo.x..
.Xo.....
.Xo.....
.....o..
..x.....
........

Your Task

Download task3_mcts.zip.

Your task is to implement MCTS algorithm to play the game efficiently within the time limit of 1 second per move. Use a custom heuristic pseudo-random simulation. Add a preference for the UCB bandit to explore more promising moves, based on the rules of the game.

In student.py, you should implement the following:

class MCTSBot:
    def __init__(self, play_as: int):
        self.play_as = play_as
 
    def play_action(self, state: State):
        # TODO: implement MCTS bot according to the assignment task.
        return np.random.choice(state.legal_actions())

You are given an example implementation of the game, code for interaction with the opponent player and a sample random bot (as above). From standard input, the program receives the size of the game board, the player to play as, and an alternating sequence of actions of the opponent after your bot makes and prints an action.

The evaluation keeps track of the current board, so it is important that you do not change code outside of MCTSBot in your submission, to make sure yours and opponent's play is indeed consistent.

There is a time limit of 1 seconds per move. To make sure that you fit within the time limit, you can track the elapsed time with Python's package time and the function time.time(). The time limit is strict – your program will be shut down after the limit.

It is up to you how you solve the time management, but if you do it simply as printing the action after some time is over, it is preferable that you set an internal time limit that is below the hard deadline (maybe 0.9sec), so that there is still large enough window of time before the deadline.

Value function hint: As the game outcomes are +-1 for win/loss, when you use a value function, make sure that the heuristic values are within the open interval of (-1, 1) so the algorithm has preference for choosing real game outcomes rather than the heuristic ones, given a choice between the two in a search tree. This can be done using a function that “saturates” outputs at -1 and 1 for any inputs x, such as +- 2*sigmoid(x)-1

Evaluation

Upload your solution (only student.py) to Brute, where it will be evaluated automatically.

The algorithm will play a number of matches against a portfolio of opponents of various strength.

Based on the score against the opponent, you will receive 1 or 2 points, if the score is higher than 2 or 5 respectively.

If timeout occurs more than 2 times against an opponent, you will receive zero points for that scenario. The matches with the timeout have a zero score. In case any inconsistent play is detected or an error occurs, you will receive zero points for that scenario.

courses/b4b36zui/tasks/task3-2021.txt · Last modified: 2021/05/10 18:59 by sustrmic