====== Assignment 2 - Wumpus World ======

===== News =====

  * May 28, 2017: Top five performers are Bouška (562), Eichler (475), Marek (463), Drtina (459), and Dlask (457).
  * May 18, 2017: Automatic evaluation for Task 2 and Task 3 is working now. Note that the map used for evaluation is different from the one used in the solution template.
  * May 9, 2017: Automatic evaluation for Task 1 is working now. Note that the map used for evaluation is different from the one used in the solution template.
  * May 9, 2017: If both the value of the initial state and the average reward obtained from simulation (use at least 100 simulation runs to estimate the average reward) are roughly 68, then your solution is most likely correct and you can upload it to the upload system.
  * May 5, 2017: Due to some technical issues, automatic evaluation for Task 1 is not working yet in the upload system. We are currently resolving the issue and will let you know once the automatic evaluation starts working.

===== FAQ =====

Q: What does the testing map look like?

A: It has the same size as the public map and the same gold and obstacle ratios. It only differs in the seed used to generate the map.

Q: What average utility do I need to achieve to get 7 points?

A: You can get 1, 4, or 7 points in Task 2 and Task 3. The thresholds are identical in both tasks. If you achieve avg. utility > 0, you get 1 point. If you achieve avg. utility > 250, you get 4 points. If you achieve avg. utility > 385, you get 7 points.

===== Description =====

Implement a decision-making routine for an agent in a Wumpus world. The assignment consists of three tasks that will be handed out in parts.

^ Task ^ Handed out ^ Deadline ^ Max. points ^
| Task 1 | May 2, 2017 | May 13, 2017 | 6 |
| Task 2 | May 11, 2017 | May 25, 2017 23:59 | 7 |
| Task 3 | May 11, 2017 | May 25, 2017 23:59 | 7 |

The assignment is considered successfully completed if you get at least 10 points.

If you have any questions, you can use the [[https://cw.felk.cvut.cz/forum/thread-3465.html|Assignment 2 Discussion Forum]].

===== Tasks =====

=== Task 1 (Value Iteration) ===

Implement decision making for an agent in the Wumpus world environment using the value iteration algorithm. The agent must act optimally over a finite horizon h=200. For more info see e.g. {{http://www.cs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf}}, page 8.

We will call wumpus.Agent.nextStep() to execute the policy. Any auxiliary code should be in the ''wumpus.agent'' package. To run the simulation, execute the ''WumpusWorldCreator'' class.

{{:courses:a4m36pah:assignments:wumpus-task1.zip|Solution template}} (updated May 2, 14:38 -- removed Maven for better compatibility)

Your IDE must be configured to use the jar file in ''/lib'' as a dependency.

== Implementation ==

You can use ''WorldModel.getTransitions()'' to obtain all transitions (together with their probabilities and the rewards received) that may occur when a particular action is applied in a particular state.
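As a rough illustration of how ''getTransitions()'' can drive the backups, here is a minimal finite-horizon value iteration sketch. It is not the reference implementation: the ''Transition'' type with its ''probability'', ''reward'' and ''successor'' fields, the ''getTransitions(x, y, a)'' signature, and the ''cols''/''rows''/''isFree()'' names are assumptions for illustration only; consult the classes in ''/lib'' for the real API.

<code java>
// Finite-horizon value iteration sketch for h = 200 steps.
// ASSUMPTIONS: Transition (fields probability, reward, successor),
// the getTransitions(x, y, a) signature, cols/rows and isFree()
// are illustrative names -- adapt them to the actual API in /lib.
void computePolicy() {
    float[][] v = new float[cols][rows];            // V_0(s) = 0 for all states
    Action[][] policy = new Action[cols][rows];

    for (int k = 0; k < 200; k++) {                 // one backup per remaining step
        float[][] vNext = new float[cols][rows];
        for (int x = 0; x < cols; x++) {
            for (int y = 0; y < rows; y++) {
                if (!isFree(x, y)) continue;        // skip obstacles and terminal cells
                float best = Float.NEGATIVE_INFINITY;
                for (Action a : Action.values()) {
                    float q = 0f;
                    // Expected immediate reward plus value of the successor state.
                    for (Transition t : WorldModel.getTransitions(x, y, a)) {
                        q += t.probability * (t.reward + v[t.successor.x][t.successor.y]);
                    }
                    if (q > best) {
                        best = q;
                        policy[x][y] = a;
                    }
                }
                vNext[x][y] = best;
            }
        }
        v = vNext;
    }
    DebugVis.setStateValues(v);     // inspect via the State-Value layer described below
    DebugVis.setPolicy(policy);     // inspect via the Policy layer described below
}
</code>

Note that a finite-horizon optimal policy is in general non-stationary, so you may want to keep a separate policy array for each number of remaining steps rather than only the last one. The ''v'' and ''policy'' arrays match the formats accepted by the debug visualization layers described next, which makes it easy to check the result visually.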
== Debug Visualization ==

In order to assist you with debugging, we have prepared three visualization layers:

  * The ''State-Value'' layer is activated by pressing “s” and shows a float value for each cell of the environment. The data to be visualized can be set using ''DebugVis.setStateValues(float[][] stateValues)''.
  * The ''State-Action-Value'' layer is activated by pressing “a” and shows a float value for each move from each cell. The data to be visualized can be set using ''DebugVis.setStateActionValues(float[][][] stateActionValues)''. The ''stateActionValues'' array has three dimensions:
    * x-coordinate of the cell
    * y-coordinate of the cell
    * direction (0 = NORTH, 1 = SOUTH, 2 = EAST, 3 = WEST)
  * The ''Policy'' layer is activated by pressing “p”. It shows an arrow indicating the optimal action at each state (cell). The data to be visualized can be set using ''DebugVis.setPolicy(Action[][] policy)''.

=== Task 2 and Task 3 (MCTS) ===

Implement decision making for an agent in the Wumpus world environment using a Monte Carlo tree search algorithm. The agent should attempt to maximize the cumulative reward over the given horizon. For more info about MCTS in the context of MDP planning see e.g. {{https://courses.cs.washington.edu/courses/cse573/12au/slides/07-mcp.pdf}}, slides 19-72. There are many different flavors of Monte Carlo planning techniques for MDPs and you can choose any of them. For example, you can try single-state Monte Carlo with rollouts of a random or other baseline policy, sparse sampling, adaptive Monte Carlo tree search with a UCB or epsilon-greedy exploration-exploitation policy, or any other instantiation of the THTS algorithmic scheme.

Please download the {{:courses:a4m36pah:assignments:wumpus-task23.zip|Solution template}}

  * **Updated May 17, 2017.** Bug fix: WorldModel.performAction() returned samples non-deterministically even when it was given a random number generator with a fixed seed, due to the non-deterministic iteration order of HashSet in Java. Now the method returns samples deterministically.
  * Updated May 15, 2017. Bug fix: the Agent object was not reinitialized before a simulation run. Now a new ''Agent'' object is constructed at the beginning of each simulation run.

Your IDE must be configured to use the jar file in ''/lib'' as a dependency. To run the simulation, execute the ''Task2Creator'' and ''Task3Creator'' classes for Task 2 and Task 3, respectively.

Your task is to implement an action-selection mechanism within the wumpus.agent.Agent.nextStep() method such that the agent attempts to maximize the sum of rewards received in the Wumpus world over a horizon of 50 steps in Task 2 and 100 steps in Task 3. The solution should consist of the wumpus.agent.Agent class and any other supporting classes, which however need to be in the ''wumpus.agent'' package or some subpackage of ''wumpus.agent''. The simulator will call wumpus.agent.Agent.nextStep() at each step to obtain the action that the agent executes in the current state. The agents for Task 2 and Task 3 can be identical or different, whatever you find more appropriate.

In contrast to Task 1, in both Task 2 and Task 3 the simulation continues when the robot reaches gold. On top of that, in Task 3 there are several wumpuses moving randomly in the world, depicted as red circles. When a wumpus steps on the same cell as the agent, the agent dies (reward: -99) and the simulation ends.

== Implementation ==

You can use ''WorldModel.performAction(s, a, rnd)'' to simulate the outcome of action ''a'' being executed in state ''s'' using the random number generator ''rnd''. The random number generator should be initialized at the beginning of each simulation run as ''rnd = new Random(seed)''. If ''seed'' is a fixed constant, then the random number generator will generate the same sequence of random numbers on each program execution, which is handy for debugging.
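To illustrate how ''performAction'' can be used inside ''nextStep()'', here is a minimal sketch of single-state Monte Carlo with random rollouts (one of the flavors listed above). It is only a sketch under stated assumptions: the ''Outcome'' type with ''state'' and ''reward'' fields, the ''State'' type, the ''nextStep'' parameters, and the ''isTerminal()'' helper are illustrative names; check the actual types in the solution template.

<code java>
// Single-state Monte Carlo with random rollouts -- an illustrative sketch only.
// ASSUMPTIONS: Outcome (fields state, reward), State, the nextStep parameters
// and isTerminal() are made-up names; adapt them to the template's real API.
private final Random rnd = new Random(42);          // fixed seed, created once per simulation run

public Action nextStep(State s, int stepsRemaining) {
    Action best = Action.NORTH;
    double bestAvg = Double.NEGATIVE_INFINITY;

    for (Action a : Action.values()) {
        double sum = 0.0;
        int rollouts = 200;                         // tune so one call fits the per-step time limit
        for (int i = 0; i < rollouts; i++) {
            Outcome first = WorldModel.performAction(s, a, rnd);
            double total = first.reward;
            State cur = first.state;
            // Complete the rollout with a uniformly random baseline policy.
            for (int t = 1; t < stepsRemaining && !isTerminal(cur); t++) {
                Action ra = Action.values()[rnd.nextInt(Action.values().length)];
                Outcome o = WorldModel.performAction(cur, ra, rnd);
                total += o.reward;
                cur = o.state;
            }
            sum += total;
        }
        double avg = sum / rollouts;
        if (avg > bestAvg) {
            bestAvg = avg;
            best = a;
        }
    }
    return best;
}
</code>

An adaptive scheme such as UCT spends the rollout budget more efficiently than this uniform allocation, but the interaction with ''performAction'' and the random number generator stays the same.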
== Debug Visualization ==

  * The ''State-Label'' layer is activated by pressing “l” and shows the coordinates of each cell of the environment.

== Runtime limits ==

Automatic evaluation scripts cannot run longer than 15 minutes. There will be 15 simulations for Task 2 (with 50 decision points during each simulation) and 10 simulations for Task 3 (with 100 decision points during each simulation). Thus, please limit the time to make a decision at each step of the simulation to:

  * 1.15 seconds for Task 2
  * 0.85 seconds for Task 3

== Performance expectations ==

In order to get full points for the assignment, your agent should be "reasonably" efficient. If your agent achieves an average utility > 500 in Task 2 and the algorithm also generalizes to different maps, you can expect to get 7 points for Task 2 by AE. Similarly, if your algorithm achieves utility > 600 in Task 3, you can expect to get 7 points for Task 3 by AE.

== Bonus points ==

The top five solutions for Task 3 (in terms of utility achieved in AE) will get 4 extra bonus points.

===== Submitting =====

Submit your solution through the [[https://cw.felk.cvut.cz/upload/|CW Upload system]] before the deadline as a single zip file. At the top level of the zip file there should be a ''src'' directory that contains the source code. That is, the contents of the archive should have the following structure:

  * src
    * wumpus
      * agent
        * ... (your Java source files)
      * world
        * ... (our Java source files)
      * WumpusWorldCreator.java

===== Environment =====

  * The robot can execute the following actions with stochastic effects (class ''Action''):
    * NORTH -- actual effect: 80% NORTH, 10% EAST, 10% WEST
    * SOUTH -- actual effect: 80% SOUTH, 10% EAST, 10% WEST
    * EAST -- actual effect: 80% EAST, 10% NORTH, 10% SOUTH
    * WEST -- actual effect: 80% WEST, 10% NORTH, 10% SOUTH
  * If the robot executes a move action that would end up in an obstacle, it bounces back to its current position.
  * The environment is an MxN matrix, where the first index represents columns (x-coordinate) and the second index represents rows (y-coordinate). The columns (rows) are indexed starting from 0, i.e. we have columns (rows) 0, 1, ..., M-1 (N-1).
  * Each cell can contain (class ''CellContent''):
    * EMPTY
    * OBSTACLE
    * GOLD
    * PIT
  * The simulation finishes if the agent reaches the gold (only in Task 1), falls into a pit, or after h steps.

==== Rewards/Penalties ====

  * -1 for each move action
  * -100 + (-1) for an action that results in a pit
  * -100 + (-1) when a wumpus moves to the same cell as the agent
  * 100 + (-1) for an action that reaches the gold

===== Tips & Tricks =====

  * Use the ''javax.vecmath.Point2i'' class to represent an (x,y) pair of integers.
  * You can speed up/slow down the simulation on line 101 in ''WumpusWorldCreator.java'' by changing the third parameter of the simulate method, which represents the delay between two actions in milliseconds.
  * The numerical value of an action can be obtained as ''Action.ordinal()'', e.g. ''Action.NORTH.ordinal() == 0''.
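For instance, ''Action.ordinal()'' is a convenient index into the third dimension of the ''stateActionValues'' array used by the State-Action-Value debug layer, assuming the enum declaration order matches the direction indices listed above (0 = NORTH, 1 = SOUTH, 2 = EAST, 3 = WEST). The ''cols''/''rows'' names and the zero placeholder values in the snippet below are illustrative only.

<code java>
// Illustrative only: fill the State-Action-Value debug layer, indexing actions
// by Action.ordinal(). Replace the 0f placeholder with your computed Q(s, a).
float[][][] q = new float[cols][rows][Action.values().length];
for (int x = 0; x < cols; x++) {
    for (int y = 0; y < rows; y++) {
        for (Action a : Action.values()) {
            q[x][y][a.ordinal()] = 0f;   // placeholder value
        }
    }
}
DebugVis.setStateActionValues(q);        // inspect by pressing "a" in the simulator
</code>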