Sequential decision processes (3rd assignment)

Download the updated kuimaze package kuimaze_mdp.zip.

Your task will be to implement the functions find_policy_via_value_iteration(…) and find_policy_via_policy_iteration(…) with the following inputs:

find_policy_via_value_iteration(problem, discount_factor, epsilon)
find_policy_via_policy_iteration(problem, discount_factor)

where:

problem is the environment, an object of the type kuimaze.MDPMaze
discount_factor is a number from the range (0,1)
epsilon is the maximum permitted error for the value of each state (only in the case of value iteration)

The expected output is a dictionary where the key word is a tuple (x,y) and the value is the optimal action (only the accessible states, in cases where the key is a terminal state it is enough return None).

You implement the methods in the file mdp_agent.py and upload it to the Upload system. There is no need to alter any other file.

You can look at mdp_sandbox.py in the downloaded kuimaze.zip package. It shows the basic work with MDPMaze and you can use it for inspiration.

Timeout: each run of value/policy iteration in a given instance of a problem is limited to at most 30s.

Points and deadline

The deadline is again in the Upload system.

Evaluation is divided into: * Automatic evaluation that check the correctness of the policy ( matching the correct and your actions with all the states) on a given sample of the environment, possibly checking different discount factors. * Manual evaluation for clean code.

Evaluated criteria	max	note
Algorithm quality - value iteration	2.5	Evaluated by AE.
Algorithm quality - policy iteration	2.5	Evaluated by AE.
Code quality	1	Comments, structure, cleanliness, …

Automatic evaluation:

match with policy 95% and more (average on n-tested mazes): 2.5 points
match with policy 90%-95% : 2 points
match with policy 85%-90% : 1.5 points
match with policy 80%-85% : 1 point
match with policy 70%-80% : 0.5 point
less than 70% match: 0 points

Code quality (1 point):

suitable comments, or clear code for understanding
reasonable length of methods/functions
properly named functions (verbs) and variables (nouns)
no repeated, no copy-pasted code
reasonable processing space/time
consistent naming convention used for all methods, variables
consistent and easily readable structure
…

You can follow PEP8, although we do not check all PEP8 demands. Most of the IDEs (certainly PyCharm) point out mishaps with regards to PEP8. You can also read some other sources for inspiration about clean code (e.g., here) or about idiomatic python (e.g., medium, python.net).

Basic methods

For the communication with the MDPMaze environment you can use the following methods:

get_all_states() : returns all the accessible states (i.e. without wall separated tiles). These states are weighted with rewards (state.x, state.y, state.reward).

is_terminal_state(state) : 'True' if the given state is a terminal state. (However, it does not differentiate between the positively valued terminal state or the negatively valued terminal state, simply if it is any terminal state).

get_actions(state) : For a given state it returns a list of possible actions. The result is enum, see an example in mdp_sandbox.py

get_state_reward(state): Usually not necessary, the input is the named tuple (state.x, state.y).

get_next_states_and_probs(state, action) : Returns for a given state and action a list of tuples and probabilities ¹⁾

¹⁾

state.x, state.y), probability); e.g.: [((2,3), 0.3), ((3,3), 0.2)] state is a named tuple, either (state.x, state.y, state.reward) or (state.x, state.y). visualise(dictlist=None) : without a parameter it visualizes the usual maze. Otherwise it expects a list of dictionaries in the form {'x': x_coord, 'y': y_coord, 'value: val'}. The value val can be either a scalar value or a list/tuple with four members. You can specifically visualize:

rewards or values assigned to individual states - see env.visualise(get_visualisation_values(utils))
policy - see env.visualise(get_visualisation_values(policy))

The previously encountered render(), reset() methods are also available. Once again, you can look at mdp_sandbox.py to see how to use these methods, but write your code into mdp_agent.py.

Table of Contents

Sequential decision processes (3rd assignment)

Points and deadline

Basic methods