====== Sequential decision processes (3rd assignment) ======

Download the updated kuimaze package {{ :courses:b3b33kui:kuimaze:kuimaze.zip |}} (updated 2023.03.21).

Your task is to implement the functions ''find_policy_via_value_iteration(...)'' and ''find_policy_via_policy_iteration(...)'' with the following **inputs**:

  * ''find_policy_via_value_iteration(problem, discount_factor, epsilon)''
  * ''find_policy_via_policy_iteration(problem, discount_factor)''

where:

  * ''problem'' is the environment, an object of the type ''kuimaze.MDPMaze''
  * ''discount_factor'' is a number from the range ''(0,1)''
  * ''epsilon'' is the maximum permitted error of the value of each state (value iteration only)

The expected **output** is a dictionary where each key is a tuple (x, y) and the value is the optimal action (only for the accessible states; if the key is a terminal state, it is enough to return ''None'').

Implement the methods in the file ''mdp_agent.py'' and upload it to the [[https://cw.felk.cvut.cz/upload/|Upload system]]. **There is no need to alter any other file.**

You can look at ''mdp_sandbox.py'' in the downloaded kuimaze.zip package. It shows the basic work with ''MDPMaze'' and you can use it for inspiration.

Timeout: each run of value/policy iteration on a given problem instance is limited to at most 30 s.

====== Points and deadline ======

The deadline is again in the [[https://cw.felk.cvut.cz/upload/|Upload system]]. Evaluation is divided into:

  * Automatic evaluation that checks the correctness of the policy (matching your actions against the correct ones over all states) on a given sample of environments, possibly with different discount factors.
  * Manual evaluation of code cleanliness.

^ Evaluated criteria ^ min ^ max ^ note ^
| Algorithm quality - value iteration | 0 | 2.5 | Evaluated by AE. |
| Algorithm quality - policy iteration | 0 | 2.5 | Evaluated by AE. |
| Code quality | 0 | 1 | Comments, structure, cleanliness, ... |

Automatic evaluation:

  * match with policy 95% and more (average over n tested mazes): 2.5 points
  * match with policy 90%-95%: 2 points
  * match with policy 85%-90%: 1.5 points
  * match with policy 80%-85%: 1 point
  * match with policy 70%-80%: 0.5 point
  * less than 70% match: 0 points

Code quality (1 point):

  * suitable comments, or code clear enough to be understood without them
  * reasonable length of methods/functions
  * properly named functions (verbs) and variables (nouns)
  * no repeated or copy-pasted code
  * reasonable processing space/time
  * consistent naming convention used for all methods and variables
  * consistent and easily readable structure
  * ...

You can follow [[https://www.python.org/dev/peps/pep-0008/|PEP8]], although we do not check all PEP8 demands. Most IDEs (certainly PyCharm) point out violations of PEP8. You can also read other sources for inspiration about clean code (e.g., [[https://github.com/zedr/clean-code-python|here]]) or about idiomatic Python (e.g., [[https://medium.com/the-andela-way/idiomatic-python-coding-the-smart-way-cc560fa5f1d6|medium]], [[http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html|python.net]]).
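To make the required interface and output format concrete, here is a minimal, hedged sketch of ''mdp_agent.py''. It is **not** a solution: the placeholder bodies only illustrate the expected signatures and the shape of the returned dictionary (they simply pick the first available action for every non-terminal state).

<code python>
def find_policy_via_value_iteration(problem, discount_factor, epsilon):
    """Return a dict {(x, y): action} with the chosen action for every accessible state.

    Placeholder only -- replace the body with real value iteration.
    """
    policy = {}
    for state in problem.get_all_states():
        if problem.is_terminal_state(state):
            policy[(state.x, state.y)] = None           # terminal states map to None
        else:
            actions = list(problem.get_actions(state))  # enum of allowed actions
            policy[(state.x, state.y)] = actions[0]     # placeholder, NOT optimal
    return policy


def find_policy_via_policy_iteration(problem, discount_factor):
    """Return a dict {(x, y): action}; placeholder, replace with real policy iteration."""
    # The placeholder behaviour is identical to the value-iteration stub above.
    return find_policy_via_value_iteration(problem, discount_factor, epsilon=1.0e-4)
</code>

The automatic evaluation compares this dictionary with the reference policy, so the keys must be plain (x, y) tuples even though the environment hands you ''State'' named tuples.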
===== Provided methods and functions =====

Description of the variable ''state'':

  * If ''state'' is an input to a function or method, it can either be a pair of positive integers (x, y), or an instance of the ''kuimaze.State'' class, which is a [[https://docs.python.org/3/library/collections.html#collections.namedtuple|named tuple]] and allows access to the coordinates as ''state.x'' and ''state.y''.
  * If a ''state'' is returned by a function, it is always an instance of the ''State'' class, which you can treat in your code as a pair of values (x, y).

For the communication with the ''MDPMaze'' environment you can use the following methods:

  * ''get_all_states()'': returns all the accessible states (i.e., excluding the wall tiles). These states are instances of the ''State'' class.
  * ''is_terminal_state(state)'': returns ''True'' if the given state is a terminal state. (It does not differentiate between a positively and a negatively valued terminal state; it only says whether the state is terminal at all.)
  * ''get_reward(state)'': returns the reward for the given state. The reward is obtained only when leaving the state, not when it is reached.
  * ''get_actions(state)'': for a given state, returns an [[https://docs.python.org/3.4/library/enum.html|enum]] of all possible actions, see an example in ''mdp_sandbox.py'' or in the examples below. To get a list of possible actions, you can use ''list(get_actions(state))''.
  * ''get_next_states_and_probs(state, action)'': for a given state and action, returns a list of (state, probability) pairs describing the possible successor states and the probability of ending up in each of them, e.g. ''[(State(x=1, y=0), 0.8), (State(x=2, y=0), 0.1), (State(x=0, y=0), 0.1)]''.
  * ''visualise(dictlist=None)'': without a parameter it visualizes the usual maze. Otherwise it expects a list of dictionaries in the form ''{'x': x_coord, 'y': y_coord, 'value': val}''. The value ''val'' can be either a scalar or a list/tuple with four elements. You can use it, for example, to visualize:
    - rewards or values assigned to individual states - see ''env.visualise(get_visualisation_values(utils))''
    - the policy - see ''env.visualise(get_visualisation_values(policy))''

The previously encountered ''render()'' and ''reset()'' methods are also available. Once again, you can look at ''mdp_sandbox.py'' to see how to use these methods, but write your code into ''mdp_agent.py''.

===== Examples of use =====

Create a simple maze map:

  >>> EMPTY = (255, 255, 255)
  >>> WALL = (0, 0, 0)
  >>> GOAL = (255, 0, 0)
  >>> START = (0, 0, 255)
  >>> MAP1 = ((EMPTY, START, EMPTY, EMPTY, EMPTY), (EMPTY, WALL, WALL, WALL, EMPTY), (EMPTY, EMPTY, EMPTY, WALL, GOAL))

Import the kuimaze package and the ''State'' class:

  >>> import kuimaze
  >>> from kuimaze import State

Create an environment, deterministic at first:

  >>> env = kuimaze.MDPMaze(MAP1)

If we want to create a non-deterministic environment (and we usually do in the case of an MDP), we need to specify the transition probabilities:

  >>> env2 = kuimaze.MDPMaze(MAP1, probs=(0.8, 0.1, 0.1, 0.0))

List all valid states in the environment:

  >>> env.get_all_states()
  [(x=0, y=0), (x=0, y=1), (x=0, y=2), (x=1, y=0), (x=1, y=2), (x=2, y=0), (x=2, y=2), (x=3, y=0), (x=4, y=0), (x=4, y=1), (x=4, y=2)]

Determine whether a state is terminal:

  >>> env.is_terminal_state(State(0, 0)), env.is_terminal_state(State(4, 2))
  (False, True)
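In your own code, ''get_all_states()'' is typically used to build the value table that value/policy iteration update. Continuing the session above, a minimal sketch (the variable name ''values'' is mine, not part of kuimaze):

<code python>
# One value estimate per accessible state. The State objects returned by
# get_all_states() (and later by get_next_states_and_probs()) are named tuples,
# so they can be used directly as dictionary keys.
values = {state: 0.0 for state in env.get_all_states()}
</code>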
What rewards are associated with each state? Note that rewards are obtained when leaving the state.

  >>> env.get_reward(State(1,0)), env.get_reward(State(4,2))
  (-0.04, 1.0)

What actions are allowed in a state? In our environment, all 4 actions are always allowed, but if the agent hits a wall, it stays in the current state.

  >>> actions = tuple(env.get_actions(State(1,0)))
  >>> actions   # prints a tuple with the four members of the action enum

Which states can I reach, and with which probabilities, if I perform the given action in the current state?

  >>> env.get_next_states_and_probs(State(1,0), actions[0])
  [((x=1, y=0), 1), ((x=2, y=0), 0), ((x=0, y=0), 0)]

In a non-deterministic environment, the result will be different:

  >>> env2.get_next_states_and_probs(State(1,0), actions[0])
  [((x=1, y=0), 0.8), ((x=2, y=0), 0.1), ((x=0, y=0), 0.1)]
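The transition lists above are exactly what a one-step (Bellman) backup consumes. As an illustrative, hedged sketch (the helper name ''expected_action_value'' and the ''values'' table are mine, not part of kuimaze), the expected discounted value of performing ''action'' in ''state'' can be computed from the methods described earlier:

<code python>
def expected_action_value(env, values, state, action, discount_factor):
    """One-step lookahead: reward for leaving `state` plus the expected
    discounted value of the successor states under `action`.

    `values` maps each State (as returned by get_all_states()) to its
    current value estimate.
    """
    q = env.get_reward(state)  # the reward is collected when leaving the state
    for next_state, prob in env.get_next_states_and_probs(state, action):
        q += discount_factor * prob * values[next_state]
    return q
</code>

Value iteration then repeatedly replaces ''values[state]'' with the maximum of this quantity over ''list(env.get_actions(state))'' until the largest change over all states drops below a threshold derived from ''epsilon'' and the discount factor; policy iteration uses the same backup inside policy evaluation and improvement. Terminal states are typically not backed up. The placement of the reward term above follows the convention stated earlier (reward on leaving the state); double-check it against ''mdp_sandbox.py''.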