3. Markov Decision Processes

Your task is to implement the value iteration and policy iteration methods to find the optimal strategy (policy) for the given MDP.

Specification

In the module mdp_agent.py, implement two classes: ValueIterationAgent and PolicyIterationAgent.

The interface of both classes is identical; both must implement the following methods:

method      | input parameters                              | output parameters | explanation
__init__    | env: MDPProblem, gamma: float, epsilon: float | none              | Agent initialization.
find_policy | none                                          | Policy            | Returns the optimal strategy (policy), i.e., a dictionary of (state, action) pairs.
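
To make the required interface concrete, here is a minimal sketch matching the table above. The import path of MDPProblem and the concrete Policy type are assumptions; check kuimaze2 and example_mdp.py for the exact names.

    from kuimaze2 import MDPProblem  # assumed import path; verify in example_mdp.py

    # Per the specification, a policy is a dictionary mapping each state to an action.
    Policy = dict

    class ValueIterationAgent:  # PolicyIterationAgent exposes the same interface
        def __init__(self, env: MDPProblem, gamma: float, epsilon: float):
            self.env = env          # the MDP environment to solve
            self.gamma = gamma      # discount factor
            self.epsilon = epsilon  # convergence threshold

        def find_policy(self) -> Policy:
            """Return the optimal policy as a {state: action} dictionary."""
            ...

Both classes expose exactly this interface; only the algorithm inside find_policy differs.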

How to proceed

  1. We recommend creating a new working directory for this task and setting up the updated version of the kuimaze2 package in it.
  2. Familiarize yourself with the MDPProblem environment.
  3. The kuimaze2 package also contains the script example_mdp.py, which shows how to work with the environment; it can serve as starting code for the implementation of both classes.
  4. It is quite possible that both classes will share some code. In that case, we recommend (as indicated in example_mdp.py) extracting the shared parts into a common ancestor of both classes, as in the skeleton below (illustrative sketches of both find_policy methods follow this list):
    class MDPAgent:
        # Parts common to both methods/agents
        ...
     
    class ValueIterationAgent(MDPAgent):
        # Parts specific for value iteration
        ...
     
    class PolicyIterationAgent(MDPAgent):
        # Parts specific for policy iteration
        ...
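
For orientation, below are hedged sketches of find_policy for both agents. They assume that the common ancestor MDPAgent stores self.env, self.gamma, and self.epsilon in its __init__ (as in the interface sketch above), and that MDPProblem provides methods named get_states(), get_actions(state), get_next_states_and_probs(state, action), get_reward(state), and is_terminal(state). These method names, the exact reward signature (per state, per action, or per transition), and the handling of terminal states are assumptions; verify them against example_mdp.py before reusing anything.

    class ValueIterationAgent(MDPAgent):
        def find_policy(self) -> Policy:
            """Value iteration sketch; all env method names are assumptions."""
            states = list(self.env.get_states())                # assumed method
            values = {s: 0.0 for s in states}
            while True:
                delta = 0.0
                new_values = dict(values)
                for s in states:
                    if self.env.is_terminal(s):                 # assumed method
                        continue
                    # Bellman optimality update over all actions in state s.
                    new_values[s] = max(
                        sum(
                            p * (self.env.get_reward(ns)        # assumed reward form
                                 + self.gamma * values[ns])
                            for ns, p in self.env.get_next_states_and_probs(s, a)
                        )
                        for a in self.env.get_actions(s)        # assumed method
                    )
                    delta = max(delta, abs(new_values[s] - values[s]))
                values = new_values
                if delta < self.epsilon:   # stop once updates are small enough
                    break
            # Extract the greedy policy with respect to the converged values.
            policy = {}
            for s in states:
                if self.env.is_terminal(s):
                    policy[s] = None       # how terminals are reported: check the spec
                    continue
                policy[s] = max(
                    self.env.get_actions(s),
                    key=lambda a: sum(
                        p * (self.env.get_reward(ns) + self.gamma * values[ns])
                        for ns, p in self.env.get_next_states_and_probs(s, a)
                    ),
                )
            return policy

Policy iteration, sketched below under the same assumptions, alternates policy evaluation with greedy policy improvement and stops when the policy no longer changes.

    class PolicyIterationAgent(MDPAgent):
        def find_policy(self) -> Policy:
            """Policy iteration sketch; all env method names are assumptions."""
            states = list(self.env.get_states())
            values = {s: 0.0 for s in states}
            # Start from an arbitrary policy: pick any action in each state.
            policy = {
                s: (None if self.env.is_terminal(s)
                    else next(iter(self.env.get_actions(s))))
                for s in states
            }
            while True:
                # 1) Policy evaluation: iterate the Bellman equation for the
                #    current (fixed) policy until the values stop changing.
                while True:
                    delta = 0.0
                    for s in states:
                        if self.env.is_terminal(s):
                            continue
                        v = sum(
                            p * (self.env.get_reward(ns) + self.gamma * values[ns])
                            for ns, p in self.env.get_next_states_and_probs(s, policy[s])
                        )
                        delta = max(delta, abs(v - values[s]))
                        values[s] = v
                    if delta < self.epsilon:
                        break
                # 2) Policy improvement: act greedily w.r.t. the evaluated values.
                stable = True
                for s in states:
                    if self.env.is_terminal(s):
                        continue
                    best = max(
                        self.env.get_actions(s),
                        key=lambda a: sum(
                            p * (self.env.get_reward(ns) + self.gamma * values[ns])
                            for ns, p in self.env.get_next_states_and_probs(s, a)
                        ),
                    )
                    if best != policy[s]:
                        policy[s] = best
                        stable = False
                if stable:
                    return policy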

Submission

Evaluation

Learn about the evaluation and scoring of the task.