===== 11. Stochastic shortest path (SSP) =====

In this lab, we will solve an SSP problem using the algorithm ILAO*, which generalizes A*. Suppose we have the SSP problem defined by the following graph.

{{ :courses:pui:tutorials:ssp.png?800 |}}

The problem is to decide which mode of transport to take when traveling to work so that the expected travel time is minimized. The actions are depicted as black dots labeled with their names and the time they take. If an action has several probabilistic outcomes, the respective outgoing edges are labeled with the probabilities. "home" is the initial state, and "work" is the only goal state.

We will solve the problem using the heuristic search algorithm ILAO*. For that, we need an admissible heuristic $h$. We can, for instance, define $h(s)$ as the time of the shortest path from $s$ to the goal state if we represent each nondeterministic action as a collection of deterministic ones.

**Exercise 1:** Compute the heuristic $h$ for each state.

++++ Solution|
^ state ^ home ^ waiting ^ train ^ light ^ medium ^ heavy ^ work ^
| $h$ | $21$ | $38$ | $35$ | $20$ | $30$ | $70$ | $0$ |
++++

To make the following computation more interesting, we will consider another, less informed heuristic:

^ state ^ home ^ waiting ^ train ^ light ^ medium ^ heavy ^ work ^
| $h$ | $21$ | $38$ | $35$ | $20$ | $20$ | $20$ | $0$ |

Note that we cannot distinguish between the traffic levels, as the heuristic always predicts $20$ for the states light, medium, and heavy.

Recall that ILAO* maintains several data structures:

  * Envelope (all seen states), divided into internal (already expanded) states and fringe (not yet expanded) states
  * Value function $V$; for the fringe states we set $V(s)=h(s)$, and for the goal states $s$ we have $V(s)=0$
  * Current solution graph $G$ (the states reachable from the initial state by the current greedy policy w.r.t. $V$)

It works as follows:

  - Build the current solution graph $G$.
  - Perform VI on $G$ until the residual $res\leq\varepsilon$ or the greedy policy $\pi$ changes. We set $\varepsilon=0.1$.
  - If $\pi$ changed, go to 1.
  - Otherwise, return $\pi$.

The solution graph $G$ is a subset of states on which the current greedy policy $\pi$ is closed, i.e., $succ(\pi(s))\subseteq G$ for all $s\in G$. $G$ is built using a depth-first search, i.e., we maintain a stack of non-goal states to be expanded and remember the set of visited states to avoid cycles. When a state $s$ is expanded, we consider its successors $succ(a)$ for each applicable action $a$ and compute the greedy action. First, we compute
\[Q(s,a)=cost(a) + \sum_{t\in succ(a)} P(t|s,a)\cdot V(t).\]
Next, we define
\[\pi(s)=\mathrm{argmin}_{a\in A} Q(s,a).\]
Only the non-goal states in $succ(\pi(s))$ are added to the stack.

**Exercise 2:** Build the first solution graph $G$.

++++ Solution|
We start in the initial state "home" and expand it. There are three applicable actions: railway, car, and bike. Applying the actions, we get the following successors:

^ action ^ railway ^ car ^ bike ^
| $succ$ | waiting, train | light, medium, heavy | work |
| $Q(s,a)$ | $37.3$ | $21$ | $45$ |

The $Q$-value for railway is computed as $2+0.9\cdot 35+0.1\cdot 38=37.3$. The $Q$-value for car is computed as $1+0.1\cdot 20+0.6\cdot 20 + 0.3\cdot 20=21$. The $Q$-value for bike is computed as $45+1\cdot 0=45$. Thus, the greedy action is car, as it has the least $Q$-value. We expand the successors of car, i.e., light, medium, and heavy. They all have only a single deterministic action leading to the goal state.

Thus, the solution graph $G$ consists of the internal states home, light, medium, heavy, and the fringe states work, train, waiting.
++++
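To make the greedy-action computation concrete, below is a small Python sketch of this step. It is only an illustration, not the lab's reference code: the action costs, outcome probabilities, and heuristic values are hard-coded from the graph and the less informed heuristic above, and the helper names (''q_value'', ''actions_home'') are made up for this snippet.

<code python>
# Sketch: greedy action at "home" (Exercise 2), under the assumption that the
# hard-coded transition model below matches the tutorial graph.
# V(fringe) = h(s) with the less informed heuristic; V(work) = 0.
V = {"waiting": 38, "train": 35, "light": 20, "medium": 20, "heavy": 20, "work": 0}

# Applicable actions at "home": cost(a) and outcome distribution P(t | home, a).
actions_home = {
    "railway": (2,  {"train": 0.9, "waiting": 0.1}),
    "car":     (1,  {"light": 0.1, "medium": 0.6, "heavy": 0.3}),
    "bike":    (45, {"work": 1.0}),
}

def q_value(cost, outcomes):
    """Q(s, a) = cost(a) + sum over t of P(t | s, a) * V(t)."""
    return cost + sum(p * V[t] for t, p in outcomes.items())

Q = {a: q_value(cost, out) for a, (cost, out) in actions_home.items()}
greedy = min(Q, key=Q.get)

print({a: round(q, 3) for a, q in Q.items()})  # {'railway': 37.3, 'car': 21.0, 'bike': 45.0}
print(greedy)                                  # car
</code>

Running the snippet reproduces the table in the solution above: car is greedy with $Q=21$, so its successors light, medium, and heavy are expanded next.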
**Exercise 3:** Perform VI on $G$.

++++ Solution|
VI iteratively performs the Bellman backups on the states in $G$. It is best to perform the backups in post-order of the depth-first traversal, as the values $V(s)$ are then propagated backward to the initial state. For this $G$, a single iteration suffices, as there are no cycles.

^ state ^ work ^ light ^ medium ^ heavy ^ home ^
| $V(s)$ | $0$ | $20$ | $30$ | $70$ | $37.3$ |

We have $V(work)=0$ as work is the goal. For light, we have $V(light)=20+1\cdot V(work)=20$, and similarly for medium and heavy. For home, we must again consider all applicable actions and compute their $Q$-values:

^ action ^ railway ^ car ^ bike ^
| $succ$ | waiting, train | light, medium, heavy | work |
| $Q(s,a)$ | $37.3$ | $42$ | $45$ |

where $Q(home,car)=1+0.1\cdot V(light)+0.6\cdot V(medium)+0.3\cdot V(heavy)=1+2+18+21=42$. Thus $V(home)=37.3$. The greedy policy changed, namely $\pi(home)=railway$. So, we must rebuild the solution graph $G$.
++++

**Exercise 4:** Build the new solution graph $G$.

++++ Solution|
The greedy action railway generates the states train and waiting. The state train has only a single deterministic action leading to work. On the other hand, waiting has two applicable actions.

^ action ^ wait ^ go back ^
| $succ$ | waiting, train | home |
| $Q(s,a)$ | $38.3$ | $39.3$ |

where $Q(waiting, wait)=3+0.9\cdot h(train)+0.1\cdot h(waiting)=3+0.9\cdot 35+0.1\cdot 38=38.3$ and $Q(waiting, go\ back)=2+1\cdot V(home)=2+37.3=39.3$. Thus, wait is the greedy action, i.e., $\pi(waiting)=wait$. The action wait leads to already visited states: train and waiting.

Thus, the solution graph $G$ consists of the internal states home, train, waiting, and the fringe states work, light, medium, heavy.
++++

**Exercise 5:** Perform VI on $G$.

++++ Solution|
In the first iteration, we get

^ state ^ work ^ train ^ waiting ^ home ^
| $V(s)$ | $0$ | $35$ | $38.3$ | $37.33$ |

where $V(train) = 35 + 1\cdot V(work)=35$, $V(waiting)=3+0.9\cdot V(train)+0.1\cdot V(waiting)=3+0.9\cdot 35+0.1\cdot 38=38.3$, and $V(home)=\min_{a\in A} Q(home,a)=37.33$, where

^ action ^ railway ^ car ^ bike ^
| $succ$ | waiting, train | light, medium, heavy | work |
| $Q(s,a)$ | $37.33$ | $42$ | $45$ |

and $Q(home, railway) = 2+0.9\cdot V(train)+0.1\cdot V(waiting)=2+0.9\cdot 35+0.1\cdot 38.3=37.33$.

In the second iteration, we update only waiting and home:

^ state ^ work ^ train ^ waiting ^ home ^
| $V(s)$ | $0$ | $35$ | $38.33$ | $37.333$ |

where $V(waiting)=3+0.9\cdot V(train)+0.1\cdot V(waiting)=3+0.9\cdot 35+0.1\cdot 38.3=38.33$ and $V(home) = Q(home, railway) = 2+0.9\cdot V(train)+0.1\cdot V(waiting)=2+0.9\cdot 35+0.1\cdot 38.33=37.333$.

Note that all the values $V(s)$ for $s\in G$ changed by less than $\varepsilon=0.1$, and the greedy policy did not change (railway still has the least $Q$-value at home). Thus, we finish VI and return the greedy policy $\pi$:

^ state ^ home ^ waiting ^ train ^
| $\pi$ | railway | wait | relax |

The expected travel time is approximately $37.333$.
++++
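For completeness, here is a hedged Python sketch of the final VI run from Exercise 5. Again, it is not the lab's reference implementation: the transition model is hard-coded from the graph, the fringe states keep the values already computed in Exercise 3 ($V(light)=20$, $V(medium)=30$, $V(heavy)=70$), and the check whether the greedy policy changed is omitted because the policy stays railway/wait/relax throughout.

<code python>
# Sketch: value iteration on the final solution graph {home, waiting, train}
# (Exercise 5), assuming the hard-coded model below matches the tutorial graph.
EPS = 0.1

# Fixed values of the goal and fringe states (fringe values learned in Exercise 3).
V = {"work": 0, "light": 20, "medium": 30, "heavy": 70}
# Current values of the internal states of G.
V.update({"train": 35, "waiting": 38, "home": 37.3})

# cost(a) and P(t | s, a) for the internal states.
actions = {
    "train":   {"relax":   (35, {"work": 1.0})},
    "waiting": {"wait":    (3,  {"train": 0.9, "waiting": 0.1}),
                "go back": (2,  {"home": 1.0})},
    "home":    {"railway": (2,  {"train": 0.9, "waiting": 0.1}),
                "car":     (1,  {"light": 0.1, "medium": 0.6, "heavy": 0.3}),
                "bike":    (45, {"work": 1.0})},
}

def bellman(s):
    """min over applicable actions a of Q(s, a)."""
    return min(cost + sum(p * V[t] for t, p in out.items())
               for cost, out in actions[s].values())

residual = float("inf")
while residual > EPS:
    residual = 0.0
    # Backups in post-order: from the states closest to the goal back to home.
    for s in ("train", "waiting", "home"):
        new = bellman(s)
        residual = max(residual, abs(new - V[s]))
        V[s] = new

print({s: round(V[s], 3) for s in ("train", "waiting", "home")})
# {'train': 35.0, 'waiting': 38.33, 'home': 37.333}
</code>

The loop stops after the second sweep, when the largest change drops below $\varepsilon=0.1$, and the resulting value $V(home)\approx 37.333$ agrees with the expected travel time derived above.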