11. Stochastic shortest path (SSP)

In this lab, we will solve an SSP problem using the algorithm ILAO*, which generalizes A*. Suppose we have the SSP problem defined by the following graph.

The problem asks which mode of transport one should take when traveling to work so as to minimize the expected travel time. The actions are depicted as black dots with their names and the time they take. If an action has several probabilistic outcomes, the respective outgoing edges are labeled with their probabilities. “home” is the initial state, and “work” is the only goal state.

We will solve the problem using the heuristic search algorithm ILAO*. For that, we need an admissible heuristic $h$. We can, for instance, define $h(s)$ as the travel time along the shortest path from $s$ to the goal state in the determinization of the problem, i.e., when each nondeterministic action is represented as a collection of deterministic ones, one per probabilistic outcome.
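As a sketch of how such a heuristic could be computed mechanically (the function and variable names below are ours, not part of the tutorial): list one deterministic edge per probabilistic outcome of every action, each carrying the action's cost, and run Dijkstra backwards from the goal.

```python
import heapq

def determinized_heuristic(edges, goal):
    """Sketch: h(s) = shortest travel time from s to the goal in the
    determinization.  `edges` is a list of (state, cost, successor)
    triples, one per probabilistic outcome of every action."""
    # Reversed deterministic edges: successor -> [(cost, state), ...]
    reverse = {}
    for s, cost, t in edges:
        reverse.setdefault(t, []).append((cost, s))

    # Dijkstra from the goal over the reversed graph.
    h = {goal: 0.0}
    queue = [(0.0, goal)]
    while queue:
        d, t = heapq.heappop(queue)
        if d > h[t]:
            continue  # stale queue entry
        for cost, s in reverse.get(t, []):
            if d + cost < h.get(s, float("inf")):
                h[s] = d + cost
                heapq.heappush(queue, (d + cost, s))
    return h
```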

Exercise 1: Compute the heuristic $h$ for each state.

Solution

To make the following computation more interesting, we will consider another, less-informed heuristic:

state | home | waiting | train | light | medium | heavy | work
$h$   | $21$ | $38$    | $35$  | $20$  | $20$   | $20$  | $0$

Note that this heuristic cannot distinguish between the traffic levels, as it predicts $20$ for each of the states light, medium, and heavy.

Recall that ILAO* maintains several data structures:

  • Envelope (all seen states) divided into internal (already expanded states) and fringe (not yet expanded states)
  • Value function $V$; for the fringe states we define $V(s)=h(s)$, and for the goal states we set $V(s)=0$.
  • Current solution graph $G$ (the states accessible from the initial state by the current greedy policy w.r.t. $V$)
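For concreteness, the code sketches in the rest of this section assume the problem data is bundled in a small container like the following. This is a hypothetical representation chosen for illustration, not part of the algorithm itself.

```python
from dataclasses import dataclass

@dataclass
class SSP:
    """Hypothetical container for the problem data used in the
    sketches below; all field names are illustrative."""
    init: str       # initial state
    goals: set      # goal states
    states: set     # all states of the problem
    acts: dict      # state -> list of applicable action names
    costs: dict     # action -> cost (travel time)
    outcomes: dict  # (state, action) -> {successor: probability}

    def applicable(self, s):
        return self.acts[s]

    def cost(self, a):
        return self.costs[a]

    def succ(self, s, a):
        return self.outcomes[(s, a)]
```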

It works as follows (a code sketch follows the list):

  1. Build the current solution graph $G$
  2. Perform VI on $G$ until the residual $res\leq\varepsilon$ or the greedy policy $\pi$ changes; we set $\varepsilon=0.1$.
  3. If $\pi$ changed, go to step 1
  4. Otherwise, return $\pi$
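A minimal sketch of this loop, assuming the SSP container from above; `build_solution_graph` (step 1) is sketched after the next paragraph, and all names are illustrative:

```python
def ilao_star(ssp, h, eps=0.1):
    """Sketch of the ILAO* loop.  For this small problem we assume all
    states are known up front, so V can be initialized eagerly."""
    # V(s) = h(s) for fringe states, 0 for goal states.
    V = {s: (0.0 if s in ssp.goals else h[s]) for s in ssp.states}
    while True:
        G, pi = build_solution_graph(ssp, V)         # step 1
        if not vi_until_stable(ssp, G, pi, V, eps):  # step 2
            return pi                                # step 4
        # step 3: pi changed, so rebuild the solution graph

def vi_until_stable(ssp, G, pi, V, eps):
    """Step 2: value iteration on G until the residual drops to eps
    or the greedy policy changes; returns True iff pi changed."""
    while True:
        res = 0.0
        for s in G:  # G contains only non-goal states
            q = {a: ssp.cost(a)
                    + sum(p * V[t] for t, p in ssp.succ(s, a).items())
                 for a in ssp.applicable(s)}
            best = min(q, key=q.get)
            res = max(res, abs(q[best] - V[s]))
            V[s] = q[best]
            if best != pi[s]:
                pi[s] = best
                return True        # policy changed: back to step 1
        if res <= eps:
            return False           # converged and pi is stable
```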

The solution graph $G$ is a subset of states making the current greedy policy $\pi$ closed, i.e., $succ(\pi(s))\subseteq G$ for all non-goal $s\in G$. $G$ is built using a depth-first search: it maintains a stack of non-goal states to be expanded and remembers the set of visited states to avoid cycles. When a state $s$ is expanded, we compute for each applicable action $a$ the value \[Q(s,a)=cost(a) + \sum_{t\in succ(a)} P(t|s,a)\cdot V(t)\] and then define the greedy action \[\pi(s)=\mathrm{argmin}_{a\in A} Q(s,a).\] Only the non-goal states in $succ(\pi(s))$ are added to the stack.
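A sketch of this construction, again with illustrative names and the SSP container from above:

```python
def build_solution_graph(ssp, V):
    """Build the solution graph G (non-goal states reachable from the
    initial state under the greedy policy w.r.t. V) by depth-first
    search, together with the greedy policy pi itself."""
    G, pi = set(), {}
    stack = [] if ssp.init in ssp.goals else [ssp.init]
    while stack:
        s = stack.pop()
        if s in G:            # visited set: avoids cycles
            continue
        G.add(s)
        # Greedy action: pi(s) = argmin_a Q(s, a).
        q = {a: ssp.cost(a)
                + sum(p * V[t] for t, p in ssp.succ(s, a).items())
             for a in ssp.applicable(s)}
        pi[s] = min(q, key=q.get)
        # Only non-goal successors of the greedy action are expanded.
        stack.extend(t for t in ssp.succ(s, pi[s]) if t not in ssp.goals)
    return G, pi
```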

Exercise 2: Build the first solution graph $G$.

Solution

Exercise 3: Perform VI on $G$.

Solution

Exercise 4: Build the new solution graph $G$.

Solution

Exercise 5: Perform VI on $G$.

Solution
