===== 12. Heuristics for SSP =====

This tutorial considers the factored representation of SSPs, i.e., probabilistic FDR tasks. The representation is the same as FDR except that actions may have probabilistic effects. In other words, each action can have several effects together with a probability distribution over these effects. As mentioned in the lectures, one way to obtain an admissible heuristic for such an SSP is to turn it into a usual FDR task by the all-outcome determinization and then compute any admissible heuristic on this FDR task. Such heuristics completely disregard the probabilistic information we have. In this tutorial, however, we will focus on pattern databases for SSPs, which do take the probabilistic information into account.

Consider a probabilistic FDR task $\Pi=\langle V,A,s_0,g\rangle$ modeling a course exam that we want to pass with minimal effort, where

  * $V=\{tries, level, grade, done\}$,
  * $\mathsf{dom}(tries)=\{0,1,2,3\}$, expressing the number of tries we have left to pass the exam,
  * $\mathsf{dom}(level)=\{0,1,2,3\}$, expressing the knowledge level; the higher the number, the better our knowledge,
  * $\mathsf{dom}(grade)=\{A,B,C,D,E,F,\bot\}$, where $\bot$ means that we do not have a grade yet,
  * $\mathsf{dom}(done)=\{T,F\}$, where $T$ means that we have passed the exam and $F$ that we have not,
  * $s_0=\{tries=3,level=0,grade=\bot,done=F\}$,
  * $g=\{done=T\}$,
  * $A=\{study_i\mid i\in\{0,1,2\}\}\cup \{exam_{ij}\mid i\in\{0,1,2,3\}, j\in\{1,2,3\}\}\cup\{repeat\}$.

^ action ^ pre ^ eff ^ cost ^
| $study_i$ | $level=i$ | $level=i+1$ | $10$ |
| $repeat$ | $tries=0$ | $tries=2, grade=\bot$ | $100$ |
| $exam_{0j}$ | $level=0, tries=j$ | $0.1:\ tries=j-1, grade=E, done=T\\ 0.9:\ tries=j-1, grade=F, done=F$ | $4$ |
| $exam_{1j}$ | $level=1, tries=j$ | $0.1:\ tries=j-1, grade=C, done=T\\ 0.2:\ tries=j-1, grade=D, done=T\\ 0.3:\ tries=j-1, grade=E, done=T\\ 0.4:\ tries=j-1, grade=F, done=F$ | $3$ |
| $exam_{2j}$ | $level=2, tries=j$ | $0.1:\ tries=j-1, grade=B, done=T\\ 0.2:\ tries=j-1, grade=C, done=T\\ 0.3:\ tries=j-1, grade=D, done=T\\ 0.2:\ tries=j-1, grade=E, done=T\\ 0.2:\ tries=j-1, grade=F, done=F$ | $2$ |
| $exam_{3j}$ | $level=3, tries=j$ | $0.2:\ tries=j-1, grade=A, done=T\\ 0.3:\ tries=j-1, grade=B, done=T\\ 0.2:\ tries=j-1, grade=C, done=T\\ 0.2:\ tries=j-1, grade=D, done=T\\ 0.1:\ tries=j-1, grade=E, done=T$ | $1$ |

The action $study_i$ is deterministic and increases our knowledge level from $i$ to $i+1$. We must take the action $repeat$ if we have no remaining tries to pass the exam; in that case, we must repeat the whole course, which is quite expensive. Finally, the actions $exam_{ij}$ are the only probabilistic ones. They decrease the number of tries by one and set a grade depending on the knowledge level. If the grade is better than $F$, they also set $done=T$.

**Exercise 1:** Consider the pattern $P=\{level, done\}$ and draw the transition graph of the syntactic projection $\Pi_P$.

++++ Solution|
The states of the projection can be viewed as pairs $\{0F, 1F, 2F, 3F, 0T, 1T, 2T, 3T\}$ representing the values of $level$ and $done$, respectively. The action $repeat$ is irrelevant for the projection as it does not affect any variable in $P$. For the remaining actions, we have (recall that we have to sum up the probabilities of all effects that yield the same outcome in the projection):

^ action ^ pre ^ eff ^ cost ^
| $study_i$ | $level=i$ | $level=i+1$ | $10$ |
| $exam_{0}$ | $level=0$ | $0.1:\ done=T\\ 0.9:\ done=F$ | $4$ |
| $exam_{1}$ | $level=1$ | $0.6:\ done=T\\ 0.4:\ done=F$ | $3$ |
| $exam_{2}$ | $level=2$ | $0.8:\ done=T\\ 0.2:\ done=F$ | $2$ |
| $exam_{3}$ | $level=3$ | $done=T$ | $1$ |

Note that the projection of $exam_{ij}$ is independent of the number of tries $j$. Thus, we have only the actions $exam_i$.

{{ :courses:pui:tutorials:ppdb1.png?600 |}}
++++
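If you prefer to see the merging of outcomes spelled out in code, below is a minimal Python sketch of the syntactic projection of a single probabilistic action: variables outside the pattern are dropped from the precondition and the effects, and the probabilities of effects that become identical are summed. The function name ''project_action'' and the dictionary encoding are illustrative only; they are not taken from any course codebase.

<code python>
from collections import defaultdict

def project_action(pre, outcomes, pattern):
    """Project a probabilistic action onto a pattern (illustrative sketch).

    pre      -- dict mapping variables to values (the precondition)
    outcomes -- list of (probability, effect dict) pairs
    pattern  -- set of variables kept by the projection
    """
    proj_pre = {v: val for v, val in pre.items() if v in pattern}
    merged = defaultdict(float)
    for prob, eff in outcomes:
        # Keep only the pattern variables; outcomes that coincide are merged.
        key = tuple(sorted((v, val) for v, val in eff.items() if v in pattern))
        merged[key] += prob
    return proj_pre, [(p, dict(key)) for key, p in merged.items()]

# Projecting exam_{11} onto P = {level, done} merges the three done=T outcomes:
pre = {"level": 1, "tries": 1}
outcomes = [(0.1, {"tries": 0, "grade": "C", "done": "T"}),
            (0.2, {"tries": 0, "grade": "D", "done": "T"}),
            (0.3, {"tries": 0, "grade": "E", "done": "T"}),
            (0.4, {"tries": 0, "grade": "F", "done": "F"})]
print(project_action(pre, outcomes, {"level", "done"}))
# ({'level': 1}, [(0.6, {'done': 'T'}), (0.4, {'done': 'F'})])  up to float rounding
</code>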
**Exercise 2:** Compute the pattern database for the pattern $P=\{level,done\}$ from the previous exercise. Evaluate the heuristic $h_P(s_0)$ for the initial state.

++++ Solution|
To compute the pattern database, we must compute the optimal value function $V^*_P$ on all non-goal states. The optimal value $V_P^*(3F)=1$ because the only applicable action in $3F$ is $exam_3$, which reaches a goal state deterministically at cost $1$. For the remaining non-goal states, we perform value iteration (VI) initialized with the heuristic given by the shortest deterministic path.

^ iteration ^ $V_P^*(2F)$ ^ $V_P^*(1F)$ ^ $V_P^*(0F)$ ^
| $0$ | $2$ | $3$ | $4$ |
| $1$ | $2+0.2\cdot 2=2.4$ | $3+0.4\cdot 3=4.2$ | $4+0.9\cdot 4=7.6$ |
| $2$ | $2+0.2\cdot 2.4=2.48$ | $3+0.4\cdot 4.2=4.68$ | $4+0.9\cdot 7.6=10.84$ |
| $3$ | $2+0.2\cdot 2.48=2.496$ | $3+0.4\cdot 4.68=4.872$ | $\min\{4+0.9\cdot 10.84,10+4.68\}=13.756$ |
| $4$ | $2+0.2\cdot 2.496=2.4992$ | $3+0.4\cdot 4.872=4.9488$ | $\min\{4+0.9\cdot 13.756,10+4.872\}=14.872$ |
| $\infty$ | $2.5$ | $5$ | $15$ |

The values are computed by the Bellman update operator. As the cost of $study_i$ is quite high (namely $10$), it is better to take $exam_i$ immediately during the initial iterations of the VI. However, this changes for $0F$ in the $4$-th iteration, namely
\[V^*_P(0F)=\min\{4+0.9\cdot 13.756,10+4.872\}=14.872,\]
so the action $study_0$ provides a better value. The limiting values can be computed from the optimality equation for $V^*_P$. For instance, $V_P^*(2F)=2+0.2\cdot V^*_P(2F)$ implies $V^*_P(2F)=2.5$. Similarly, $V^*_P(1F)=5$, and thus $V^*_P(0F)=10+5=15$. Hence the optimal policy for $\Pi_P$ is to study once, $\pi^*(0F)=study_0$, and then take the exam, $\pi^*(1F)=exam_1$. We have $h_P(s_0)=15$.
++++
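The VI above can also be checked numerically. The following is a minimal Python sketch, assuming a compact encoding where a non-goal state $iF$ is represented just by its level $i$ (goal states have value $0$ and are left implicit); the constants and function names are ours, purely for illustration.

<code python>
STUDY_COST = 10
EXAM_COST = {0: 4, 1: 3, 2: 2, 3: 1}
FAIL_PROB = {0: 0.9, 1: 0.4, 2: 0.2, 3: 0.0}  # probability of staying in iF after exam_i

def bellman_update(V):
    """One synchronous Bellman update over the non-goal states 0F, 1F, 2F.

    V maps the level i of a non-goal state iF to its current value estimate;
    the value of 3F is fixed to 1 (only exam_3 is applicable there)."""
    new_V = dict(V)
    for lvl in (0, 1, 2):
        q_exam = EXAM_COST[lvl] + FAIL_PROB[lvl] * V[lvl]         # retry on failure
        q_study = STUDY_COST + (1.0 if lvl == 2 else V[lvl + 1])  # move to (lvl+1)F
        new_V[lvl] = min(q_exam, q_study)
    return new_V

V = {0: 4.0, 1: 3.0, 2: 2.0}   # initialization: shortest deterministic path
for _ in range(100):
    V = bellman_update(V)
print(V)   # converges towards {0: 15, 1: 5, 2: 2.5}, hence h_P(s_0) = 15
</code>

The update is synchronous, i.e., exactly as in the table above: each iteration uses only the values from the previous one.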
**Exercise 3:** Consider the pattern $P=\{tries,level,done\}$. Is it true that the optimal policy in $\Pi_P$ also recommends studying once and then taking the exam, as in the previous exercise?

++++ Solution|
We first write down the actions of the projection and draw the transition graph of $\Pi_P$.

^ action ^ pre ^ eff ^ cost ^
| $study_i$ | $level=i$ | $level=i+1$ | $10$ |
| $repeat$ | $tries=0$ | $tries=2$ | $100$ |
| $exam_{0j}$ | $level=0, tries=j$ | $0.1:\ tries=j-1, done=T\\ 0.9:\ tries=j-1, done=F$ | $4$ |
| $exam_{1j}$ | $level=1, tries=j$ | $0.6:\ tries=j-1, done=T\\ 0.4:\ tries=j-1, done=F$ | $3$ |
| $exam_{2j}$ | $level=2, tries=j$ | $0.8:\ tries=j-1, done=T\\ 0.2:\ tries=j-1, done=F$ | $2$ |
| $exam_{3j}$ | $level=3, tries=j$ | $tries=j-1, done=T$ | $1$ |

{{ :courses:pui:tutorials:ppdb2.png?800 |}}

The states are denoted as triples of the respective values of $tries$, $level$, and $done$. For instance, $30F$ is the state $tries=3,level=0,done=F$. The $study_i$ actions are denoted by black arrows, the blue arrows are the actions $exam_{ij}$, and the action $repeat$ is depicted in green. To answer the question, we must evaluate, at least approximately, the optimal value function $V^*_P(30F)$.

Note that the initial approximation by the shortest deterministic path gives $V^*_P(30F)=4$, $V^*_P(31F)=3$, $V^*_P(32F)=2$, and $V^*_P(i3F)=1$ for all $i>0$. Next, we compute the Bellman update for the states where we must take the action $repeat$:
\begin{align*}
V^*_P(00F) &= 100 + 4 = 104\\
V^*_P(01F) &= 100 + 3 = 103\\
V^*_P(02F) &= 100 + 2 = 102
\end{align*}
Next, we deal with the states having one try left for the exam:
\begin{align*}
Q(12F,exam_{21}) &= 2 + 0.2\cdot 102 = 22.4 & Q(12F,study_2) &= 10+1 = 11 & V^*_P(12F) &= 11\\
Q(11F,exam_{11}) &= 3 + 0.4\cdot 103 = 44.2 & Q(11F,study_1) &= 10+11 = 21 & V^*_P(11F) &= 21\\
Q(10F,exam_{01}) &= 4 + 0.9\cdot 104 = 97.6 & Q(10F,study_0) &= 10+21 = 31 & V^*_P(10F) &= 31
\end{align*}
In other words, if we have only a single try left, we would rather study three times to be safe. Next, we continue with the states having two tries:
\begin{align*}
Q(22F,exam_{22}) &= 2 + 0.2\cdot 11 = 4.2 & Q(22F,study_2) &= 10+1 = 11 & V^*_P(22F) &= 4.2\\
Q(21F,exam_{12}) &= 3 + 0.4\cdot 21 = 11.4 & Q(21F,study_1) &= 10+4.2 = 14.2 & V^*_P(21F) &= 11.4\\
Q(20F,exam_{02}) &= 4 + 0.9\cdot 31 = 31.9 & Q(20F,study_0) &= 10+11.4 = 21.4 & V^*_P(20F) &= 21.4
\end{align*}
In other words, it pays off to study once if we have two tries. Finally, if we have three tries, we get:
\begin{align*}
Q(32F,exam_{23}) &= 2 + 0.2\cdot 4.2 = 2.84 & Q(32F,study_2) &= 10+1 = 11 & V^*_P(32F) &= 2.84\\
Q(31F,exam_{13}) &= 3 + 0.4\cdot 11.4 = 7.56 & Q(31F,study_1) &= 10+2.84 = 12.84 & V^*_P(31F) &= 7.56\\
Q(30F,exam_{03}) &= 4 + 0.9\cdot 21.4 = 23.26 & Q(30F,study_0) &= 10+7.56 = 17.56 & V^*_P(30F) &= 17.56
\end{align*}
Note that the values $V^*_P(32F)=2.84$, $V^*_P(31F) = 7.56$, and $V^*_P(30F)= 17.56$ are already the final ones (the short code sketch at the end of this section reproduces this last round of backups). Even if we performed another VI iteration and thereby increased the values of $V^*_P(00F)$, $V^*_P(01F)$, and $V^*_P(02F)$, the computation above would not change: in the one-try states, studying is strictly preferred and its Q-value does not depend on the $repeat$ states, so the values of all states with two or three tries remain the same.

Thus, the optimal policy recommends studying in the state $30F$, which gets us to $31F$. Next, it recommends taking the exam. Either we pass, or we end up in $21F$. In $21F$, the policy suggests taking the exam again. Either we succeed, or we get to $11F$. In $11F$, we must study twice to reach $13F$, where we can safely take the exam. So, for the initial state, the answer to the question is "YES": study once and then take the exam.
++++

**Exercise 4:** Consider the pattern $P=\{tries,done\}$ and compute the heuristic value for the initial state $s_0$.

++++ Solution|
The actions of the projection:

^ action ^ pre ^ eff ^ cost ^
| $repeat$ | $tries=0$ | $tries=2$ | $100$ |
| $exam_{0j}$ | $tries=j$ | $0.1:\ tries=j-1, done=T\\ 0.9:\ tries=j-1, done=F$ | $4$ |
| $exam_{1j}$ | $tries=j$ | $0.6:\ tries=j-1, done=T\\ 0.4:\ tries=j-1, done=F$ | $3$ |
| $exam_{2j}$ | $tries=j$ | $0.8:\ tries=j-1, done=T\\ 0.2:\ tries=j-1, done=F$ | $2$ |
| $exam_{3j}$ | $tries=j$ | $tries=j-1, done=T$ | $1$ |

The initial state in $\Pi_P$ is $3F$, i.e., three tries and not done. Since the variable $level$ is projected away, the action $exam_{33}$ is applicable in $3F$; it is deterministic, costs $1$, and takes us directly to the goal state $2T$. We cannot do better than that, so $h_P(s_0)=1$.
++++
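Finally, revisiting Exercise 3, here is a small Python sketch, again with purely illustrative names, that reproduces the last round of Q-value backups for the three-tries states; the values of the two-tries states and of $33F$ are taken verbatim from the solution above.

<code python>
STUDY_COST = 10
EXAM_COST = {0: 4, 1: 3, 2: 2, 3: 1}
FAIL_PROB = {0: 0.9, 1: 0.4, 2: 0.2, 3: 0.0}

# Values already fixed in the solution of Exercise 3; keys are (tries, level)
# of the non-goal states with done = F.
V = {(2, 0): 21.4, (2, 1): 11.4, (2, 2): 4.2, (3, 3): 1.0}

def q_exam(tries, level):
    """Q-value of taking exam_{level,tries}: on failure we keep the level but lose a try."""
    return EXAM_COST[level] + FAIL_PROB[level] * V[(tries - 1, level)]

def q_study(tries, level):
    """Q-value of study_level: pay 10 and move one knowledge level up."""
    return STUDY_COST + V[(tries, level + 1)]

for level in (2, 1, 0):                   # backups for 32F, 31F, 30F
    V[(3, level)] = min(q_exam(3, level), q_study(3, level))

print(V[(3, 2)], V[(3, 1)], V[(3, 0)])    # 2.84, 7.56, 17.56 (up to float noise)
</code>

Since these backups use only values that are already final, the printed numbers are exactly the pattern-database entries for these states.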