Online lectures are held on Mondays 16:15 in this MS Teams Channel.

In the lecture descriptions below, we refer to this **supplementary course material**:

**Relevant**

- UAI: *M. Hutter: Universal Artificial Intelligence, Springer 2005*
- AIMA: *S. Russell, P. Norvig: Artificial Intelligence: A Modern Approach, 3rd edition, Prentice Hall 2010*
- LRL: *L. de Raedt: Logical and Relational Learning, Springer 2008*
- ILP: *S.-H. Nienhuys-Cheng and R. de Wolf: Foundations of Inductive Logic Programming, Springer 1997*
- COLT: *M. J. Kearns, U. Vazirani: An Introduction to Computational Learning Theory, MIT Press 1994*

**Marginally Relevant**

- ESL: *T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning, Springer 2009*
- KC: *M. Li, P. Vitányi: An Introduction to Kolmogorov Complexity and Its Applications, Springer 2019*

Except for AIMA and COLT, the books above are available on SpringerLink for CVUT students. Click on the link and log in through “institutional access”, or access the portal through the CVUT library or from the CVUT IP domain, where no authorization is needed.

Regarding AIMA, unless stated otherwise, chapter references below are w.r.t. the 3rd-edition chapter numbering, which differs from the 4th edition under the link above.

You are strongly discouraged from using this course's materials from previous years, as you would run into confusion.

Complete set of lecture slides up to the last lecture so far.

**Notes**

- The lecture slides contain links to relevant exercise problems. For reasons I cannot influence, the links take you to the very bottom of the appropriate page, giving the impression that the *next* problem in the problem set is the one linked. So please scroll up, not down, after you have followed an exercise link.
- In the slide sets for individual lectures below, hyperlinks to out-of-lecture places are obviously broken. These sets are meant only for orientation; please use the above full set for study.

Here we introduce the basic concepts regarding a computational agent trying to operate intelligently in an unknown environment. We formalize the notions of actions, rewards, observations, utility, sequential vs. non-sequential decision making, decision policy, and classification. The learning scenarios later in this course will all use these concepts, and will all be special cases of the framework introduced in this lecture.

The framework we use is as in UAI. Chapter 1.4 of the book gives a brief account, Chapter 4 is more elaborate. The book uses the letters $\mu, o, x, V$ respectively for probability, observation, percept, and utility (=value) function; we use $P, x, xr$ and $U$.

The infinite version of the utility is the same in spirit as the utility introduced in AIMA in (21.1) (page 833), although we do not (yet) relate utility to a *state* (you may view this as our utility referring to the unique initial state). The letter $\pi$ for policy in the book corresponds to our function symbol $y$ of the function $y(x)$.
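For concreteness, one common form of such an infinite-horizon utility (a sketch; the lecture's exact indexing and discounting conventions may differ) is the discounted reward sum:

```latex
U \;=\; \sum_{t=1}^{\infty} \gamma^{t-1} r_t , \qquad 0 \le \gamma < 1 ,
```

where $r_t$ is the reward received at step $t$; AIMA's (21.1) is the same sum written over state rewards $R(s_t)$.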

We define the task of classification (finite decision set & instant rewards; not necessarily i.i.d. observations) and then focus on its simplest interesting case: *concept classification*, which is essentially binary classification without noise. We will define the notions of a *concept* and *hypothesis* and also the *mistake-bound model* of concept learning requiring that the learner makes only a polynomial (in the size of observations) number of classification mistakes. We will introduce the Winnow concept learning algorithm which uses a hyperplane-separation strategy. Then we will focus on an alternative strategy consisting in logical generalization of examples towards a hypothesis.
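As an illustration of the hyperplane-separation strategy, here is a minimal sketch of Winnow (the Winnow1 promotion/elimination variant; function names and the choice of threshold $\theta = n$ are illustrative assumptions, not necessarily the lecture's exact formulation):

```python
def winnow(n):
    """Online Winnow learner for monotone disjunctions over n Boolean attributes."""
    w = [1.0] * n          # one positive weight per attribute
    theta = n              # fixed separation threshold

    def predict(x):        # x is a 0/1 vector of length n
        return sum(wi * xi for wi, xi in zip(w, x)) >= theta

    def update(x, label):  # call after the true label is revealed
        if predict(x) == label:
            return
        # promote on a false negative, eliminate on a false positive (Winnow1)
        factor = 2.0 if label else 0.0
        for i in range(n):
            if x[i]:
                w[i] *= factor

    return predict, update
```

On a mistaken positive example the weights of active attributes are doubled; on a mistaken negative example they are zeroed, so irrelevant attributes are eliminated quickly.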

The theoretical concepts of logical generalization are treated in somewhat greater breadth in Chapter 5 of LRL and in much greater depth in Chapter 14 of ILP. However, these sources focus on *clauses (=disjunctions)*, whereas we start with more emphasis on *conjunctions*, which are easier from the cognitive viewpoint. Also, most of the focus in the latter sources is on *first-order logic* clauses, which we are yet to visit. The instant rewards we define for classification, including concept classification, correspond to special cases of (negative) *loss functions*, which are important in statistical learning (ESL); they are also studied in AIMA.

We will study in depth the two approaches to concept learning we introduced last week: Winnow and the generalization algorithm. We will prove their mistake bounds indicating that Winnow learns monotone disjunctions online from truth-value assignments and the generalization algorithm learns conjunctions online from contingent conjunctions (=truth-value assignments which may be *incomplete*). We will define when a concept class is *learnable* online. We will show two reduction techniques (attribute expansion, concept inversion) enabling us to learn additional concept classes beyond those already proven learnable. This includes DNFs and CNFs where the size of the included terms (clauses, respectively) is bounded by a constant.
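The generalization strategy for conjunctions can be sketched as the classical elimination algorithm (a simplified sketch assuming complete truth-value assignments; the lecture's version handles contingent, i.e. incomplete, observations):

```python
def learn_conjunction(n_attrs, examples):
    """Online elimination learner for conjunctions over Boolean attributes.
    A hypothesis is a set of literals (i, val) meaning attribute i must equal val.
    Start most specific; drop literals falsified by positive examples."""
    h = None        # None = no positive example seen yet (predict negative everywhere)
    mistakes = 0
    for x, label in examples:
        pred = h is not None and all(x[i] == v for i, v in h)
        if pred != label:
            mistakes += 1
        if label:   # generalize on each positive example
            if h is None:
                h = {(i, x[i]) for i in range(n_attrs)}
            else:
                h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes
```

Each mistake on a positive example removes at least one literal, which is the core of the polynomial mistake bound.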

The original proof of the Winnow mistake bound (somewhat more complex and broader than our demonstration) is in the original paper by Littlestone (pg. 300). General survey papers on computational learning theory are linked from Wikipedia; they mostly focus on the PAC-learning model, which we are yet to visit. The attribute expansion technique we use is roughly analogous to the *basis expansion* method used in statistical learning and studied e.g. in ESL.

We will prove the online learnability of DNFs and CNFs where the number of included terms (clauses, respectively) is bounded by a constant. Then we consider learning a clausal hypothesis from clausal observations and show that this can be accomplished with the generalization algorithm just as defined in Lecture 2. We will then consider a language for observations and hypotheses that is stronger than propositional logic, in particular first-order predicate logic (FOL). We extend the definitions of subsumption and least general generalization to FOL conjunctions and FOL clauses, and present an algorithm to compute a least general generalization in the FOL case. We will see that using the latter algorithm, rather expressive knowledge can be learned through the generalization strategy. Unfortunately, this increased expressiveness makes it impossible to prove a mistake bound similar to the one we demonstrated in the propositional case.

To understand this lecture, knowledge of FOL is required at least at the level of the undergraduate course Logic and Graphs. If you lack that knowledge, please study the first two chapters of ILP.

Computation of a least general generalization of clauses was proposed in the seminal paper by G. Plotkin. (The proof presented therein is not part of material tested in the final exam.) The computation involves the anti-unification algorithm. The theoretical concepts of logical generalization are presented in somewhat greater breadth in Chapter 5 of LRL and in much greater depth in Chapter 14 of ILP.
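A minimal sketch of anti-unification of two FOL terms (the nested-tuple term representation and the fresh-variable naming scheme are assumptions for illustration; Plotkin's lgg of clauses applies this pairwise to compatible literals):

```python
def lgg(t1, t2, subst=None):
    """Least general generalization (anti-unification) of two FOL terms.
    Terms: constants/variables are strings, compound terms are tuples
    (functor, arg1, ...). subst maps each pair of disagreeing subterms to a
    fresh variable, so repeated disagreements reuse the same variable."""
    if subst is None:
        subst = {}
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        # same functor and arity: recurse into the arguments
        return (t1[0],) + tuple(lgg(a, b, subst) for a, b in zip(t1[1:], t2[1:]))
    if t1 == t2:
        return t1
    if (t1, t2) not in subst:          # same disagreement pair -> same variable
        subst[(t1, t2)] = f"V{len(subst)}"
    return subst[(t1, t2)]
```

For example, the lgg of f(a, g(a)) and f(b, g(b)) is f(V0, g(V0)): the two occurrences of the disagreement (a, b) map to the same variable.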

This week we will finish our brief excursion into learning using FOL as the representation language for observations and hypotheses. We will define when two FOL clauses are equivalent and when one is reduced. We will review examples of FOL least general generalization we introduced last week. Then we will explore the salient learning feature enabled by the FOL framework, in particular learning in the presence of **background knowledge** $B$, i.e., FOL knowledge the agent has before the concept-learning interaction starts. We will see that for some observations, the learned generalization does not make good sense, but when generalized with respect to $B$, a reasonable hypothesis is learned. To this end, we will introduce the notions of relative (to $B$) consequence, relative subsumption, and relative reduction. Finally, we look at learning a FOL size-bounded CNF from Herbrand *interpretations*, which are FOL analogues of propositional truth-value assignments and can be interpreted as *full observations* (unlike clausal or conjunctive examples). In this setting, we will prove online learnability, which we were not able to prove in the setting of learning from arbitrary FOL clauses or conjunctions.

This lecture is based mainly on Chapter 14 and Section 16.2 of ILP. The positive result on size bounded CNF was published in a paper by De Raedt and Džeroski.

In this lecture we turn our attention away from the structure of specific hypothesis classes and investigate the properties of learning agents working with arbitrary classes. Using the version-space algorithm (also known as the halving algorithm), we show that any finite hypothesis class is learnable online if it is a subset of the learner's hypothesis class and its size is at most exponential in the observation complexity. Then we introduce a property of a concept class called the VC-dimension and show that a polynomial VC-dimension is necessary for the concept class to be learnable online. We then adopt the assumption that observations are i.i.d., which allows us to define an alternative learnability model called PAC-learnability. The model requires the agent to find a low-error hypothesis with high probability. Using the notion of a 'standard agent', we show that mistake-bound learnability implies PAC-learnability. Then we prove that any PAC learner must necessarily be able to find a hypothesis consistent with all observations seen so far. Finally, we show that a hypothesis class is PAC-learnable if it is a subset of the learner's hypothesis class and either its VC dimension is polynomial or (for finite classes) its size is at most exponential in the observation complexity.
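The halving algorithm can be sketched as follows (a toy sketch over a finite class of predicate functions; the mistake bound $\log_2 |H|$ follows because each mistake removes at least half of the consistent hypotheses):

```python
def halving_learner(hypotheses, stream):
    """Version-space (halving) learner over a finite hypothesis class.
    hypotheses: list of functions x -> bool; stream: iterable of (x, label).
    Predicts by majority vote of the hypotheses still consistent with all
    observations seen so far."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in version_space)
        pred = votes * 2 >= len(version_space)   # majority vote (ties -> positive)
        if pred != label:
            mistakes += 1
        # keep only hypotheses consistent with the revealed label
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes
```

A usage example: with the class of four threshold functions $h_t(x) = [x \ge t]$, $t \in \{0,1,2,3\}$, and the target $t = 2$, the learner shrinks the version space to the single correct hypothesis.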

General survey papers on computational learning theory are linked from Wikipedia. The book COLT provides a more extensive coverage of PAC-learning.

This lecture will conclude the concept-learning part of the course and also our excursion into computational learning theory. We will define *proper* PAC-learning, which requires that the hypotheses an agent learns when PAC-learning a hypothesis class are themselves in that class. For example, using the large but 'easy' class k-CNF for learning the smaller but 'difficult' class k-term DNF is not allowed under proper PAC-learning. We will see that some classes, including depth-bounded *decision trees*, are PAC-learnable either efficiently or properly, but not both efficiently and properly. On the other hand, we will introduce the *decision lists* class, which formalizes the notion of a *rule set* and which is efficiently properly PAC-learnable. Finally, we will consider the case where consistent learning is not possible (e.g. due to noise in data) and thus we cannot learn in the PAC sense. We will show that the assumption of i.i.d. observations allows us to upper-bound the difference between the error of the learned hypothesis and its training error.
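A greedy learner for 1-decision lists in the spirit of Rivest's algorithm might look like this (a sketch; the rule representation and the left-to-right search order are illustrative assumptions):

```python
def find_rule(remaining, n_attrs):
    """Find an attribute-value test whose covered examples all share one label."""
    for i in range(n_attrs):
        for v in (0, 1):
            covered = [l for x, l in remaining if x[i] == v]
            if covered and len(set(covered)) == 1:
                return (i, v), covered[0]
    return None

def learn_decision_list(examples, n_attrs):
    """Greedily build a 1-decision list consistent with the examples.
    Returns rules ((attr, val), label) tried in order, ending with (None, label)."""
    rules, remaining = [], list(examples)
    while remaining:
        labels = {l for _, l in remaining}
        if len(labels) == 1:                  # all remaining agree: default rule
            rules.append((None, labels.pop()))
            return rules
        found = find_rule(remaining, n_attrs)
        if found is None:
            return None                       # target is not a 1-decision list
        test, label = found
        rules.append((test, label))
        remaining = [(x, l) for x, l in remaining if x[test[0]] != test[1]]
    return rules

def classify(rules, x):
    """Apply the rules in order; the default rule (None, label) always fires."""
    for test, label in rules:
        if test is None or x[test[0]] == test[1]:
            return label
```

Because each added rule removes at least one example, the loop terminates after at most |examples| iterations, which is the source of the efficiency of proper PAC-learning for this class.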

We first abandon the assumption of a deterministic *target concept* and assume that the target class depends on the observation probabilistically. We consider an agent learning such a probabilistic dependence. We immediately generalize this setting to one with no fixed class variable. Here, the agent is requested to predict the most probable values of arbitrary missing components of observations given the values of the observed components (this task is more general than the former because any missing observation component can “play the role” of the class variable). To accomplish this, the agent needs to learn a probability distribution from samples of it with missing values. This task is not tractable in general but we will consider the assumption of *conditional independence* between random variables (corresponding to the observation components) to lower the task complexity. We will introduce the framework of *Bayes Networks* which can model arbitrary probability distributions of discrete random variables, leveraging conditional independencies among variables. We will discuss the *d-separation* concept by which such independencies are inferred from a Bayes network.
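The factorization a Bayes network encodes, $P(x_1,\ldots,x_n) = \prod_i P(x_i \mid \mathrm{parents}(x_i))$, can be sketched for binary variables as follows (the dict-based network representation is a toy assumption):

```python
def joint_probability(bn, assignment):
    """Probability of a full assignment under a Bayes network.
    bn: dict var -> (parents, cpt), where cpt maps a tuple of parent values
    to P(var=1 | parents). All variables are binary 0/1."""
    p = 1.0
    for var, (parents, cpt) in bn.items():
        p1 = cpt[tuple(assignment[pa] for pa in parents)]
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p
```

A usage example with a three-node network Rain -> Sprinkler, (Rain, Sprinkler) -> Wet: the eight joint probabilities sum to 1, illustrating that the factorization defines a proper distribution.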

Chapter 14 of AIMA (or Chapter 13 in the 4th edition under the link) is good supplementary material, although it does not cover d-separation. Bayes Networks are one example of a wider class of graphical probabilistic models which, unlike more conventional statistical models, are notable for their interpretable (“symbolic”) structure.

We will first observe that naive computation of probabilities from a Bayes Net involves redundant computation, and we will present a method based on *factors* that removes these redundancies, making inference faster (although the worst-case complexity remains exponential). Then we will study a method for *MAP inference*, i.e., for determining the most probable joint state of unobserved variables given the observed variables, without evaluating the probabilities of all possible joint states. The method is also based on factors. Afterwards we will see how the Bayes network parameters (i.e., the conditional probability tables) can be learned from observations when the Bayes graph is given. Finally, we will briefly discuss some extensions of Bayes networks and the field of statistical relational learning, which combines the expressiveness of FOL with the probabilistic reasoning capabilities of graphical probabilistic models.
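The two factor operations underlying variable elimination can be sketched as follows (a toy implementation for binary variables; class and method names are illustrative assumptions):

```python
from itertools import product

class Factor:
    """A factor: a table mapping assignments of binary variables to values."""
    def __init__(self, vars_, table):
        self.vars, self.table = vars_, table    # table: tuple of 0/1 -> float

    def multiply(self, other):
        """Pointwise product over the union of the two variable sets."""
        vars_ = self.vars + tuple(v for v in other.vars if v not in self.vars)
        table = {}
        for vals in product((0, 1), repeat=len(vars_)):
            a = dict(zip(vars_, vals))
            table[vals] = (self.table[tuple(a[v] for v in self.vars)]
                           * other.table[tuple(a[v] for v in other.vars)])
        return Factor(vars_, table)

    def sum_out(self, var):
        """Marginalize a variable away by summing over its values."""
        i = self.vars.index(var)
        vars_ = self.vars[:i] + self.vars[i + 1:]
        table = {}
        for vals, p in self.table.items():
            key = vals[:i] + vals[i + 1:]
            table[key] = table.get(key, 0.0) + p
        return Factor(vars_, table)
```

For example, multiplying a prior factor for $A$ with a conditional factor for $B$ given $A$ and summing out $A$ yields the marginal of $B$; variable elimination repeats exactly these two steps in a chosen ordering.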

Chapter 14 of AIMA (or Chapter 13 in the 4th edition under the link) is relevant to this lecture although it does not cover the MAP inference method explained in the lecture. FOL extensions of graphical probabilistic models are addressed in Chapter 8 of LRL.

We will start our investigation of *reinforcement learning* in which an agent has to learn to maximize rewards in an environment which is *sequential* in the sense that observations as well as rewards depend on the previous history of agent-environment interaction. Observations will capture the environment *states*, and we will make the *Markovian* assumption that the current state depends only on the previous one and the action taken in it (with a state *transition* probability), and that the set of states is finite. In this setting, it will be also natural to assume that rewards are a function of states. We will first look into how to compute the optimal policy if the transition probabilities and the reward function are known. To this end, we will introduce the notion of *state utility*. Afterwards we will discuss how to achieve a good policy if these two elements are unknown. We will face the exploration-exploitation dilemma we already considered in Lecture 1 and adapt from it a strategy making random explorative actions with decaying probability; this is called a *GLIE* strategy.
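With known transition probabilities and state rewards, the optimal state utilities can be computed by value iteration, sketched here (the representation of $P$ and $R$, and attaching rewards to states as in the lecture, are assumptions for illustration):

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Value iteration for a known MDP.
    P[s][a]: list of (next_state, prob); R[s]: reward of state s.
    Returns state utilities and a greedy policy."""
    U = {s: 0.0 for s in states}
    while True:
        # Bellman update: current reward plus best discounted expected utility
        U_new = {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in P[s][a])
                                       for a in actions)
                 for s in states}
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            break
        U = U_new
    policy = {s: max(actions, key=lambda a: sum(p * U[s2] for s2, p in P[s][a]))
              for s in states}
    return U, policy
```

A usage example: in a two-state MDP where only state b yields reward and a "go" action moves from a to b, with $\gamma = 0.5$ the fixed point gives $U(b) = 2$, $U(a) = 1$, and the greedy policy chooses "go" in a.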

Chapters 17 and 21 of AIMA (Chapters 17 and 22 in the 4th edition under the link) cover the material presented. States and actions in AIMA are denoted by $s$ and $a$, respectively; we use the letters $x$ and $y$ in coherence with the previous lectures. RL is a more extensive introduction to reinforcement learning.

We will consider a heuristic alternative to GLIE where under-explored state-action pairs are made more attractive to the agent to force their exploration. At this point, we will have covered all components needed to implement the *adaptive dynamic programming* (ADP) agent for reinforcement learning. We will consider two adaptations of ADP that make it more responsive; these involve the respective techniques of *direct utility estimation* (DUE) and *temporal difference learning* (TD). In turn, we will explore an approach that does not involve state utilities but is based on utilities of *state-action pairs*. We will consider two variants of this approach, called *Q-learning* and *SARSA*. Next, we attend to the issue of how to represent, store, and learn state or state-action utility estimates when there are many states and/or actions.
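A tabular Q-learning agent with epsilon-greedy exploration might be sketched as follows (a simplified illustration; the parameter names and the fixed exploration rate, rather than a GLIE decay, are assumptions):

```python
import random

def q_learning(env_step, states, actions, episodes=500, alpha=0.1,
               gamma=0.9, eps_greedy=0.2, start=None, horizon=20):
    """Tabular Q-learning sketch. env_step(s, a) -> (next_state, reward) is the
    environment, unknown to the agent, which only queries it by acting."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            # epsilon-greedy action choice
            a = (random.choice(actions) if random.random() < eps_greedy
                 else max(actions, key=lambda a_: Q[(s, a_)]))
            s2, r = env_step(s, a)
            # temporal-difference update toward the off-policy target
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a_)] for a_ in actions)
                                  - Q[(s, a)])
            s = s2
    return Q
```

Replacing the max over next-state actions with the Q-value of the action actually taken next would turn this off-policy update into the on-policy SARSA variant.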

Appropriate supplementary material is as in the previous lecture.

We first finish the reinforcement learning chapter by briefly discussing the *policy search* method, where the policy, i.e. an $X \rightarrow Y$ mapping, is optimized directly from observation data, and the *Bayesian* approach to learning an environment model. We then abandon the special assumptions on the probabilistic environment description we have adopted so far, and revisit the general case where percepts depend probabilistically on the entire history of agent-environment interaction. For a start, we concentrate on a simple scenario without actions and rewards, where the goal is simply to predict the next element of a given (binary) sequence of observations. We explore the hypothesis that good predictions are those which can be computed by simple programs for the universal Turing machine. We will face the obstacle that there is no algorithm to determine the length of the shortest program computing a given sequence, i.e. the Kolmogorov complexity of the sequence.

The book UAI deals with the topics of this lecture. KC is a very extensive treatment of Kolmogorov complexity and its applications.

We will show that Kolmogorov complexity, while not computable, is co-enumerable. We will define the Solomonoff universal prior $M$ which assigns high probability to sequences produced by short programs. We shall show that $M$ predicts the next element of any computable sequence given a prefix of it with accuracy approaching 1 as the prefix length grows. Interestingly, $M$ is equivalent (up to a constant multiplier) to a Bayesian mixture involving all enumerable probability (semi-)measures, with weights exponentially decreasing with their Kolmogorov complexity. This form of $M$ reveals that $M$ combines the principle of multiple explanations of observed phenomena proposed by the Greek philosopher Epicurus with the principle of the simplest explanation advocated by the English philosopher William of Ockham through the mathematical principle devised by Thomas Bayes.
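In standard notation (a sketch of the two equivalent forms; here $U$ denotes a universal monotone Turing machine, not the utility from earlier lectures):

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)} ,
\qquad
M(x) \;\stackrel{\times}{=}\; \sum_{\nu \in \mathcal{M}} 2^{-K(\nu)}\, \nu(x) ,
```

where the first sum ranges over (minimal) programs $p$ whose output starts with $x$ and $\ell(p)$ is the program length, while the second form is the Bayesian mixture over the class $\mathcal{M}$ of enumerable semimeasures, weighted by their Kolmogorov complexity $K(\nu)$.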

Appropriate supplementary material is as in the previous lecture.

courses/smu/lectures.txt · Last modified: 2021/05/18 17:02 by zelezny