The course will be held on Mondays at 11:00am-12:30pm (lecture) and 12:45pm-2:15pm (tutorial) from February 16 to May 18 (with no class on April 6), in room KN:A-427.
Lecturers: Allen Gehret (gehreall@fel.cvut.cz) and Jakub Marecek (jakub.marecek@fel.cvut.cz)
Tutors: Adam Bosak (adam.bosak@fel.cvut.cz) and Andrii Kliachkin (kliacand@fel.cvut.cz)
Overall, you can collect up to 100 points with the usual conversion to A-F grades (<50 = F, 50-59 = E, …, 90-100 = A).
You can gain points for working on homework (up to 70 pts) and doing well in the final exam (up to 30 pts).
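The point-to-grade conversion above can be sketched as a small function; the function name and the exact boundary handling are our assumptions for illustration only.

```python
# Hypothetical sketch of the grading scheme described above:
# <50 = F, 50-59 = E, 60-69 = D, 70-79 = C, 80-89 = B, 90-100 = A.
def grade(points: int) -> str:
    """Map a point total (0-100) to an A-F grade."""
    if points < 50:
        return "F"
    if points >= 90:
        return "A"
    # 50-59 = E, 60-69 = D, 70-79 = C, 80-89 = B
    return {5: "E", 6: "D", 7: "C", 8: "B"}[points // 10]

print(grade(45), grade(55), grade(95))  # F E A
```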
There will be four homework assignments, handed out at regular intervals (for instance, in Weeks 3, 6, 9, and 11), with at least two weeks to complete and submit each one. In order to obtain the “zapocet”, you must hand in at least two assignments.
The final exam will count for 30% of the grade and will take place after the final lecture (specific details to be announced later). It will be in-person and closed-book (no notes or computers allowed).
We do not discourage LLM use as a tool to aid curiosity and deepen understanding. We discourage LLM use as a tool for abdicating critical thinking and accountability. Please note:
If you use an LLM, treat it like a calculator; disclose any LLM use and always verify and cite sources.
We will assume a familiarity with linear algebra (e.g., matrices, vector spaces, linear transformations, dimension, etc.) as well as with (multivariable) calculus (e.g., limits, continuity, derivatives, gradients, Jacobians, etc.), although we will review any of these concepts as needed.
We also assume a general familiarity with “discrete mathematics”, e.g., basic operations with functions and sets (composition, union, intersection, etc.), propositional logic (and, or, not, implies, if and only if), the standard number systems (natural numbers, integers, rational numbers, real numbers), proofs by induction, definitions by recursion, and understanding and using quantifiers (“for all x” and “there exists x”).
We do not assume prior knowledge of optimization.
We will post regularly updated lecture notes as the course progresses. The course will be an expanded version of the expository survey:
Deep learning as the disciplined construction of tame objects (Bareilles, Gehret, Aspman, Lepsova, Marecek; arXiv 2025)
The overarching goal of the course is to understand a deterministic (non-stochastic) version of the convergence of the Stochastic Subgradient Method (SSM) in the nonsmooth, nonconvex, o-minimal setting, as established in:
Stochastic subgradient method converges on tame functions (Davis, Drusvyatskiy, Kakade, Lee; Foundations of Computational Mathematics, 2020)
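As a first taste of the method whose convergence the course builds toward, here is a minimal sketch of the deterministic subgradient method on the nonsmooth convex function f(x) = |x|; the diminishing step-size schedule and the specific test function are illustrative choices on our part, not taken from the paper.

```python
# Minimal sketch of the (deterministic) subgradient method on
# f(x) = |x|, which is nonsmooth at its minimizer x = 0.
# Step sizes alpha_k = 1/(k+1) form a standard diminishing schedule.

def subgradient_abs(x):
    """Return a subgradient of f(x) = |x| (any element of [-1, 1] works at 0)."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0

x = 5.0  # arbitrary starting point
for k in range(1000):
    x -= subgradient_abs(x) / (k + 1)

print(abs(x) < 0.1)  # iterates approach the minimizer x = 0
```

Because f is not differentiable at 0, gradient descent is not directly defined there; the subgradient method sidesteps this by allowing any subgradient, at the cost of requiring diminishing step sizes for convergence.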
In order to do this, we need to understand the adjectives “o-minimal” and “tame” above. For this, our main reference is:
Tame topology and o-minimal structures (van den Dries; London Mathematical Society Lecture Note Series, 1998)
The following is an approximate schedule of topics for the course: