====== Lab 3 - MLE, Computational graph and Backpropagation ======

In this lab we test your knowledge of the math behind neural networks. These simple exercises are applications of the theory from the {{ :courses:b3b33vir:lectures:neural_nets.pdf | Lecture}} (slides 10-62). The tutorial solution to the first two exercises is {{ :courses:b3b33vir:tutorials:3.lab.pdf |here}}. We recommend consulting the corresponding parts of the lecture for a better overview and intuition behind these exercises. Another (more graphical) source on the learning mechanism, which provides a step-by-step description of the feed-forward and backward pass, is [[https://hmkcode.com/ai/backpropagation-step-by-step/|here]].

You are asked to write Python code to validate your results. It takes only a few lines and tests your understanding. You can also change the initial values and the dimensions of the parameters to verify your ability to solve different problems. This part is important because it verifies your ability to apply the theory and makes you aware of what you are doing in the following parts of the course. If you are not sure about the PyTorch syntax, look at the [[https://pytorch.org/docs/stable/torch.html?highlight=torch|PyTorch documentation]] and search for the relevant modules.

==== Simple Neural Network ====

You are given the following neural network model parametrized by weight vector **w**. The model takes an input vector **x** and outputs y:

$$ y = \sin(\textbf{w}^T~\textbf{x}) - b $$

where:

$$\textbf{x} = [2, 1] , ~\textbf{w} = [\pi/2, \pi] ,~ b = 0 ,~ \tilde{y} = 2$$

1) Draw the computational graph of the forward pass of this small neural network. \\
2) Compute the feed-forward pass with the initial weights **w** and the input feature vector **x**. \\
3) Calculate the gradient of the output y with respect to **w**, i.e. $\frac{\partial y}{\partial \textbf{w}}$. \\
4) Use the $L_2$ loss (mean squared error) to compute the loss value between the forward prediction y and the label $\tilde{y}$. Add the loss into the computational graph. \\
5) Use the chain rule to compute the gradient $\frac{\partial L}{\partial \textbf{w}}$ and update the weights with learning rate $\alpha = 0.5$. \\

A possible solution sketch is shown after the following template.

<code python>
import torch
import numpy as np

### Define initial parameters
# w =
# x =
# b =
# y_label =

"""
Note: Think about the dimensions of the initial parameters and the order of operations
"""

# model forward pass: use torch.sin() and w @ x  ---> dot product @
# y =

# calculate loss and make backward pass
# L2 =

"""
Note: Beware of backward passes when calculating them for both y and L. You need to do it separately.
"""

# Visualize the gradient of L2 w.r.t. w: use L2.backward(), then you can see the gradient in w.grad
</code>
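For reference, here is a minimal sketch of one possible completion of the template above. The variable names, the printed checks and the use of ''retain_graph=True'' are our own choices, not the only valid approach; the numeric values follow the assignment.

<code python>
import torch
import numpy as np

# initial parameters from the assignment
w = torch.tensor([np.pi / 2, np.pi], requires_grad=True)
x = torch.tensor([2.0, 1.0])
b = 0.0
y_label = torch.tensor(2.0)
alpha = 0.5  # learning rate

# forward pass: y = sin(w^T x) - b
y = torch.sin(w @ x) - b
print('y =', y.item())          # sin(2*pi) = 0 (up to float rounding)

# gradient dy/dw; retain the graph so the loss can be backpropagated later
y.backward(retain_graph=True)
print('dy/dw =', w.grad)        # cos(w^T x) * x = [2, 1]

# L2 loss between prediction and label, then dL/dw
w.grad.zero_()                  # clear the gradient from the previous backward pass
L2 = (y - y_label) ** 2
L2.backward()
print('dL/dw =', w.grad)        # 2 * (y - y_label) * cos(w^T x) * x = [-8, -4]

# one gradient-descent step with learning rate alpha
with torch.no_grad():
    w -= alpha * w.grad
print('updated w =', w)
</code>

The printed gradients can be compared against your hand-computed results from tasks 3 and 5. Note that the graph must be kept alive (''retain_graph=True'') if you backpropagate through y before building the loss, which is what the note in the template warns about.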
==== Maximum Likelihood Estimate ====

{{ :courses:b3b33vir:tutorials:screenshot_2021-10-01_at_13.49.12.png?600 |}}

You are given the Gaussian probability distribution model
$$p(y|\mathbf{x},\mathbf{w}) = K\cdot\exp(-(y-f(\mathbf{x},\mathbf{w}))^2),$$
which models the probability of observing variable $y\in\mathbb{R}$, given measurement $\mathbf{x}\in\mathbb{R}$. The shape of the probability distribution is determined by the (unknown) parameters $\mathbf{w}_0, \mathbf{w}_1\in\mathbb{R}$ of the non-linear function
$$f(\mathbf{x},\mathbf{w}) = \frac{1}{1+e^{-(\mathbf{w}_0 \mathbf{x} + \mathbf{w}_1)}}.$$
You are given a training set $\mathcal{D} = \{(\mathbf{x}_1, y_1)\dots (\mathbf{x}_N, y_N)\}$.

  - Write down the optimization problem which corresponds to the maximum likelihood estimate of the unknown parameters $\mathbf{w}$ and simplify the resulting loss if possible.
  - Download the {{ :courses:b3b33vir:tutorials:mle_regression.zip |template}} and fill in the learning loop (loss function, gradient ''loss.backward()'' and weight update rule).
  - Find a reasonable learning rate. What happens when the learning rate is too big / too small?
  - Is the least-squares formulation (LSQ) equivalent to the maximum likelihood formulation (MLE)? What is not equivalent?
  - What are the necessary assumptions which allow for the MLE and LSQ formulations?

A possible completion of the learning loop is sketched after the template below.

<code python>
import torch
import matplotlib.pyplot as plt
import numpy as np

# load points
N = 5
pts = np.load('pts.npy')
pts = torch.tensor(pts)

# define optimization variables
w = torch.tensor([-2, 2], requires_grad=True, dtype=torch.double)
learning_rate = ...  # (task 3) find a reasonable value

for i in range(30):
    # OPTIMIZE WEIGHTS ...
    # (1) define loss
    # (2) compute gradient loss.backward()
    # (3) update weights

    loss = ...
    loss.backward()

    with torch.no_grad():
        w -= learning_rate * w.grad
        w.grad.zero_()

# visualize result
PTS = pts.detach().numpy()  # convert to numpy
W = w.detach().numpy()
T = torch.linspace(-1, 1, 50).numpy()

plt.figure(1), plt.clf()
plt.plot(PTS[:, 0], PTS[:, 1], markersize=10, marker='x', color='r', linestyle='None')
plt.plot(T, 1 / (1 + np.exp(-(W[0] * T + W[1]))), color='green')  # plot the fitted model f(x, w)
plt.xlabel('x')
plt.ylabel('y')
plt.pause(0.01)
plt.draw()
</code>
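As a hint for the loop itself, note that the negative log-likelihood of the training set is $-N\log K + \sum_i (y_i - f(\mathbf{x}_i,\mathbf{w}))^2$; since $K$ does not depend on $\mathbf{w}$, maximizing the likelihood amounts to minimizing the sum of squared residuals. The sketch below is one possible completion under that reduction; it assumes ''pts.npy'' holds rows $(x_i, y_i)$ (consistent with the plotting code above), and the learning rate is an illustrative choice, not a prescribed value.

<code python>
import torch
import numpy as np

# one possible completion of the learning loop (sketch)
pts = torch.tensor(np.load('pts.npy'))   # assumed shape (N, 2): columns x_i, y_i
x, y = pts[:, 0], pts[:, 1]

w = torch.tensor([-2, 2], requires_grad=True, dtype=torch.double)
learning_rate = 0.1  # illustrative value; experiment as asked in task 3

for i in range(30):
    # negative log-likelihood of the Gaussian model; the normalization K does not
    # depend on w, so it reduces to the sum of squared residuals
    f = 1 / (1 + torch.exp(-(w[0] * x + w[1])))
    loss = torch.sum((y - f) ** 2)

    loss.backward()
    with torch.no_grad():
        w -= learning_rate * w.grad
        w.grad.zero_()

print('estimated w:', w.detach().numpy())
</code>

The MLE and LSQ problems share the same minimizer here because the dropped constant does not depend on $\mathbf{w}$; the objective values themselves differ, and the reduction relies on the noise being Gaussian with fixed variance and the samples being i.i.d.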