Christophe Cerisara - 2023-2024
MCQs in class (coef 1)
mini-project (coef 2)
send to cerisara@loria.fr
Questions?
Machine Learning
Deep Learning
“Deep” model
def f(x):
    if x > C: x *= 2
    else: x *= 3
    return x
Training
data= (3,10)
err = (f(3)-10)**2
err.backward() # compute the derivative
C -= 0.1*C.grad # decrease the error
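The snippet above is pseudo-code (a branch on x > C carries no gradient with respect to C). A minimal runnable sketch of the same training step, using PyTorch autograd and, as an assumption, the differentiable toy model f(x) = C*x in place of the if/else:

import torch

# learnable parameter, initialised arbitrarily
C = torch.tensor(1.0, requires_grad=True)

def f(x):
    return C * x              # differentiable toy model (assumption)

x, target = 3.0, 10.0         # the data point (3, 10) from the slide
for step in range(20):
    err = (f(x) - target) ** 2     # squared error
    err.backward()                 # compute d err / d C
    with torch.no_grad():
        C -= 0.1 * C.grad          # gradient step decreases the error
        C.grad.zero_()
print(C.item())                    # converges towards 10/3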
To draw a sinusoid in JavaScript, you can use the Canvas API. Here's an example of how to do it:
// Get the canvas element
var canvas = document.getElementById("myCanvas");
var ctx = canvas.getContext("2d");
// Define the sinusoid parameters
var amplitude = 100;
var frequency = 1;
var phase = 0;
var x = 0;
// Draw the sinusoid
for (var i = 0; i < canvas.width; i++) {
    var y = amplitude * Math.sin(frequency * i + phase);
    ctx.beginPath();
    ctx.moveTo(x, y);
    ctx.lineTo(x + 1, y);
    ctx.stroke();
    x++;
}
This code will draw a sinusoid with the specified amplitude, frequency, and phase on the canvas.
“Train 1 departs at 6 AM from city A and arrives in city B at 11 AM. Train 2 departs at 7:30 AM from city B and arrives in city A at 11 AM. When will both trains meet?”
from sympy import Symbol

train1_travel_time = 11 - 6
# 7:30 AM refers to 7.5
train2_travel_time = 11-7.5
dist_a_b = Symbol("D",positive=True)
train1_speed = dist_a_b / train1_travel_time
train2_speed = dist_a_b / train2_travel_time
ans = dist_a_b / (train1_speed + train2_speed)
Use numbers and basic arithmetic operations (+ - * /) to obtain 24. When considering the next steps, do not choose operations that will result in a negative or fractional number. In order to help with the calculations, the numbers in the parenthesis represent the numbers that are left after the operations and they are in descending order...
User:
11 7 4 1
Assistant:
Let’s consider the most promising first steps:
1. 7 * 4: (28, 11, 1)
2. 7 + 1: (11, 8, 4)
3. 4 + 1: (11, 7, 5)
4. 11 - 4: (7, 3, 1)
Let’s consider the most promising second steps:
1. (28, 11, 1)
- 28 - 11: (17, 1)
- 28 - 1: (27, 11)
- 11 + 1: (29, 28)
- 11 - 1: (28, 10)
2. (11, 8, 4)
- 11 + 8: (19, 4)
- 8 + 4: (12, 11)
- 11 - 8: (4, 3)
- 8 - 4: (7, 11)
3. (11, 7, 5)
- 11 - 5: (7, 6)
- 7 - 5: (11, 2)
- 7 * 5: (35, 11)
- 11 + 5: (16, 7)
4. (7, 3, 1)
- 7 - 3: (4, 1)
- 7 * 3: (21, 1)
- 3 + 1: (7, 4)
- 7 - 1: (6, 3)
Let’s consider the most promising third steps:
1. 7 * 4 = 28
- 28 - 11: (17, 1): 18, 16, 17, 17
- 28 - 1: (27, 11): 38, 16, 297, 2.45
- 11 + 1: (29, 28): 57, 1, 812, 1.03
- 11 - 1: (28, 10): 38, 18, 280, 2.8
2. 7 + 1 = 8
- 11 + 8: (19, 4): 23, 15, 76, 4.75
- 8 + 4: (12, 11): 23, 7, 132, 3
- 11 - 8: (4, 3): 7, 1, 12, 1.33
- 8 - 4: (7, 11): 18, 4, 77, 1.75
3. 4 + 1 = 5
- 11 - 5: (7, 6): 13, 1, 42, 1.17
- 7 - 5: (11, 2): 13, 9, 22, 5.5
- 7 * 5: (35, 11): 46, 24 = 35 - 11 -> found it!
Backtracking the solution:
Step 1:
4 + 1 = 5
Step 2:
7 * 5 = 35
Step 3:
35 - 11 = 24
Considering these steps: 24 = 35 - 11 = (7 * 5) - 11 = (7 * (4 + 1)) - 11 = 24.
answer: (7 * (4 + 1)) - 11 = 24.
Basic archi: for a classification task with \(N\) labels, you stack:
Training:
Special output cases:
Special input features:
Naive DNN overfits \(\rightarrow\) we want to reduce variance
This can be done by adding a term to the loss that limits the space where parameters can live:
L2 can be interpreted as adding a Gaussian prior:
see p. 324 of Larry Bayesian Inference book
Assume simple linear model \(y=\beta x + \epsilon\)
So cond. likelihood is \(p(y|x,\beta) = \prod_t N(y_t;\beta x_t,\sigma^2)\)
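To make the link explicit (a standard derivation; the zero-mean Gaussian prior \(\beta \sim N(0,\tau^2)\) is the assumption that yields L2):
\[\hat\beta_{MAP} = \arg\max_\beta \left[ \sum_t \log N(y_t;\beta x_t,\sigma^2) + \log N(\beta;0,\tau^2) \right] = \arg\min_\beta \left[ \frac 1 {2\sigma^2} \sum_t (y_t-\beta x_t)^2 + \frac {\beta^2}{2\tau^2} \right]\]
so the MAP estimate is least squares plus an L2 penalty \(\lambda \beta^2\) with \(\lambda = \sigma^2/\tau^2\).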
data augmentation: Replicate dataset by adding \(\lambda\)% noise
SGD: Tune mini-batch size
Why is initialization important?
So we want the variance of the inputs and outputs to be similar !
For each neuron with \(n_{in}\) inputs:
\[Var(W)=\frac 1 {n_{in}}\]
Refs: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
For each neuron with \(n_{in}\) inputs, going into \(n_{out}\) neurons:
\[Var(W)=\frac 2 {n_{in} + n_{out}}\]
http://pytorch.org/docs/master/nn.html?highlight=torch%20nn%20init#torch.nn.init.xavier_normal
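A minimal sketch of applying this initialization in PyTorch (the layer sizes are placeholders; torch.nn.init.xavier_normal_ implements the \(2/(n_{in}+n_{out})\) variance above):

import torch.nn as nn

layer = nn.Linear(256, 128)
nn.init.xavier_normal_(layer.weight)   # Var(W) = 2 / (n_in + n_out)
nn.init.zeros_(layer.bias)             # biases usually start at 0
print(layer.weight.var().item())       # close to 2 / (256 + 128) ≈ 0.0052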
“simple” MLP?
Andrew Ng: “If you tune only one thing, tune the learning rate!”
Autodiff = set of techniques to numerically evaluate the derivative of a function specified by a program
Example of a computation graph:
Every gate in the circuit computes its local gradient with respect to its output value.
Example 1: autograd in PyTorch
import torch
import torch.nn as nn

x = torch.ones(2, 2)
x.requires_grad = True     # track operations on x to build the graph
y = x + 2
z = y * y * 3
out = z.mean()             # out = 3 * mean((x+2)^2)
out.backward()             # back-propagate through the graph
print(x.grad)              # d out / d x = 6*(x+2)/4 = 4.5 for each element
Example 2: implementing backprop in NumPy:
https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#warm-up-numpy
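A minimal sketch in the spirit of the linked tutorial (a two-layer ReLU network on random data; the layer sizes and learning rate are arbitrary choices for illustration):

import numpy as np

# random data: 64 samples, 100 inputs, 10 outputs
N, D_in, H, D_out = 64, 100, 50, 10
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1, w2 = np.random.randn(D_in, H), np.random.randn(H, D_out)

lr = 1e-6
for step in range(500):
    # forward pass
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    # backward pass: apply the chain rule by hand
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu * (h > 0)     # derivative of ReLU
    grad_w1 = x.T.dot(grad_h)

    # gradient step
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2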
import numpy as np
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, nins, nout):
        super(Net, self).__init__()
        self.nins = nins
        self.nout = nout
        nhid = int((nins+nout)/2)
        self.hidden = nn.Linear(nins, nhid)
        self.out = nn.Linear(nhid, nout)

    def forward(self, x):
        x = torch.tanh(self.hidden(x))
        x = self.out(x)
        return x

def test(model, data, target):
    x = torch.FloatTensor(data)
    y = model(x).data.numpy()
    haty = np.argmax(y, axis=1)
    nok = sum([1 for i in range(len(target)) if target[i]==haty[i]])
    acc = float(nok)/float(len(target))
    return acc

def train(model, data, target):
    optim = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    x = torch.FloatTensor(data)
    y = torch.LongTensor(target)
    for epoch in range(100):
        optim.zero_grad()
        haty = model(x)
        loss = criterion(haty, y)
        acc = test(model, data, target)
        print(str(loss.item())+" "+str(acc))
        loss.backward()
        optim.step()

def genData(nins, nsamps):
    prior0 = 0.7
    mean0 = 0.3
    var0 = 0.1
    mean1 = 0.8
    var1 = 0.01
    n0 = int(nsamps*prior0)
    x0 = var0 * np.random.randn(n0, nins) + mean0
    x1 = var1 * np.random.randn(nsamps-n0, nins) + mean1
    x = np.concatenate((x0, x1), axis=0)
    y = np.ones((nsamps,), dtype='int64')
    y[:n0] = 0
    return x, y

def toytest():
    model = Net(100, 10)
    x, y = genData(100, 1000)
    idx = np.arange(len(x))
    np.random.shuffle(idx)
    train(model, x[idx], y[idx])

if __name__ == "__main__":
    toytest()
What kind of model is this? Which non-linearity is used?
How is the data distributed?
Is a stochastic gradient descent used?
Do we need to shuffle the training data from one epoch to another?
Run it and plot the curves for the loss and for the accuracy
Adapt the code to create a test corpus with the same method as for the training corpus and evaluate your model
import numpy as np
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, nins, nout):
        super(Net, self).__init__()
        self.nins = nins
        self.nout = nout
        nhid = int((nins+nout)/2)
        self.hidden = nn.Linear(nins, nhid)
        self.out = nn.Linear(nhid, nout)

    def forward(self, x):
        x = torch.tanh(self.hidden(x))
        x = self.out(x)
        return x

def test(model, data, target):
    x = torch.FloatTensor(data)
    y = model(x).data.numpy()
    haty = np.argmax(y, axis=1)
    nok = sum([1 for i in range(len(target)) if target[i]==haty[i]])
    acc = float(nok)/float(len(target))
    return acc

def train(model, data, target, devx, devy):
    optim = torch.optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    x = torch.FloatTensor(data)
    y = torch.LongTensor(target)
    idx = np.arange(x.size(0))
    for epoch in range(1000):
        np.random.shuffle(idx)
        optim.zero_grad()
        haty = model(x[idx])
        loss = criterion(haty, y[idx])
        acc = test(model, devx, devy)
        print(str(loss.item())+" "+str(acc))
        loss.backward()
        optim.step()

def genData(nins, nsamps, mean0, mean1):
    prior0 = 0.7
    var0 = 0.1
    var1 = 0.01
    n0 = int(nsamps*prior0)
    x0 = var0 * np.random.randn(n0, nins) + mean0
    x1 = var1 * np.random.randn(nsamps-n0, nins) + mean1
    x = np.concatenate((x0, x1), axis=0)
    y = np.ones((nsamps,), dtype='int64')
    y[:n0] = 0
    return x, y

def toytest():
    model = Net(100, 10)
    x, y = genData(100, 10, 0.3, 0.8)
    devx, devy = genData(100, 100, 0.32, 0.8)
    tex, tey = genData(100, 1000, 0.34, 0.8)
    train(model, x, y, devx, devy)

if __name__ == "__main__":
    toytest()
"""
python td1.py > tt
set y2tics
plot "tt" u 1 w l axis x1y1 t "train loss", "tt" u 2 w l axis x1y2 t "dev acc"
overfitting: lr=0.01, 10 training samples, 1000 epochs
L2 regul: weight_decay=1e-3
"""
Let’s consider a DNN \(f(x_0)\), with a FF layer somewhere inside: \[y=relu(Wx)\]
MSE can also be used in classification:
with softmax and \(N\) classes: \[p(y_i|x) \simeq \frac {e^{\hat y_i}}{\sum_j e^{\hat y_j}}\]
assume 2 classes, 2 neurons; let \(i=\arg\max_{j\in \{0,1\}} y_j\) \[l(\hat y,y) = 2(1-y_i)^2\]
Loss: Number of errors: \(1- \delta(y,sgn(f(x)))\)
Finds a hyperplane that separates the training data
The algorithm converges if the training data is separable (Novikoff, 1962)
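A minimal NumPy sketch of the perceptron update rule (the random separable data is an illustrative assumption):

import numpy as np

# toy separable data: labels in {-1, +1}
X = np.vstack([np.random.randn(50, 2) + 2, np.random.randn(50, 2) - 2])
y = np.concatenate([np.ones(50), -np.ones(50)])

w, b = np.zeros(2), 0.0
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w.dot(xi) + b) <= 0:      # misclassified point
            w += yi * xi                   # perceptron update
            b += yi
            errors += 1
    if errors == 0:                        # converged: data separated
        break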
\(w\) is normal to the hyper-plane
distance \(r\) of any \(x\): \[w^T \left(x-yr\frac w {||w||} \right) + b=0\]
Solve for \(r\): \(r=\frac{y(w^Tx+b)}{||w||}\)
We choose a scaling of \(w\) and \(b\) s.t. \[y_i(w^Tx_i+b)\geq 1\]
with at least one equality, margin= \(\rho = \frac 2 {||w||}\)
Find \(\theta\) that minimizes \(\frac 1 2 || \theta ||\) s.t. \[y_i X_i\theta \geq 1 ~~~ \forall i\]
= Quadratic programming problem
Gives a unique hyperplane as far as possible from training examples
The solution only depends on a subset of training examples that appear on the margin = the support vectors
Limits:
if training data not separable, the hyperplane does not exist
even a single example can radically change the position of the hyperplane
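A hedged sketch of the max-margin classifier above using scikit-learn (the linear kernel and the toy data are assumptions; the soft-margin parameter C also addresses the non-separable case listed in the limits):

import numpy as np
from sklearn.svm import SVC

X = np.vstack([np.random.randn(50, 2) + 2, np.random.randn(50, 2) - 2])
y = np.concatenate([np.ones(50), np.zeros(50)])

clf = SVC(kernel="linear", C=1.0)    # large C ~ hard margin
clf.fit(X, y)
print(clf.support_vectors_)          # the examples lying on the margin
print(clf.coef_, clf.intercept_)     # w and b of the separating hyperplane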
Example:
Gaussian model for \(P(X|Y)\):
Assume we don’t know any label \(Y\).
All we can do is fit a model on \(X\):
\[\hat\theta = \arg\max_\theta \prod_t P(X_t|\theta)\]
\[P(X|\theta)=\sum_y P(X,Y=y|\theta) = \sum_y P(X|Y=y,\theta)P(Y=y)\]
Assume \(P(Y)\) uniform: we only need to maximize \[\prod_t \sum_i \frac 1 {\sqrt{2\pi}\sigma_i}e^{-\frac 1 2 \left( \frac{X_t-\mu_i}{\sigma_i} \right)^2}\]
Linear model: \[\hat y = w x + b\]
We want a probability: logistic regression: \[P(Y=1|X) = \frac 1 {1+e^{-(wx+b)}}\]
\[P(X|\theta)=\sum_y P(X,Y=y|\theta) = \sum_y P(Y=y|X,\theta)P(X)\]
Getting \(P(X)\) out of the sum: \[P(X|\theta)=P(X)\sum_y P(Y=y|X,\theta)=P(X)\] since the class posteriors sum to 1.
The observation likelihood does not depend on the parameters !
So it is impossible to maximize it…
How many folds ?
Best option: further split the training set into 2 subsets: training + development
Alternative: nested cross-validation
For each test fold i:
  For each possible value of the hyper-parameters:
    For each dev fold j != i:
      - train on the remaining folds
      - evaluate on dev fold j
    Average the results over all dev folds
  Pick the best hyper-parameters
  Retrain the model on all folds but i
  Evaluate on test fold i
Average the results over the test folds
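A sketch of this nested cross-validation with scikit-learn (the estimator, the hyper-parameter grid and the fold counts are illustrative assumptions):

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = np.random.randn(200, 10), np.random.randint(0, 2, 200)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)   # dev folds
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # test folds

# inner loop: pick the best hyper-parameters on the dev folds
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# outer loop: retrain with the selected hyper-parameters and test on fold i
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())     # average over the test folds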
Standard classification metric: accuracy \[ acc = \frac{N_{OK}} {N_{tot}}\]
Question: on the same test set,
with \(n=\) nb of samples in the test corpus.
Example with 1000 examples in test:
So we cannot conclude that C2 is better than C1
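The reasoning above uses the usual binomial confidence interval for an accuracy; as a worked equation (a standard result; the 95% level and the accuracy value 0.85 below are assumptions for illustration):
\[acc \pm 1.96\sqrt{\frac{acc\,(1-acc)}{n}}\]
For instance, with \(n=1000\) test examples and \(acc=0.85\), the half-width is about \(\pm 0.022\): two classifiers whose accuracies differ by only one or two points fall within each other's intervals.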
Precision: \(p = \frac {TP}{TP+FP}\)
Recall: \(r = \frac {TP}{TP+FN}\)
Accuracy: \(a = \frac {TP+TN}{TP+TN+FP+FN}\)
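A small worked example (the confusion-matrix counts are made up for illustration):

# hypothetical confusion matrix: 80 TP, 10 FP, 5 FN, 905 TN
TP, FP, FN, TN = 80, 10, 5, 905
precision = TP / (TP + FP)                    # 0.889
recall    = TP / (TP + FN)                    # 0.941
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.985
print(precision, recall, accuracy)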
We may want to be sure that every positive detected is really positive
Adjust decision threshold:
Precision-recall curve:
Receiver Operating Characteristic:
https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf
http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf
\(\rightarrow\) Descent methods
Any \(\Delta x\) that verifies:
\[\Delta x^T \nabla f(x) < 0\]
is a descent direction.
How to find the descent direction ?
Simplest approach: Gradient descent: \[\Delta x = -\nabla f(x)\]
For strictly convex functions, we have linear convergence: \[f(x^k)-p \leq c\left( f(x^{k-1})-p \right)\]
with \(0 < c < 1\)
\[f(x)=\frac 1 2 (x_1^2 + ax_2^2)\]
\[\nabla f(x) = [ x_1 , ax_2 ]\]
From the point (1,1), the steepest descent is (-1,-a)
But we know \(\min f\) is in (0,0) \(\rightarrow\) optimal direction (-1,-1)
So scaling of inputs is very important !
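A small NumPy sketch of this effect (the condition number a=20 and the step size are illustrative choices):

import numpy as np

a = 20.0                                     # badly scaled second dimension
grad = lambda x: np.array([x[0], a * x[1]])  # gradient of f(x) = 0.5*(x1^2 + a*x2^2)

x = np.array([1.0, 1.0])
t = 0.09                                     # must stay below 2/a to avoid divergence
for k in range(50):
    x = x - t * grad(x)                      # steepest descent zig-zags along x2
print(x)                                     # x1 converges much more slowly than x2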
Multiply the gradient with the inverse Hessian:
\[\Delta x = -H(x)^{-1} \nabla f(x)\]
\[H(x)=\nabla^2 f(x) = \frac {\partial^2 f(x)}{\partial x \partial x^T}\]
Handles relative scaling between dimensions, and so is invariant to linear changes of coordinates
If \(f(x)\) quadratic, gives the global optimum direction
But it is not invariant to higher order (e.g. quadratic) changes of coordinates
It may be very costly (\(O(N^3)\) with \(N\)=nb of dimensions) to solve the linear system \(H(x)^{-1}\nabla f(x)\)
Approximate \(f\):
\[f^k(x+v)=f(x)+\nabla f(x)^T v + \frac 1 2 v^T B_k v\]
The direction of the descent is:
\[\Delta x = -B_k^{-1} \nabla f(x)\]
\(B_k\) is an approximation of the Hessian that is continuously updated
See also Krylov methods to approximate the Hessian
Special case of Quasi-Newton method: Broyden, Fletcher, Goldfarb and Shanno.
\[s_k = x^{k+1} - x^k\]
\[y_k = \nabla f(x^{k+1})-\nabla f(x^k)\]
\[\rho_k = \frac 1 {y_k^T s_k}\]
\[F_{k+1} = (I-\rho_k s_k y_k^T) F_k (I-\rho_k y_k s_k^T) + \rho_k s_k s_k^T\]
Descent direction \(F_k \nabla f(x)\) can be computed in \(O(N^2)\)
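A sketch of using a quasi-Newton optimizer in practice via SciPy (the quadratic objective reuses the badly scaled example above; scipy.optimize.minimize with method="BFGS" maintains the inverse-Hessian approximation \(F_k\) internally):

import numpy as np
from scipy.optimize import minimize

a = 20.0
f = lambda x: 0.5 * (x[0]**2 + a * x[1]**2)
grad = lambda x: np.array([x[0], a * x[1]])

res = minimize(f, x0=np.array([1.0, 1.0]), jac=grad, method="BFGS")
print(res.x, res.nit)   # reaches (0, 0) in a handful of iterations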
As well as conjugate gradient:
Assume the function to minimize has the form: \[f(x)=\frac 1 T \sum_{t=1}^T f_t(x)\]
This is the case for
Standard Gradient Descent:
\[x \leftarrow x - t\nabla f(x)\]
SGD approximates the gradient with the gradient at a single sample:
\[x \leftarrow x - t\nabla f_i(x)\]
In practice:
There are proofs that, under relatively mild assumptions, and when the learning rates \(t\) decrease with an appropriate rate, then SGD converges almost surely to a global minimum when the function is convex, and otherwise to a local minimum.
SGD + back-propagation is used to train all deep neural nets
Recent research works suggest that SGD is so successful because such an approximation of the gradient strongly regularizes the learning process.
Momentum:
\[\Delta x \leftarrow \alpha \Delta x - t \nabla f_i(x)\]
decay: the learning rate decreases over time
Adagrad: different learning rate per dimension
RMSProp: uses a running average of recent gradients per dimension
Adam: enhancements of RMSProp
mini-batch SGD: rule of thumb: use 32, 64, 128 or 256
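A minimal sketch of these variants in PyTorch (the model, learning rates and batch size are placeholders; torch.optim.SGD and torch.optim.Adam are the standard implementations):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # placeholder model
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(256, 10),
                                   torch.randint(0, 2, (256,))),
    batch_size=32, shuffle=True)              # mini-batch SGD, batch size 32

opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # or torch.optim.Adam(..., lr=0.001)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)  # learning-rate decay
crit = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in loader:
        opt.zero_grad()
        loss = crit(model(x), y)
        loss.backward()
        opt.step()
    sched.step()                              # decrease the learning rate over time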
Example: Check whether a person appears in a photo
Reduce number of parameters by sharing connection weights:
Kernel size = 3x3
\[(I * K)_{x,y} = \sum_{i=1}^h \sum_{j=1}^w K_{ij} \cdot I_{x+i-1,y+j-1}\]
\[conv(I,K)_{x,y} = \sigma\left(b+\sum_{i=1}^h \sum_{j=1}^w \sum_{k=1}^d K_{ijk} \cdot I_{x+i-1,y+j-1,k}\right)\]
Stride = 1 vs. stride = 2
Receptive field = 3
Kernel size 3x3, stride 2, padding 1
With appropriate kernel size, padding and stride 1, the resulting “blurred image” has the same size as the input image
We want to compress the information and reduce the layers size progressively down to the final layer:
Max pooling = pick the max value within a filter size
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
input layer size should be divisible by 2: 32, 64, 96, 224, 384, 512
conv layers use small filters (3x3, at most 5x5)
pool layer: usually 2x2 with stride 2
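A hedged PyTorch sketch of the pattern above (here N=2, M=2, K=1, with made-up channel sizes, for a 32x32 RGB input):

import torch
import torch.nn as nn

net = nn.Sequential(
    # [[CONV -> RELU]*2 -> POOL] * 2
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, stride=2),                 # 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, stride=2),                 # 16x16 -> 8x8
    # [FC -> RELU]*1 -> FC
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 10),                        # e.g. 10 classes
)
print(net(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 10])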
Inception module:
Karen Simonyan, Andrew Zisserman, 2014
Classical architecture but much deeper
G. Huang et al., 2016
each layer directly connected to all following layers
Yoon Kim, 2014
Not very deep, but extremely efficient !
# trainx = 1, seqlen, 1
# trainy = 1, seqlen, 1
trainds = torch.utils.data.TensorDataset(trainx, trainy)
trainloader = torch.utils.data.DataLoader(trainds, batch_size=1, shuffle=False)
testds = torch.utils.data.TensorDataset(testx, testy)
testloader = torch.utils.data.DataLoader(testds, batch_size=1, shuffle=False)
crit = nn.MSELoss()

class Mod(nn.Module):
    def __init__(self, nhid):
        super().__init__()
        self.cnn = nn.Conv1d(in_channels=1, out_channels=nhid, kernel_size=3)
        self.mlp = nn.Linear(nhid, 1)

    def forward(self, x):
        # x = B, T, d
        ...

mod = Mod(h)
x = torch.rand(1, 7, 1)
y = mod(x)
print(y.shape)

def test(mod):
    mod.train(False)
    totloss, nbatch = 0., 0
    for data in testloader:
        inputs, goldy = data
        haty = mod(inputs)
        loss = crit(haty, goldy)
        totloss += loss.item()
        nbatch += 1
    totloss /= float(nbatch)
    mod.train(True)
    return totloss

def train(mod):
    optim = torch.optim.Adam(mod.parameters(), lr=0.001)
    for epoch in range(nepochs):
        testloss = test(mod)
        totloss, nbatch = 0., 0
        for data in trainloader:
            inputs, goldy = data
            optim.zero_grad()
            haty = mod(inputs)
            loss = crit(haty, goldy)
            totloss += loss.item()
            nbatch += 1
            loss.backward()
            optim.step()
        totloss /= float(nbatch)
        print("err", totloss, testloss)
    print("fin", totloss, testloss, file=sys.stderr)

mod = Mod(h)
# visualize the model
for n, p in mod.named_parameters(): print(n, p.shape)
print("nparms", sum(p.numel() for p in mod.parameters() if p.requires_grad))
train(mod)
import sys
import torch.nn as nn
import torch
import numpy as np

dn = 50.
h = 10
nepochs = 100

with open("meteo/2019.csv","r") as f: ls=f.readlines()
# divide by dn to normalize inputs (temperatures are below dn=50 degrees)
# remove the last elt (:-1) because it has no label (cannot predict the next temp)
trainx = torch.Tensor([float(l.split(',')[1])/dn for l in ls[:-1]]).view(1,-1,1)
# we want to predict y=x[3] from x[0,1,2] (because kernel size = 3)
# so the label of the first batch should be x[mod.kern], and so on...
trainy = torch.Tensor([float(l.split(',')[1])/dn for l in ls[3:]]).view(1,-1,1)
with open("meteo/2020.csv","r") as f: ls=f.readlines()
testx = torch.Tensor([float(l.split(',')[1])/dn for l in ls[:-1]]).view(1,-1,1)
testy = torch.Tensor([float(l.split(',')[1])/dn for l in ls[3:]]).view(1,-1,1)

# trainx = 1, seqlen, 1
# trainy = 1, seqlen, 1
trainds = torch.utils.data.TensorDataset(trainx, trainy)
trainloader = torch.utils.data.DataLoader(trainds, batch_size=1, shuffle=False)
testds = torch.utils.data.TensorDataset(testx, testy)
testloader = torch.utils.data.DataLoader(testds, batch_size=1, shuffle=False)
crit = nn.MSELoss()

class Mod(nn.Module):
    def __init__(self, nhid):
        super(Mod, self).__init__()
        self.cnn = nn.Conv1d(1, nhid, 3)
        self.kern = 3
        self.mlp = nn.Linear(nhid, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x = B, T, d
        # we want B, d, T (see doc Conv1d)
        xx = x.transpose(1, 2)
        y = self.cnn(xx)
        y = self.relu(y)
        B, H, T = y.shape
        yy = y.transpose(1, 2)
        # yy = B, T, d
        y = self.mlp(yy.view(B*T, H))
        y = y.view(B, T, -1)
        return y

# mod = Mod(h)
# x = torch.rand(1, 7, 1)
# print(x.shape)
# y = mod(x)
# print(y.shape)
# exit()

def test(mod):
    mod.train(False)
    totloss, nbatch = 0., 0
    for data in testloader:
        inputs, goldy = data
        haty = mod(inputs)
        # goldy = B, T, d
        loss = crit(haty, goldy)
        totloss += loss.item()
        nbatch += 1
    totloss /= float(nbatch)
    mod.train(True)
    return totloss

def train(mod):
    optim = torch.optim.Adam(mod.parameters(), lr=0.001)
    for epoch in range(nepochs):
        testloss = test(mod)
        totloss, nbatch = 0., 0
        for data in trainloader:
            inputs, goldy = data
            optim.zero_grad()
            haty = mod(inputs)
            loss = crit(haty, goldy)
            totloss += loss.item()
            nbatch += 1
            loss.backward()
            optim.step()
        totloss /= float(nbatch)
        print("err", totloss, testloss)
    print("fin", totloss, testloss, file=sys.stderr)

mod = Mod(h)
print("nparms", sum(p.numel() for p in mod.parameters() if p.requires_grad), file=sys.stderr)
train(mod)
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
\[s_t = f(Ux_t + W s_{t-1})\]
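A minimal PyTorch sketch of this recurrence (the dimensions are placeholders; tanh plays the role of f):

import torch
import torch.nn as nn

d_in, d_hid, T = 8, 16, 5
U = nn.Linear(d_in, d_hid, bias=False)    # input-to-hidden weights
W = nn.Linear(d_hid, d_hid, bias=False)   # hidden-to-hidden weights

x = torch.randn(T, d_in)                  # a sequence of T input vectors
s = torch.zeros(d_hid)                    # initial state s_0
for t in range(T):
    s = torch.tanh(U(x[t]) + W(s))        # s_t = f(U x_t + W s_{t-1})
print(s.shape)                            # the final hidden state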
Back-Propagation Through Time:
Example: gradient with respect to output 3: \[\frac{\partial E_3}{\partial W} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \left( \prod_{j=k+1}^3 \frac {\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}\]
Derivative of the \(tanh\) activation: \(\tanh'(z) = 1-\tanh^2(z) \leq 1\), which approaches 0 as soon as \(|z|\) is large
The gradient of the activation is therefore often close to 0
When many such gradients are multiplied, the gradient becomes smaller and smaller, eventually negligible
After some time steps, no more training is possible
Exploding gradients are also possible, but can be solved with clipping
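A minimal sketch of gradient clipping in a PyTorch training step (the placeholder RNN, the dummy loss and the max norm of 1.0 are illustrative assumptions; the clipping call goes between backward() and the optimizer step):

import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # placeholder RNN
optim = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 100, 8)                 # batch of long sequences
out, _ = model(x)
loss = out.pow(2).mean()                   # dummy loss for illustration
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale exploding gradients
optim.step()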
Cho et al., 2014
Other models zoo
git clone git@github.com:...
git pull origin main
git add myfile.py
git commit -am "I fixed the bug"
git pull origin main
git push origin main
git rm oldfile.py
git mv oldfile.py newfile.py
git add file_that_was_in_conflict.py
git commit -am "conflict solved"
git push