Overfitting & double descent

Why does SGD work so well?

3 reasons identified so far (paper):

  • Hyp 1 (“classic” hyp.): because it computes noisy gradients, which regularizes training
    • “the ratio of the stepsize 𝜂 to the batch-size 𝑏, which essentially controls the magnitude of the stochastic gradient noise”

Why does SGD work so well?

  • The “noise” introduced by SGD is structured:
    • it vanishes near global optima
    • it stays in the same low-dimensional manifold as the gradients
  • See Francis Bach’s blog
    • it analyzes a continuous-time version of SGD, a.k.a. “gradient flows”
    • in the overparameterized regime and for simple networks, it proves that this structured noise makes SGD converge towards either minimum-L1-norm or minimum-L2-norm solutions
    • Label noise also helps!

Better L1 or L2?

  • \(f=\) linear regression with feature map: \(f(x)=\langle\beta,\phi(x)\rangle\) \[\phi(x) = (\frac 1 k \cos(2\pi kx), \frac 1 k \sin(2\pi kx))_{k\leq d/2}\]
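A minimal sketch of this feature map in PyTorch (the dimension d=8 and the random weight vector beta are arbitrary assumptions, just to make it runnable):

import math
import torch

def phi(x, d=8):
    # Fourier features: for k = 1..d/2, stack cos(2*pi*k*x)/k and sin(2*pi*k*x)/k
    ks = torch.arange(1, d // 2 + 1, dtype=torch.float32)
    return torch.cat((torch.cos(2 * math.pi * ks * x) / ks,
                      torch.sin(2 * math.pi * ks * x) / ks))

beta = torch.randn(8)            # arbitrary weight vector (assumption)
x = torch.tensor(0.3)
print(torch.dot(beta, phi(x)))   # f(x) = <beta, phi(x)>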

Better L1 or L2?

SGD vs. GD

Why does SGD work so well?

  • Hyp 2: “flatness” of the local minimum found by SGD, which is related to the eigenvalues of the Hessian
    • The Zhou paper studies SGD as a stochastic differential equation
    • It shows that the noise in SGD makes it easier to escape basins of attraction than with Adam
    • see Bengio’s paper
    • a high ratio of learning rate to batch size leads to wider minima
    • cf. chapter 6 of this course; a sketch of one way to measure flatness follows below
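A minimal sketch of one way to probe this “flatness”: estimate the largest Hessian eigenvalue of the loss by power iteration on Hessian-vector products (the toy model and random data below are assumptions, to be replaced by your own):

import torch
import torch.nn as nn

# toy model and data (assumptions, only to make the sketch runnable)
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.MSELoss()(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)

# power iteration on Hessian-vector products: v <- Hv / ||Hv||
v = [torch.randn_like(p) for p in params]
for _ in range(20):
    gv = sum((g * u).sum() for g, u in zip(grads, v))
    hv = torch.autograd.grad(gv, params, retain_graph=True)
    norm = torch.sqrt(sum((h * h).sum() for h in hv))
    v = [h / norm for h in hv]
# norm converges to the magnitude of the largest Hessian eigenvalue:
# a small value indicates a flatter minimum
print("largest Hessian eigenvalue ~", norm.item())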

Why does SGD work so well?

Bias-variance tradeoff

It depends on both the classifier and the learning method

  • Let \(Z\) be a random variable with distribution \(P\)

  • Let \(\bar{Z} = E_P[Z]\) be the average value of \(Z\)

Lemma: \[E\left[ (Z-\bar{Z})^2 \right] = E[Z^2] - \bar{Z}^2\]
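Proof sketch (expanding the square and using linearity of expectation): \[E\left[(Z-\bar{Z})^2\right] = E[Z^2 - 2\bar{Z}Z + \bar{Z}^2] = E[Z^2] - 2\bar{Z}E[Z] + \bar{Z}^2 = E[Z^2] - \bar{Z}^2\]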

  • Let \(f\) be a real function, and let our task be to model \(p(y|x)\) s.t. \(y=f(x)+\epsilon\)
    • with \(\epsilon \sim N(0;\sigma^2)\)
  • We fit a hypothesis \(h(x) = w\cdot x +b\) with MSE over a training corpus

  • Consider a new data point \(y = f(x) + \epsilon\):
    • What is the expected error? \[E\left[ (y-h(x))^2 \right]\]

The expectation is taken over the training corpora (which determine \(h\)); \(x\) is held constant

\(E[(h(x)-y)^2] = E[h^2(x)] -2E[h(x)]E[y] + E[y^2]\)

Apply the previous lemma to both square terms:

\(E[(h(x)-\bar{h(x)})^2] + \bar{h(x)}^2 -2\bar{h(x)}f(x) + E[(y-f(x))^2] + f(x)^2\)

Group the terms:

\(E[(h(x)-\bar{h(x)})^2] + (\bar{h(x)} - f(x))^2 + E[(y-f(x))^2]\)

i.e. variance of the hypothesis + squared bias + irreducible noise.

  • (Wrong) intuition: it’s more important to reduce bias
    • Yes, on average in the long run, low-bias models may perform well
    • But long-run averages are irrelevant in practice!
  • Bagging helps to reduce variance (see the sketch after this list):
    • resample the corpus \(\rightarrow\) train multiple models
    • average their predictions
    • Ex: Random forests
  • Large bias: underfitting
    • Increase the number of parameters
  • Large variance: overfitting
    • More regularization
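A minimal sketch of bagging for regression (the toy data, the base model and the number of bootstrap resamples are arbitrary assumptions):

import torch
import torch.nn as nn

# toy 1-D regression data (assumption)
X = torch.linspace(0, 1, 100).unsqueeze(1)
Y = torch.sin(6 * X) + 0.3 * torch.randn_like(X)

def fit_one(x, y, nsteps=500):
    # train one small model on a bootstrap resample of the corpus
    mod = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
    opt = torch.optim.Adam(mod.parameters(), lr=0.01)
    for _ in range(nsteps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(mod(x), y)
        loss.backward()
        opt.step()
    return mod

models = []
for _ in range(10):                       # 10 bootstrap resamples
    idx = torch.randint(0, len(X), (len(X),))
    models.append(fit_one(X[idx], Y[idx]))

with torch.no_grad():
    # the bagged prediction averages the individual predictions, reducing variance
    pred = torch.stack([m(X) for m in models]).mean(dim=0)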

Under/over-fitting

  • Model “capacity” depends on the number of parameters and the topology

How to find this optimum?

  • There are theoretical measures (see Akaike’s Information Criterion)
  • But much better to use a development corpus

Overfitting in practice

  • Fixed model capacity
  • Never stop at “Nice, it’s compiling and running, I’m done!”:
    • Try to overfit on 1 batch, then 10, then 100… (see the sketch below)
    • Play with regularization
    • Find the broad “overfitting threshold”
    • What if I double the capacity?
  • The art: interpreting the train and dev loss curves
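A minimal sketch of the “overfit one batch” sanity check (the model, loss and batch below are placeholders/assumptions, to be replaced by your own):

import torch
import torch.nn as nn

# placeholders (assumptions): replace with your own model, loss and batch
model = nn.Linear(20, 2)
lossf = nn.CrossEntropyLoss()
x, y = torch.randn(16, 20), torch.randint(0, 2, (16,))

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    opt.zero_grad()
    loss = lossf(model(x), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, loss.item())
# if the model cannot drive the loss on this single batch close to 0,
# suspect a bug (wrong targets, wrong loss, frozen parameters, bad LR...)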

Experiment, analyze, experiment, analyze…

Train/dev loss curves

  • C1: good curve: model around the overfitting threshold

  • C2: model (too far?) on the right of the overfitting threshold

  • C3: model (too far!) on the left of the overfitting threshold

  • C4: convergence issue: check LR

  • C5: training loss=0: model capacity large (too large?)

  • C6: large training loss: model capacity too small

  • C7: unstable training loss: reduce LR?

  • C8: unstable dev loss: did you average it over the corpus?

Computing the loss

  • Can be averaged over the whole corpus:
    • dev corpus usually small ==> OK
    • train corpus huge! LLMs are often trained over a single epoch
  • Can be averaged over a batch
    • “just report every loss computed”
    • the smaller the batch, the larger the variability
    • running average since the beginning of the corpus?
  • Can be averaged over \(n\) batches
    • much smoother curves, easier to interpret
    • but which value of \(n\)?
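A minimal sketch of reporting the training loss averaged over the last \(n\) batches (the value \(n=50\) is an arbitrary assumption):

from collections import deque

n = 50
window = deque(maxlen=n)   # keeps only the last n batch losses

def report(batch_loss, step):
    window.append(batch_loss)
    if step % n == 0 and len(window) == n:
        print("step", step, "avg loss over last", n, "batches:", sum(window) / n)

# inside the training loop, after loss.backward() / opt.step():
#     report(loss.item(), step)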

Train vs. Dev absolute loss values

  • Are the train and dev loss values comparable?
  • Sometimes yes, often no:
    • “train” vs. “eval” mode: dropout…
    • each value should be representative of its distribution, i.e. averaged over enough data

Dev loss or accuracy curve?

  • Cross-entropy is strongly related to accuracy, but it is not the same thing
  • The loss can decrease either because:
    • the number of errors decreases
    • or because the model becomes “over-confident” in its predictions
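A tiny numerical illustration (the probabilities are made up): both models classify all 3 examples correctly, so the accuracy is identical, but the average cross-entropy differs:

import math

# 3 test examples; in both cases all 3 are correctly classified (accuracy = 100%)
p_calibrated    = [0.70, 0.80, 0.75]   # probability assigned to the correct class
p_overconfident = [0.99, 0.98, 0.99]

ce = lambda ps: -sum(math.log(p) for p in ps) / len(ps)    # average cross-entropy
print("calibrated     CE:", round(ce(p_calibrated), 3))    # ~0.289
print("over-confident CE:", round(ce(p_overconfident), 3)) # ~0.013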

Overfitting is not bad-1

  • Tishby’s “information bottleneck”

Tishby’s bottleneck

(Tishby, 2015)
https://arxiv.org/abs/1503.02406

  • 2 phases: overtraining, then compression

Overfitting is not bad-2

  • “double descent” (Belkin et al., arXiv:1812.11118, 2018)

Let us consider the following regression problem:

  • gold (unknown) function \(f(x)=4(x-\frac 5 2)^2\)
  • observations: 6 samples
    • (0,25) (1,9) (2,1) (3,1) (4,9) (5,25)

Inspired by https://mlu-explain.github.io/double-descent2/

  • model = piecewise linear function with fixed “knot points” \(t_k\):

\(g(x) = \alpha + \beta x + \sum_{k=1}^K \gamma_k \phi_k(x)\)

with

\(\phi_k(x) = \frac 1 2 |x-t_k|\)
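A minimal sketch (knot positions and coefficients are arbitrary assumptions) to check numerically that \(g\) is piecewise linear:

import torch

def g(x, alpha, beta, gamma, knots):
    # g(x) = alpha + beta*x + sum_k gamma_k * 0.5 * |x - t_k|
    phis = 0.5 * torch.abs(x.unsqueeze(-1) - knots)   # shape (..., K)
    return alpha + beta * x + (gamma * phis).sum(-1)

knots = torch.tensor([1., 2., 3., 4.])      # arbitrary knot positions (assumption)
gamma = torch.tensor([0.5, -1.0, 2.0, 0.3]) # arbitrary coefficients (assumption)
xs = torch.linspace(0, 5, 11)
print(g(xs, alpha=0.0, beta=1.0, gamma=gamma, knots=knots))
# the slope only changes at the knots: between two knots the output is linear in x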

  • parameters = \((\alpha,\beta,\gamma_k)\)
  • you can show that \(g(x)\) is a parametrization of piecewise linear functions:
    • the slope between two consecutive knots is constant (a weighted sum of the \(\gamma_k\))
    • you can match any piecewise linear function on these knots: \(K\) linear equations with \(K\) unknowns
  • case \(K=0\): linear regression
  • Implementation in pytorch (case \(K=0\)): model:
import torch
import torch.nn as nn

class Knots(nn.Module):
    # K=0: only the intercept alpha and the slope beta are learned

    def __init__(self):
        super(Knots, self).__init__()
        self.alpha = nn.Parameter(torch.Tensor([0.]))
        self.beta  = nn.Parameter(torch.Tensor([0.]))

    def forward(self, X):
        return self.alpha + self.beta * X
  • data (inputs rescaled to [0, 0.5], targets to [0.01, 0.25]):
traind = torch.utils.data.TensorDataset(
        torch.Tensor([0,1,2,3,4,5])/10.,
        torch.Tensor([25,9,1,1,9,25])/100.)
train_loader = torch.utils.data.DataLoader(traind, batch_size=6, shuffle=True)
  • training loop:
def learn():
    nepochs = 50000
    lossf = nn.MSELoss()
    mod = Knots()
    opt = torch.optim.Adam(mod.parameters(), lr=0.01)
    for ep in range(nepochs):
        n,totloss = 0,0.
        for x,y in train_loader:
            n+=1
            opt.zero_grad()
            haty = mod(x)
            loss = lossf(haty,y)
            totloss += loss.item()
            loss.backward()
            opt.step()
        totloss /= float(n)
        print("TRAINLOSS",totloss,mod.alpha.item(),mod.beta.item())
  • Exercise: plot the loss curve with \(K=0\)
  • You should obtain:
    • a horizontal line
    • the loss plateaus at 0.11
    • under-parameterized regime
    • the model underfits
  • With \(K=1\), you get (see the exercise):
  • At the interpolation threshold, the solution is entirely determined by the knots:

  • a single solution for each choice of knot positions

  • Let’s compute the expected error when sampling the first knot uniformly between 1 and 2
  • Assume the first knot \(t_0\) falls just left of 2: \(P(t_0 \in [2-\epsilon; 2-\epsilon/2]) \propto \epsilon\)
  • Assume \(\epsilon<0.9\): then the knot is at best at \((2-\epsilon,0.9)\) and the slope is \(\geq 0.1/\epsilon\)
  • so the error (q-norm) is at least of the order of \((1/\epsilon)^q\)
  • Consider the population of disjoint events \((\epsilon = 1/2^m)_m\), each with probability \(\propto \epsilon\)
  • The expected error is lower-bounded by the sum over these events:

\[E[||\hat f - f||_q] \geq \sum_{m=1}^{\infty} \frac 1 {2^m} (2^m)^q = \sum_{m=1}^{\infty} 2^{m(q-1)}\]

  • The expected error is infinite (the sum diverges for \(q\geq 1\))!
  • loss curve at \(K=4\):

  • Over-parameterized model:
    • more than one possible solution
  • Let \(\delta_i\) be the slope of segment \(i\)
  • \(\delta_{i+1}-\delta_i = \gamma_i\)
    • because when crossing \(t_i\), the slope contribution of \(\phi_i\) flips from \(-\frac 1 2 \gamma_i\) to \(+\frac 1 2 \gamma_i\) in \(g(x) = \alpha + \beta x + \sum_{k=1}^K \gamma_k \phi_k(x)\)
  • Let \(\Delta_i\) be the distance between the centers of the two adjacent segments
  • Assume we have many knots, so \(\frac {\delta_{i+1}-\delta_i}{\Delta_i} = \frac {\gamma_i} {\Delta_i}\) approximates the second derivative \(g''(t_i)\)
  • Let \(p(x)\) be the probability density from which the knots \(t_i\) are sampled
  • The probability that \(t_i\) lies in its interval \(\Delta_i\) is approx. \(\Delta_i p(t_i)\)
  • Alternative view: we know one of the \(K\) knots is there, so the probability that it is \(t_i\) is \(1/K\) \[\Delta_i \simeq \frac 1 {Kp(t_i)}\]
  • So let us regularize our MSE loss by minimizing \(||\gamma||^2\):

\[||\gamma||^2 = \sum_k^K \gamma_k^2 \simeq \sum_k^K g''(t_k)^2 \Delta_k^2 \simeq \frac 1 K \sum_k^K \frac {g''(t_k)^2}{p(t_k)} \Delta_k\]

  • This is a Riemann-sum approximation of the integral of \(\frac {g''(x)^2}{p(x)}\):

\[||\gamma||^2 \simeq \frac 1 K \int \frac {g''(x)^2}{p(x)}dx\]

  • assume \(p(x)\) is uniform on the domain and \(g''(x)=0\) outside the domain: then we minimize \(\int_a^b g''(x)^2 dx\)
  • so we look for an interpolation that is as close to linear as possible
  • this corresponds to the natural cubic spline
  • Extension to the Multi-Layer Perceptron: proof by Singh et al., ICLR’22
  • assumptions:
  • MLP \(f()\) trained on data \(S\sim D\) to minimum \(\theta^*\)
  • Empirical loss \(L_S\)
  • covariance of loss gradients: \(C_L^{D} (\theta) = E_{z\sim D} [\nabla_{\theta} l(z,\theta) \nabla_{\theta} l(z,\theta)^T]\)
  • covariance of function Jacobians: \(C_f^{S} (\theta) = \frac 1 {|S|} Z_S(\theta) Z_S(\theta)^T\)
  • with \(Z_S(\theta)=[\nabla_{\theta} f_{\theta}(x_1), \dots, \nabla_{\theta} f_{\theta}(x_{|S|})]^T\)
  • \(H_L^S(\theta^*)\) = Hessian of \(L\) estimated on the training corpus \(S\)
  • \(r\) = rank of Hessian;
  • \(\lambda_1,\dots,\lambda_r\) = eigenvalues of Hessian in decreasing order
  • \(\lambda_{min}(A)\) = smallest eigenvalue of matrix \(A\);
  • \(\sigma^2_{(x,y)} = ||\nabla_f l(f_{\theta^*}(x),y)||^2\)
  • \(\exists\) a sample \((x,y) \sim D\) s.t. \(\sigma^2_{(x,y)}>0\)
  • \(\alpha\) is the probability of \((x,y)\) in D;
  • \(\tilde D = \{(x,y) \sim D : \sigma^2_{(x,y)}>0\}\);
  • \(\sigma^2_{min} = \min_{(x,y)\sim \tilde D} \sigma^2_{(x,y)}\)

\(L(\theta^*) \geq \tilde {L_S}(\theta^*) + \frac 1 {n+1} \frac {\sigma_{min}^2 \alpha \lambda_{min} \left( C_f^{\tilde D} (\theta^*) \right)} {\lambda_r \left( H_L^S(\theta^*) \right) }\)

  • with the “one-sample training loss” \(\tilde {L_S}(\theta^*) = E_{z\sim D}[l(z,\theta^*_{S\cup \{z\}})]\)
  • What is important = \(\frac 1 {\lambda_r\left( H_L^S(\theta^*) \right)}\)
  • Assume the MSE loss (and a few other hypotheses…):

\(L(\theta^*) \geq \tilde {L_S}(\theta^*) + \frac {\sigma_{min}^2 \alpha A \lambda_{min} \left( C_f^{\tilde D} (\theta^0) \right)} {B ||C_f^{D} (\theta^0)||_2 \left( \sqrt n - c \sqrt p \right)^2 }\)

  • \(A\), \(B\) and \(c\) are constants
  • \(\theta^0\) is the initial model
  • \(n\) = number of samples, \(p\) = number of parameters
  • The bound diverges when \(\sqrt n = c \sqrt p\), i.e. for \(p = n / c^2\)

Grokking

  • First observed by OpenAI in Jan 2022
    • Models trained on small arithmetic tasks
    • One must wait well past the point of overfitting for generalization to occur
  • It often occurs “at the onset of Slingshots” Apple paper
    • Slingshots = a form of weak regularization that occurs when “numerical underflow errors in calculating training losses create anomalous gradients which adaptive gradient optimizers like Adam propagate” blog
    • post-grokking solutions = flatter minima
  • Training involves multiple phases; grokking is one of them MIT paper
    • generalization, grokking, memorization and confusion
  • When representations become structured, generalization occurs:

  • The 4 phases depend on hyper-parameters:

  • Why do we observe phases during training?
    • NYU paper 2023
    • Because of competing sub-networks: a dense one for memorization and a sparse one for generalization

  • You have seen:
    • template code for pytorch
    • deep learning as differentiable programming
    • experiments on underfitting, the interpolation point, and the over-parameterized regime
    • importance of regularization
    • double descent: why and when; grokking

Mini-project

  • To be done in groups of 2 (or individually)
  • You must send me by email a GitHub link (or a zip file with the full Git history, including the .git/ directory) that contains:
    • A report in markdown or plain-text format, with links to figures/curves
    • Code in Python + PyTorch
  • deadline = 15 December 2023
  • Work to be done:
    • Choose a public dataset that you can train a deep learning model on
    • Create one or more models in PyTorch based on, or combining, MLP, LSTM, ConvNet, Transformer
      • It’s OK to get inspiration from code found on the web, but you must not simply reuse it, and you must state in your report where your inspiration came from
    • Train it, evaluate it, analyze the results and, if possible, iterate this process a few times
  • Criteria used for grades:
    • Quality and depth of your analysis
    • Balanced contribution in the group (through GIT history)
    • Work progress over time (through GIT history)
  • The report is by far the most important!
    • between 5 and 15 “pages”

Practice “RNN overfit”

  • Download Meteo data
  • Goal = predict the next T° given a sequence of T°
  • train on 2019, test on 2020
  • Objectives:
    • template code for RNN & time series
    • use of Dataset & DataLoader
    • explore underfitting/overfitting
    • understand sources of variability
  • Data: convert to tensor, then to Dataset, then to DataLoader:
# trainx: tensor of shape (1, seqlen, 1)
# trainy: tensor of shape (1, seqlen, 1)
trainds = torch.utils.data.TensorDataset(trainx, trainy)
trainloader = torch.utils.data.DataLoader(trainds, batch_size=1, shuffle=False)
testds = torch.utils.data.TensorDataset(testx, testy)
testloader = torch.utils.data.DataLoader(testds, batch_size=1, shuffle=False)
crit = nn.MSELoss()
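A minimal sketch of how the raw temperature sequences could be turned into these tensors (the variable names temps2019/temps2020 and the toy values are assumptions; the real series must be loaded from the Meteo data):

import torch

# temps2019 / temps2020: plain Python lists of daily temperatures (assumption)
temps2019 = [2.1, 3.0, 1.5]   # replace with the real 2019 series
temps2020 = [1.8, 2.4, 0.9]   # replace with the real 2020 series

def to_xy(temps):
    t = torch.tensor(temps, dtype=torch.float32)
    x = t[:-1].view(1, -1, 1)   # inputs:  T° of days 0..N-2, shape (1, seqlen, 1)
    y = t[1:].view(1, -1, 1)    # targets: T° of days 1..N-1, shape (1, seqlen, 1)
    return x, y

trainx, trainy = to_xy(temps2019)
testx,  testy  = to_xy(temps2020)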
  • Model: simple RNN with a linear regression layer:
class Mod(nn.Module):
    def __init__(self,nhid):
        super(Mod, self).__init__()
        self.rnn = nn.RNN(1,nhid)
        self.mlp = nn.Linear(nhid,1)

    def forward(self,x):
        # x = B, T, d
        ...
  • Every day, the RNN predicts the next day’s T°:
    • \(X\) is a tensor (Batch, Time, Dim) = (1, 365, 1)
    • \(\hat y_t = f(x_t)\)
    • loss: \(||\hat y_t - x_{t+1}||^2\)
    • all 365 predictions shall be processed in a single pass!
      • Look at both outputs of nn.RNN in pytorch.org/docs
  • compute MSE on test:
def test(mod):
    mod.train(False)
    totloss, nbatch = 0., 0
    for data in testloader:
        inputs, goldy = data
        haty = mod(inputs)
        loss = crit(haty,goldy)
        totloss += loss.item()
        nbatch += 1
    totloss /= float(nbatch)
    mod.train(True)
    return totloss
  • Training loop:
def train(mod):
    optim = torch.optim.Adam(mod.parameters(), lr=0.001)
    for epoch in range(nepochs):
        testloss = test(mod)
        totloss, nbatch = 0., 0
        for data in trainloader:
            inputs, goldy = data
            optim.zero_grad()
            haty = mod(inputs)
            loss = crit(haty,goldy)
            totloss += loss.item()
            nbatch += 1
            loss.backward()
            optim.step()
        totloss /= float(nbatch)
        print("err",totloss,testloss)
    print("fin",totloss,testloss,file=sys.stderr)
  • The Main (set the hidden size h and nepochs):
import sys
nepochs = 100   # adjust to explore under/over-fitting
h = 1           # hidden size: compare h=1 and h=100 (step 2)
mod = Mod(h)
print("nparms", sum(p.numel() for p in mod.parameters() if p.requires_grad), file=sys.stderr)
train(mod)
  • Step 1: copy/paste + complete this code and run a training
    • check convergence!
  • Step 2: compare and analyze \(h=1\), \(h=100\)
  • Step 3: rerun multiple times: what’s the variability and why?

Practice “Knots”

  • Underfitting regime:
    • Take the “knots” code with \(K=0\) from the course and modify it to add one more parameter (\(K=1\))
    • Plot the loss curve and try to obtain a similar one as in the course
  • Overfitting regime:
    • modify the code with \(K=4\)
    • generate a test set composed of 20 uniform samples and plot both the train and test loss
    • try to get curves that clearly overfit

Practice tensors

Book: Deep Learning with PyTorch

  • Exercise from B. Favre

  • With the help of the PyTorch documentation, compute the following expression: \[a = 1_{(3 \times 2)}\] \[b = \sin(1+\sqrt{3I_2} + 5 ||a||)\]

Exercise: combine CNN and RNN

  • You have a “long” time series with 3000 timesteps. You want to handle it by first applying a CNN with kernel=3 and stride=3, which reduces it to 1000 timesteps, and then an LSTM.
  • The goal is to classify the full sequence into N classes (e.g., whether the sequence of temperatures was recorded in winter or summer…)
  • Write both methods of the model: __init__() and forward()