Uncertainty in deep learning

  • Overview:
    • Reminder of Bayesian inference
    • Uncertainty at the output of DNN
    • Uncertainty in latent layer
    • Bayesian Neural Networks
  • refs:
    • https://www.youtube.com/watch?app=desktop&v=lhwk4ESlyMA&feature=youtu.be
  • Goal1: compute how reliable the model output is
    • we know the output of MLP classifier = \(p(y|x)\)
    • but we know it overestimates the probability of its prediction
  • Better method: conformal prediction sets
    • paper
    • use a held-out dataset to estimate proba that model is correct
    • for any target confidence, returns a set of possible labels

Bayes vs. DNNs

  • Basic DNNs do not consider the process that generates data:
    • They consider only fixed-point data samples
    • Given one point \(x\), they compute one output \(y\)
  • Exemple:
    • inputs = day, month, location
    • output = temperature
  • A fixed point is not enough: there’s a lot of variability
    • wind, previous weather in neighbouring places…
  • We’d prefer mean + stdev
  • 1st source of uncertainty:
    • many factors are unknown
  • 2nd source of uncertainty:
    • our model is limited
    • ex: input = history + geo-grid of T°,P,W,rain,lum
    • output still erronous
  • Bayesian networks
    • model the distribution of \(y\) given \(x\)
    • Get uncertainty of the prediction
    • Enable sampling various good outputs
    • You may include prior knowledge
      • e.g., T° increase over years because of global warming
  • But BN are
    • less powerful than DNNs
    • very costly to train (in general)
  • Can we do the same with DNNs ?

Brief reminder of Bayesian models

  • Bayesian models assume all vars are generated from a process:
    • Variable \(X\) is sampled from a distribution, which is usually assumed parametric: \[X \sim Mult(\theta)\]
    • \(\theta\) usually trained with MLE, MAP or Bayesien inference
  • Maximum Likelihood
    • maximum likelihood <=> cross-entropy
    • data follow a true distribution \(P^*(X)\), approximated by parametric distrib \(P(X;\theta)\)
    • MLE: \(\theta^* = \arg\max_\theta P(X;\theta)\)
  • Assume iid samples:
    • \(\theta^* = \arg\max_\theta \prod_i P(X_i;\theta) = \arg\max_\theta \sum_i \log P(X_i;\theta)\)
    • \(\theta^* = \arg\min_\theta E_{X \sim P^*(X)} \left[\log \frac 1 {P(X;\theta)} \right]\)
    • which is the equation of the cross-entropy

Limits of Maximum Likelihood

  • Only relies on the evidence
  • Does not take into account prior knowledge
  • but why bother about priors?
  • Ex: we know that, in July, T° is usually positive
  • MLE will “rediscover” this fact through data
    • bad use of evidence / training time
    • much more efficient to tell it beforehand
  • Prior knowledge makes better use of few data
  • Bayesian models assume that we have some knowledge about the distribution before observing anything
    • prior distribution \(P(\theta)\)
    • Now \(\theta\) is a random variable !
    • If the prior knowledge is good enough, then we don’t need data at all
  • Combining prior knowledge + data: (1)
    • We want to find the best \(\theta\) according to both prior and likelihood
    • maximum a posteriori: \(\max_\theta P(\theta|X) = \max_\theta P(X|\theta) P(\theta)\)
    • L2 regularization <=> MAP with a Gaussian prior on the weights
  • Combining prior knowledge + data: (2)
  • Estimate the full posterior: observations (evidence) come in to modify this prior belief:
    • Transforms the prior \(P(\theta)\) into the posterior \(P(\theta|X)\)
    • = Bayesian Inference

SGD == Approx. Bayesian Inference

https://arxiv.org/abs/1704.04289

  • constant-rate SGD simulates a Markov chain with a stationary distrib
  • possible to adjust the LR of constant SGD to best match the stationary distrib to a posterior

Distributions in DNNs

  • If we use a vanilla DNN, do we have any information about the distribution of possible outputs?
  • How can we modify the vanilla DNN to get more information?
  • Vanilla DNN:
    • Can we measure uncertainty of the output?
  • DNN that outputs a distribution (Mixture Density Network)
  • DNN with a latent Random Variable
    • RBM, DBN
    • VAE
  • Bayesian DNN

Can we still get a distribution out of a vanilla DNN ?

  • Famous examples: generative model:
    • A generative model samples from a density of probability
    • VAE: the latent \(z\) is randomly sampled from a Gaussian distribution
    • GAN: the generator is fed with a random input

Softmax approximates posterior

  • Multi-class classification:
    • Objective = classify an input \(x\)
      • e.g., does an utterance express joy, anger, sadness… ?
      • Celebs pictures: who is she/he ?
    • One output neuron per class
      • each output neuron gives a score (logits)
      • normalized by softmax –> probability
      • … which probability ?
  • (Bishop,94) proves that a MLP trained with cross-entropy (and infinite data) approximates the posterior \(P(y|x)\)
  • But in practice, not very reliable to estimate uncertainty
    • Gal & Ghahramani (2016) show that a model can be uncertain in its predictions even with a high softmax output.
    • Calibration
  • Bishop proposed another analysis to estimate output distrib with MSE

Gaussian approximation

(Bishop94):

  • assume infinite data \((x,y)\), a DNN \(f(\cdot;\theta)\); the MSE error is:

\[\int \int (f(x;\theta) - y)^2 p(x,y) dx dy\]

\[(f(x;\theta) - y)^2 = (f(x;\theta) - y + E[y|x] - E[y|x])^2\]

\[\begin{eqnarray} (f(x;\theta) - y)^2 &=& (f(x;\theta) - E[y|x])^2 + (E[y|x]-y)^2 + \\ && 2(f(x;\theta) - E[y|x])(E[y|x]-y) \end{eqnarray}\]

\[\begin{eqnarray} (f(x;\theta) - y)^2 &=& (f(x;\theta) - E[y|x])^2 + E[y|x]^2 + y^2 -2E[y|x]y +\\ && 2f(x;\theta)E[y|x] -2f(x;\theta)y - 2E[y|x]^2 + 2E[y|x]y \end{eqnarray}\]

\[\begin{eqnarray} (f(x;\theta) - y)^2 &=& (f(x;\theta) - E[y|x])^2 + y^2 + 2f(x;\theta)E[y|x] - \\ && 2f(x;\theta)y -E[y|x]^2 \end{eqnarray}\]

Consider the first term in \(2f(x;\theta)...\) into the integral:

\[\begin{eqnarray} \int 2f(x;\theta)E[y|x] p(y|x)p(x) dy &=& 2f(x;\theta) p(x) \int E[y|x] p(y|x) dy \\ &=& 2f(x;\theta) E[y|x] p(x) \end{eqnarray}\]

for the other term: \[\begin{eqnarray} \int 2f(x;\theta)y p(y|x)p(x)dy &=& 2f(x;\theta) p(x) \int y p(y|x)dy \\ &=& 2f(x;\theta) E[y|x] p(x) \end{eqnarray}\]

so \[\begin{eqnarray} \int (f(x;\theta) - y)^2 p(x,y)dy &=& p(x)(f(x;\theta) - E[y|x])^2 +\\ && p(x)(E[y^2|x] - E[y|x]^2) \end{eqnarray}\]

so the MSE error is \[\int (f(x;\theta) - E[y|x])^2 p(x)dx + \int (E[y^2|x] - E[y|x]^2)p(x)dx\]

The MSE error only depends on the model parameters through the first term, and it is minimum when \(f(x;\theta) = E[y|x]\)

  • Consequently, if we want to model uncertainty of the DNN output with a Gaussian, we can interpret that, after training, the DNN:
    • outputs the expected value (mean)
    • The variance of the error (=uncertainty) can be computed from the corpus (second term)
  • OK, we have way to estimate the uncertainty from the corpus
  • but can we let the DNN computes it itself ?

Gaussian mixture approximation

  • (Bishop, 94) proposes the “Mixture Density Network”:
    • the input X is passed into an MLP with 3 outputs:
      • \(K\) mixing values \(\alpha\), which are further passed into a softmax
      • \(K\) means \(\mu\)
      • \(K\) variances \(\sigma\), which are further passed into some positive-only function
  • the final output of the model is a mixture of Gaussians: \(\sum_k \alpha_k N(x;\mu_k,\sigma_k)\)
    • The output is a probability distribution that represents the possible values of \(y\) given \(x\)
    • So if you want just one value \(y\), you have to sample
  • See Bishop’s report on MDN

DNNs with latent random variable

  • We have seen how to model the parameters of a GMM with a DNN
  • It is possible to model parametric distributions inside a DNN, but only for a Gaussian.
  • This may seem very limited, but the following DNN layers may transform this Gaussian into any other distribution!

VAE

  • Explicitly model a distribution
  • Encoder transforms input \(x\) into a distribution \(p(z)\sim N(\mu,\sigma)\)
  • Decoder transforms a sample \(z\sim p(z)\) into the output \(\hat x\)
  • We want to train the VAE with MLE, but the likelihood is hard to compute: \[\log p(x) = \log \int p(z)p(x|z)dz\]
  • variational principle:
    • convert difficult computation into optim. problem
    • we chose a variational family \(q(z|x) \in Q\)
  • How to get the ELBO (Evidence Lower BOund): \[\log p(x) = \log \int \frac {q(z|x)}{q(z|x)}p(x,z)dz = \log E_q \frac {p(x,z)}{q(z|x)}\]
    • the log of a sum is not easy to differentiate…
  • Much better to get the sum of the log:
  • Jensen’s inequality gives the ELBO: \[\geq E_{q(z|x)} \log \frac {p(x,z)}{q(z|x)}\]
  • Note: when we maximize the lower bound, we actually make \(q\) closer to \(p\): \[\max_q E_{q(z|x)} \log \frac {p(x,z)}{q(z|x)} = \log p(x) - \min_q D(q(z|x)||p(z|x))\]

  • So MLE involves now a double optimization (\(\hat p\) is the empirical corpus): \[\max_{p,q} E_{\hat p(x)}E_{q(z|x)} \log \frac {p(x,z)}{q(z|x)}\]

  • Let us consider encoder parameters \(\phi\) and decoder parameters \(\theta\) \[\max_{\theta,\phi} E_{\hat p(x)}E_{q_\phi(z|x)} \log \frac {p(z)p_\theta(x|z)}{q_\phi(z|x)}\]

  • SGD involves computing \[\nabla_\phi E_{q_\phi(z|x)} \log \frac {p(z)p_\theta(x|z)}{q_\phi(z|x)}\]

    • hard: cannot be expressed as an expected value of a gradient
    • let’s make it not dependent on \(\phi\): reparameterization trick
  • Instead of sampling from \(q\), we sample from a fixed distribution \(p(\epsilon)\): \[\nabla_\phi E_{q_\phi(z|x)} \log \frac {p(z)p_\theta(x|z)}{q_\phi(z|x)} = \nabla_\phi E_{p(\epsilon)} \log \frac {p(T(\epsilon))p_\theta(x|T(\epsilon))}{q_\phi(T(\epsilon)|x)}\]

  • which is possible when \(q_\phi(z|x)\) is a conditional diagonal Gaussian model with: \[p(\epsilon)=N(0,1)\] \[T(\epsilon)=\mu_\phi(x) + \epsilon \cdot \sigma_\phi(x)\]

  • We now have: \[\nabla_\phi E_{p(\epsilon)} \log \frac {p(T(\epsilon))p_\theta(x|T(\epsilon))}{q_\phi(T(\epsilon)|x)}= E_{p(\epsilon)} \nabla_\phi \log \frac {p(T(\epsilon))p_\theta(x|T(\epsilon))}{q_\phi(T(\epsilon)|x)}\]
  • can be estimated with Monte Carlo!

In practice

  • given \(x\), the encoder computes \(\mu_\phi(x)\) and \(\sigma_\phi(x)\)
  • sample an \(\epsilon \sim N(0,1)\)
  • build the latent representation \(z=\mu_\phi(x) + \epsilon \cdot \sigma_\phi(x)\)
  • pass \(z\) to the decoder \(\hat x = p_\theta(x|z)\)
  • optimize sum of:
    • standard loss (MSE, cross-ent) for \(\theta\): \(l(\hat x, x)\)
    • regularizer for \(\phi\): \(KL(q_\phi(z|x)||N(0,1))\)
  • and for a given \(x\): \[KL(q_\phi(z|x)||N(0,1))= \mathbb{KL}\left( \mathcal{N}(\mu, \sigma) \parallel \mathcal{N}(0, 1) \right)\] \[ = \sigma^2 + \mu^2 - \log \sigma - \frac{1}{2}\]

General pb: non-diff

  • Doubly stochastic system:
    • stochastic component: sampling in the DNN
    • stochastic optimization: sampling a batch in SGD
  • Let’s define \(\hat x \sim p_\theta(x)\) the (random) output of the DNN
  • \(f\) is the loss fct, we can estimate the expected loss with MC: \[E_{p_\theta}[f(x)] \simeq \frac 1 N \sum_i^N f(\hat x^{(i)})\]
  • for SGD, we need the gradient of the expected loss: \[\nabla_\theta E_{p_\theta}[f(x)] = \nabla_\theta \int p_\theta(x) f(x)dx = \int \nabla_\theta p_\theta(x) f(x)dx\]
  • but cannot be cast as an expectation, because \(\nabla_\theta p_\theta(x)\) is not a density (may be <0)
  • same thing for the ELBO: \[\nabla_\theta ELBO(q) = \nabla_\theta E_q [\log p(x,z) - \log q(z)]\]
  • 2 solutions:
    • score function estimator (REINFORCE)
    • pathwise gradient estimator (Reparameterization)
  • Sol 1: score function estimator
    • a.k.a. REINFORCE, or likelihood ratio estimator
    • unbiased, no restriction on \(p_\theta(x)\) and \(f(x)\)
    • but it has large variance: might give very different \(\theta\)
  • def: score = \(\nabla_\theta \log p_\theta(x)\)

\[\nabla_\theta E_{p_\theta}[f(x)] = \int \nabla_\theta p_\theta(x) f(x)dx\] \[=\int p_\theta(x) \frac {\nabla_\theta p_\theta(x)}{p_\theta(x)} f(x)dx\] \[=\int p_\theta(x) \nabla_\theta \log p_\theta(x) f(x)dx\] \[=E_{p_\theta}[\nabla_\theta \log p_\theta(x) f(x)]\]

Can be estimated with Monte Carlo !

  • For the ELBO: \[\nabla_\theta E_q [\log p(x,z) - \log q(z)] = \nabla_\theta E_q [f_\theta(z)]\] \[ = \int \nabla_\theta (q_\theta(z) f_\theta(z))dz\] \[ = \int \nabla_\theta q_\theta(z) f_\theta(z)dz + q_\theta(z) \nabla_\theta f_\theta(z)dz\] \[ = E_q [\nabla_\theta \log q_\theta(z)(\log p(x,z)-\log q_\theta(z))]\]

Can be estimated with Monte Carlo ! But pb: large variance

  • Sol 2: pathwise gradient estimator
    • instead of direct sampling \(\hat x \sim p_\theta(x)\)
    • indirect sampling: \(\hat x = t_\theta(\hat\epsilon)\) with \(\hat\epsilon \sim p(\epsilon)\)
    • \(p(\epsilon)\) = base distribution
    • \(t_\theta(\hat\epsilon)\) = sampling path (deterministic)
  • example: \(x=t_\theta(\epsilon)=\mu + \sigma\epsilon\) and \(N(\epsilon;0,1)\)
  • law of unconscious statistician: \[E_{p_\theta(x)}[f(x)] = E_{p(\epsilon)}[f(t_\theta(\epsilon))]\]

\[\nabla_\theta E_{p_\theta(x)} [f(x)] = \nabla_\theta E_{p(\epsilon)}[f(t_\theta(\epsilon))]\] \[\nabla_\theta E_{p_\theta(x)} [f(x)] = \int p(\epsilon)\nabla_\theta f(t_\theta(\epsilon))d\epsilon\] \[\nabla_\theta E_{p_\theta(x)} [f(x)] = \int p(\epsilon)\nabla_x f(x) \nabla_\theta t_\theta(\epsilon)d\epsilon\]

  • The loss must be differentiable!
  • pathwise estimator has better variance, but it requires
    • differentiable loss
    • rewriting the sampling of \(x\) with a deterministic function with all parameters

Bayesian DNN

So far:

  • Standard DNN= trained with MLE (point estimate)
  • L2-Regularized DNN= trained with MAP (Gaussian prior, point estimate)
  • What would be a better training procedure ?
  • posterior inference ! But very difficult.
    • old approx 1 = Laplace’s method
    • old approx 2 = MCMC (long convergence)
    • recent approx = BNN with variational inference
  • BNN = weights are Random Variable (stochastic networks)
  • tuto: cf https://arxiv.org/pdf/2007.06823.pdf
  • toolkit = pyro
  • we want the full predictive distribution

\[p(y|D,x) = \int p(y|w,x) p(w|D) dw\]

  • several ways to approximate it (with different names)

\[p(y|D,x) = \int p(y|w,x) p(w|D) dw\]

  • approximate predictive distribution with MCMC:
    • sample a few possible w
    • combine the resulting model
    • … == deep ensembles !
  • Monte Carlo Dropout
    • proposed by Gal & Ghahramani (2016)
    • dropout = averaging ensemble of many different networks
    • apply dropout also during testing
    • == approx Bayesian inference
  • Stochastic Weight Averaging
    • combine weights of the same network at different stages of training
    • SWA-Gaussian further approx the posterior distrib locally using only info from SGD
  • refs: https://medium.com/neuralspace/bayesian-neural-network-series-post-1-need-for-bayesian-networks-e209e66b70b2
  • https://www.cc.gatech.edu/~hic/8803-Fall-09/slides/8803-09-lec07.pdf for Laplace approximation that leads to Bayesian DNN

Practice: VAE

  • Goal = train and analyze a VAE on MNIST
    • first train an auto-encoder on MNIST: run this code to train the AE
import torch; torch.manual_seed(0)
import torch.nn as nn
import torch.nn.functional as F
import torch.utils
import torch.distributions
import torchvision
import numpy as np
import matplotlib.pyplot as plt; plt.rcParams['figure.dpi'] = 200
device = "cpu"

class Encoder(nn.Module):
    def __init__(self, latent_dims):
        super(Encoder, self).__init__()
        self.linear1 = nn.Linear(784, 512)
        self.linear2 = nn.Linear(512, latent_dims)

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        x = F.relu(self.linear1(x))
        return self.linear2(x)

class Decoder(nn.Module):
    def __init__(self, latent_dims):
        super(Decoder, self).__init__()
        self.linear1 = nn.Linear(latent_dims, 512)
        self.linear2 = nn.Linear(512, 784)

    def forward(self, z):
        z = F.relu(self.linear1(z))
        z = torch.sigmoid(self.linear2(z))
        return z.reshape((-1, 1, 28, 28))

class Autoencoder(nn.Module):
    def __init__(self, latent_dims):
        super(Autoencoder, self).__init__()
        self.encoder = Encoder(latent_dims)
        self.decoder = Decoder(latent_dims)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

def train(autoencoder, data, epochs=20):
    opt = torch.optim.Adam(autoencoder.parameters())
    for epoch in range(epochs):
        for x, y in data:
            x = x.to(device) # GPU
            opt.zero_grad()
            x_hat = autoencoder(x)
            loss = ((x - x_hat)**2).sum()
            print("train",epoch,loss.item())
            loss.backward()
            opt.step()
        if epoch%1==0:
            plot_latent(autoencoder, data)
            # plot_reconstructed(autoencoder)
    return autoencoder

latent_dims = 2
autoencoder = Autoencoder(latent_dims).to(device) # GPU

data = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST('./data',
               transform=torchvision.transforms.ToTensor(),
               download=True),
        batch_size=128,
        shuffle=True)

autoencoder = train(autoencoder, data)
  • Then plot the latent space with:
def plot_latent(autoencoder, data, num_batches=100):
    for i, (x, y) in enumerate(data):
        z = autoencoder.encoder(x.to(device))
        z = z.to('cpu').detach().numpy()
        plt.scatter(z[:, 0], z[:, 1], c=y, cmap='tab10')
        if i > num_batches:
            plt.colorbar()
            break
    plt.show()
  • Then plot the “reconstructed” space with:
def plot_reconstructed(autoencoder, r0=(-5, 10), r1=(-10, 5), n=12):
    w = 28
    img = np.zeros((n*w, n*w))
    for i, y in enumerate(np.linspace(*r1, n)):
        for j, x in enumerate(np.linspace(*r0, n)):
            z = torch.Tensor([[x, y]]).to(device)
            x_hat = autoencoder.decoder(z)
            x_hat = x_hat.reshape(28, 28).to('cpu').detach().numpy()
            img[(n-1-i)*w:(n-1-i+1)*w, j*w:(j+1)*w] = x_hat
    plt.imshow(img, extent=[*r0, *r1])
  • look at the areas not seen in the latent space
  • Then modify the AE into a VAE with the class at next slide
    • TODO = compute the KL term of the loss
    • then add this KL term into the loss in the train() function:
loss = ((x - x_hat)**2).sum() + autoencoder.encoder.kl
  • VAE class:
class VariationalEncoder(nn.Module):
    def __init__(self, latent_dims):
        super(VariationalEncoder, self).__init__()
        self.linear1 = nn.Linear(784, 512)
        self.linear2 = nn.Linear(512, latent_dims)
        self.linear3 = nn.Linear(512, latent_dims)

        self.N = torch.distributions.Normal(0, 1)
        self.N.loc = self.N.loc.cuda() # hack to get sampling on the GPU
        self.N.scale = self.N.scale.cuda()
        self.kl = 0

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        x = F.relu(self.linear1(x))
        mu =  self.linear2(x)
        sigma = torch.exp(self.linear3(x))
        z = mu + sigma*self.N.sample(mu.shape)
        self.kl = # TODO: compute and store the KL loss here
        return z
  • compare the size and shape of the latent spaces of AE and VAE
  • compare the reconstructions in areas not covered by AE

Practice: Pyro

  • Install:
pip install pyro-ppl
  • Doc: https://readthedocs.org/projects/pyro-ppl/downloads/pdf/dev/
  • slides: https://sshkhr.github.io/files/ProbProgtutorial.pdf

Sampling from a distribution:

import pyro.distributions as dist
from pyro import sample
from torch import tensor
coinflip = sample("coinflip", dist.Bernoulli(probs=0.5))
noisy_sample = sample("noisy_sample", dist.Normal(loc=0, scale=1))
  • creates named random variables

Computing log-likelihood:

print(dist.Normal(0, 1).log_prob(tensor(0.35)).exp()) # 0.3752

Implementing a generative model: \(P(sleeptime,tired) = P(sleeptime|tired)P(tired)\)

def genmod():
    amTired = sample("being_tired", dist.Bernouilli(0.7))
    if amTired: sleepTime = sample("sleep_time", dist.Normal(8,1))
    else: sleepTime = sample("sleep_time", dist.Normal(6,1))
    return sleepTime
  • How can you answer questions ?
    • joint proba of a sample: \(P(sleeptime=2,tired=1)\)
    • joint distrib: \(P(sleeptime,tired)\)
    • marginal proba of a sample: \(P(tired=1)\)
    • marginal distrib: \(P(tired)\)
  • Useful method 1: trace
    • collect all (var name, sampled value)
  • Useful method 2: condition
    • Set the value of a variable instead of sampling
  • Get the joint probability of sample (as a pytorch computation graph !):
from pyro.poutine import trace
from pprint import pprint
# trace() internally calls genmod()
tr = trace(genmod).get_trace()
pprint({
    name: {
        'value': props['value'],
        'prob': props['fn'].log_prob(props['value']).exp()
    }
    for (name, props) in tr.nodes.items()
    if props['type'] == 'sample'
})
print(tr.log_prob_sum().exp())
  • Get the marginal distribution over each variable:
import pandas as pd
import matplotlib.pyplot as plt

traces = []
for _ in range(1000):
    tr = trace(sleep_model).get_trace()
    values = {
        name: props['value'].item() for (name, props) in tr.nodes.items() if props['type'] == 'sample'
    }
    traces.append(values)
pd.DataFrame(traces).hist()
plt.show()
from pyro import condition

cond_model = condition(genmod, {
    "being_tired": tensor(1.),
    "sleep_time": tensor(10.)
})
trace(cond_model).get_trace().log_prob_sum().exp()
  • Conditional proba: what is the probability that I’m tired given that I slept 6.5 hours ?
  • Can be done with Bayes rule: \[P(tired=1|sleep=6.5) = \frac {P(t=1,s=6.5)}{\sum_t P(t,s=6.5)}\]
  • Pb1: with many variables, exponential growth of terms in the sum
  • Pb2: with continuous variables, integral instead of sum
  • approximate inference (cf. CS 228 of Stanford)
    • MCMC or variational
    • pyro focuses on VI
  • Toy model: Gaussian with 1 parameter (= pytorch parameter tensor) initialized at 0.3:
from pyro import param

mu = param("mu",tensor(0.3))
mod = dist.Normal(mu,1)

MLE training

  • Toy data: we observe 1 value \(x=5\)
  • Compute prob = likelihood for this value
  • SGD update to maximize prob:
prob = mod.log_prob(5)
prob.backward()
mu.data += mu.grad
  • Rewrite with our previous methods:
def model():
    mu = param("mu", tensor(0.))
    return sample("x", dist.Normal(mu, 1))

cond_model = condition(model, {"x": 5})
tr = trace(cond_model).get_trace()
tr.log_prob_sum().backward()
mu = param("mu")
mu.data += mu.grad
  • Note that each variable is uniquely identified by its name !
  • With pytorch optimizer:
from torch.optim import Adam

def model():
    mu = param("mu", tensor(0.))
    return sample("x", dist.Normal(mu, 1))

model() # Instantiate the mu parameter
cond_model = condition(model, {"x": tensor(5.)})
optimizer = Adam([param("mu")], lr=0.01)
mus, losses = [], []
for _ in range(100):
    tr = trace(cond_model).get_trace()
    # minimize negative log probability
    prob = -tr.log_prob_sum()
    prob.backward()
    optimizer.step()
    optimizer.zero_grad()
    losses.append(prob.item())
    mus.append(param("mu").item())
pd.DataFrame({"mu": mus, "loss": losses}).plot(subplots=True)
plt.show()

Posterior Inference

  • so far, we’ve done MLE
  • we need posterior inference to answer: what is the probability that I’m tired given that I slept 6.5 hours ?
  • we cannot simply set “sleeptime=6.5” and sample values of “tired”, because the generative model first sample “tired”, and then set “sleeptime”.
  • we would need to sample both variables, and filter sleeptime=6.5 afterwards: intractable !
  • approx. inference with VI
  • var. distrib \(q\) is called “guide” in pyro:
  • we may approx. our target posterior \(p(tired|sleeptime=6.5)\) with a dirac at 1:
def sleep_guide():
  sample("being_tired", dist.Delta(1.))
  • we can compute ELBO (KL-div btw variational and posterior dist.) given a guide: \[E_{t\sim q} [\log P(t,s=6.5) - \log P_q(t)]\]
def elbo(guide, cond_model):
  dist = 0.
  # the expected value is a sum over possible values of "tired"
  for t in [0., 1.]:
      # we define the fct to compute the proba of a given t from a given model
      log_prob = lambda f: trace(condition(f, {"being_tired": tensor(t)})).get_trace().log_prob_sum()
      # the marginal from the guide
      guide_prob = log_prob(guide)
      # the joint from the model
      cond_model_prob = log_prob(cond_model)
      term = guide_prob.exp() * (cond_model_prob - guide_prob)
      if not torch.isnan(term): dist += term
  return dist

underslept = condition(genmod, {"sleep_time": 6.5})
elbo(sleep_guide, underslept)
  • we can now make the ELBO smaller:
pyro.clear_param_store()

def sleep_guide():
    # Constraints ensure facts always remain true during optimization,
    # e.g. that the parameter of a Bernoulli is always between 0 and 1
    valid_prob = constraints.interval(0., 1.)
    t = param('qt', tensor(0.8), constraint=valid_prob)
    sample('being_tired', dist.Bernoulli(t))

# Run once to register the param calls with Pyro
sleep_guide()

adam = Adam([param('qt').unconstrained()], lr=0.005)
param_vals = []
for _ in range(2000):
    # We can use our elbo function from earlier and compute its gradient
    loss = -elbo(sleep_guide, underslept)
    loss.backward()
    adam.step()
    adam.zero_grad()
    param_vals.append({k: param(k).item() for k in ['qt']})
pd.DataFrame(param_vals).plot(subplots=True)
  • pb: if we use a continuous variable, ELBO requires to compute an integral: intractable
  • we can approximate this integral by sampling several values, or just one !
from pyro.poutine import replay

def elbo_approx(guide, cond_model):
    # sample one value from the guide to approximate the sum
    guide_trace = trace(guide).get_trace()
    model_trace = trace(replay(cond_model, guide_trace)).get_trace()
    return model_trace.log_prob_sum() - guide_trace.log_prob_sum()
  • replay() == condition() but with a trace
  • tip: when VI does not work, check the gradient !
  • when gradient unstable, you may use surrogate objective:
def elbo_better_approx(guide, cond_model):
    guide_trace = trace(guide).get_trace()
    model_trace = trace(replay(cond_model, guide_trace)).get_trace()
    elbo = model_trace.log_prob_sum() - guide_trace.log_prob_sum()
    # "detach" means "don't compute gradients through this expression"
    return guide_trace.log_prob_sum() * elbo.detach() + elbo
  • A simpler version of all this:
from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO

adam = Adam({"lr": 0.005})
svi = SVI(underslept, sleep_guide, adam, loss=Trace_ELBO())
param_vals = []
for _ in range(2000):
    svi.step()
    param_vals.append({k: param(k).item() for k in ["qt"]})
  • Exercices:
    • proba of being tired given that I slept 6.5 hours ?
    • compare with “slept 9.5 hours” ? does it follow intuition ?
  • Ex. to manipulate the methods:
    • compute ELBO with other guides and check the values are ranked according to intuition
    • plot the gradient of being_tired during training
  • Another example: do the tuto with ‘fair coins’ here
  • designing guides:
    • flexible enough to approx. the distrib well
    • with continuous variables for gradient descent

ref: https://willcrichton.net/notes/probabilistic-programming-under-the-hood/

Pyro and DNN

  • Pyro HiddenLayer implements a linear NN layer + non-linear activation
    • with Normal prior in the weights
    • with local reparameterization trick to speed up inference
      • instead of sampling all NxN weights (W), we sample the N outputs of each layer (W.X)

MNIST in pyro: https://github.com/paraschopra/bayesian-neural-network-mnist