Self-attention: practice

  • Create 3 random PyTorch tensors of dimension \(n\): they represent your input sequence of length 3: \(X_1, X_2, X_3\)
  • Create 3 random matrices for queries, keys and values
  • Compute, in one operation, all the key tensors from all input tensors.
  • Do the same for values and queries.
  • Compute the dot products between the queries \(Q_1, Q_2, Q_3\) and all keys \(K_1, K_2, K_3\)
  • You should get a 3×3 matrix of scalar attention scores
  • Normalize these attention scores with a softmax
  • Compute the sum of the value tensors, weighted by their corresponding attention scores
  • You obtain the output sequence of vectors.
  • Goal: perform classification after attention
  • Compress this variable-length final sequence into a single fixed-size vector with max-pooling
  • Add a final classification layer to predict 2 classes, and put all of this inside a PyTorch nn.Module model (a possible sketch follows this list)
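
A possible PyTorch sketch of these steps is given below; the dimension n = 4, the variable names (X, Wq, Wk, Wv, scores, weights) and the class name SelfAttentionClassifier are illustrative assumptions, not imposed by the exercise.

    import torch
    import torch.nn as nn

    n = 4                                  # dimension of each input vector (assumed)
    X = torch.randn(3, n)                  # input sequence X1, X2, X3 stacked as rows

    # Random projection matrices for queries, keys and values
    Wq = torch.randn(n, n)
    Wk = torch.randn(n, n)
    Wv = torch.randn(n, n)

    # One matrix product each gives all queries, keys and values at once
    Q = X @ Wq                             # (3, n)
    K = X @ Wk                             # (3, n)
    V = X @ Wv                             # (3, n)

    # Dot products between every query and every key: a 3x3 matrix of scalar scores
    scores = Q @ K.T                       # (3, 3)

    # Normalize each row of scores with a softmax
    weights = torch.softmax(scores, dim=-1)

    # Weighted sum of the value tensors: the output sequence of vectors
    out = weights @ V                      # (3, n)

    class SelfAttentionClassifier(nn.Module):
        """Self-attention, max-pooling over the sequence, then a 2-class linear head."""
        def __init__(self, dim, n_classes=2):
            super().__init__()
            self.q = nn.Linear(dim, dim, bias=False)
            self.k = nn.Linear(dim, dim, bias=False)
            self.v = nn.Linear(dim, dim, bias=False)
            self.head = nn.Linear(dim, n_classes)

        def forward(self, x):              # x: (batch, seq_len, dim)
            q, k, v = self.q(x), self.k(x), self.v(x)
            weights = torch.softmax(q @ k.transpose(-2, -1), dim=-1)
            out = weights @ v              # (batch, seq_len, dim)
            pooled, _ = out.max(dim=1)     # max-pooling over the sequence dimension
            return self.head(pooled)       # (batch, n_classes) logits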

Limitations

  • Generate random scalar sequences (see the data-generation sketch after this list):
    • class A: the sequence is composed of observations uniformly sampled between 0 and 1
    • class B: the sequence is composed of observations uniformly sampled between 1 and 2
  • Train your self-attentive classifier
  • Analyze
  • Generate random sequences:
    • class A: the first half of the sequence is composed of observations uniformly sampled between 0 and 1, and the second half between 1 and 2
    • class B: the first half between 1 and 2, and the second half between 0 and 1
  • Train your self-attentive classifier
  • Analyze
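
A possible sketch of the two toy datasets and of a training loop is given below; the sequence length, batch size, learning rate, number of steps and the label convention (0 for class A, 1 for class B) are illustrative assumptions, and it reuses the SelfAttentionClassifier class sketched above.

    import torch
    import torch.nn as nn

    def make_batch(batch_size=32, seq_len=10, halves=False):
        """Return (x, y) with x of shape (batch, seq_len, 1) and binary labels y."""
        y = torch.randint(0, 2, (batch_size,))
        if not halves:
            # First experiment: class A ~ U(0, 1), class B ~ U(1, 2) over the whole sequence
            offset = y.float().view(-1, 1, 1)                 # 0 for class A, 1 for class B
            x = offset + torch.rand(batch_size, seq_len, 1)
        else:
            # Second experiment: the two halves of the sequence use opposite ranges
            half = seq_len // 2
            first = y.float().view(-1, 1, 1)                  # A: U(0,1), B: U(1,2)
            second = 1.0 - first                              # A: U(1,2), B: U(0,1)
            x = torch.cat([first + torch.rand(batch_size, half, 1),
                           second + torch.rand(batch_size, seq_len - half, 1)], dim=1)
        return x, y

    model = SelfAttentionClassifier(dim=1)                    # scalar observations
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(200):
        x, y = make_batch(halves=False)                       # set halves=True for the second experiment
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()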