- Create 3 random PyTorch tensors of dimension n: they represent
your input sequence of length 3: \(X_1,X_2,X_3\)
- Create 3 random matrices for queries, keys and values
- Compute, in one operation, all the key tensors from all input
tensors.
- Do the same for values and queries.
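The steps above can be sketched as follows; the dimension `n = 4` and the seed are arbitrary assumptions:

```python
import torch

torch.manual_seed(0)
n = 4  # arbitrary choice for the tensor dimension

# Input sequence of length 3, stacked into a (3, n) matrix:
# rows are X1, X2, X3
X = torch.stack([torch.randn(n) for _ in range(3)])

# Random projection matrices for queries, keys and values
Wq, Wk, Wv = (torch.randn(n, n) for _ in range(3))

# A single matrix multiplication computes the projection
# for all input tensors at once
Q = X @ Wq  # (3, n) queries
K = X @ Wk  # (3, n) keys
V = X @ Wv  # (3, n) values
print(Q.shape, K.shape, V.shape)
```

Stacking the inputs into one matrix is what makes the "one operation" requirement possible: each row of `K` is the key of the corresponding input row.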

- Compute the dot-product between each
*query* \(Q_1,Q_2,Q_3\) and every
*key* \(K_1,K_2,K_3\)
- You should get a 3x3 matrix of (scalar) attention scores
- Normalize these attention scores with a softmax
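A minimal sketch of this step, assuming `n = 4` and random `Q`, `K` standing in for the projections computed previously:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4
Q = torch.randn(3, n)  # stand-in queries
K = torch.randn(3, n)  # stand-in keys

# Dot-product of every query with every key: a (3, 3) score matrix,
# scores[i, j] = <Q_i, K_j>
scores = Q @ K.T

# Softmax over the key axis, so each row of weights sums to 1
attn = F.softmax(scores, dim=-1)
print(attn.shape)        # (3, 3)
print(attn.sum(dim=-1))  # each row sums to 1
```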

- For each position, compute the sum of the
*value* tensors, weighted by
their corresponding attention scores
- You obtain the output sequence of vectors.
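The weighted sum is again a single matrix multiplication; here random weights and values stand in for the previous steps (`n = 4` is an assumption):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4
attn = F.softmax(torch.randn(3, 3), dim=-1)  # (3, 3) normalized scores
V = torch.randn(3, n)                        # (3, n) value tensors

# Each output vector is the attention-weighted sum of all values:
# out[i] = sum_j attn[i, j] * V[j]
out = attn @ V
print(out.shape)  # (3, n): the output sequence of vectors
```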

- Goal: perform classification after attention
- Compress this variable-length final sequence into a single
fixed-size vector with max-pooling
- Add a final classification layer to predict 2 classes, and put all
of this inside a PyTorch `nn.Module` model

### Limitations

- Generate random scalar sequences:
- class A: the sequence is composed of observations uniformly sampled
between 0 and 1
- class B: the sequence is composed of observations uniformly sampled
between 1 and 2
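A minimal sketch of such a generator; the batch size and sequence length are arbitrary assumptions:

```python
import torch

def make_batch(batch_size=32, seq_len=10):
    """Half the batch from class A (uniform in [0, 1)),
    half from class B (uniform in [1, 2))."""
    half = batch_size // 2
    xa = torch.rand(half, seq_len, 1)        # class A
    xb = torch.rand(half, seq_len, 1) + 1.0  # class B
    x = torch.cat([xa, xb])
    y = torch.cat([torch.zeros(half, dtype=torch.long),
                   torch.ones(half, dtype=torch.long)])
    return x, y

x, y = make_batch()
print(x.shape, y.shape)  # (32, 10, 1) and (32,)
```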

- Train your self-attentive classifier
- Analyze
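An end-to-end training sketch: the model, data generator, optimizer choice (Adam), learning rate, and step count are all illustrative assumptions, restated compactly so the snippet runs on its own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveClassifier(nn.Module):
    def __init__(self, n, n_classes=2):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(n, n, bias=False)
                                  for _ in range(3))
        self.out = nn.Linear(n, n_classes)

    def forward(self, x):                        # x: (batch, seq_len, n)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        h = F.softmax(Q @ K.transpose(-2, -1), dim=-1) @ V
        return self.out(h.max(dim=1).values)     # max-pool, then classify

def make_batch(batch_size=32, seq_len=10):
    half = batch_size // 2
    x = torch.cat([torch.rand(half, seq_len, 1),         # class A: [0, 1)
                   torch.rand(half, seq_len, 1) + 1.0])  # class B: [1, 2)
    y = torch.cat([torch.zeros(half, dtype=torch.long),
                   torch.ones(half, dtype=torch.long)])
    return x, y

torch.manual_seed(0)
model = SelfAttentiveClassifier(n=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(200):
    x, y = make_batch()
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

x, y = make_batch()
acc = (model(x).argmax(dim=-1) == y).float().mean()
print(f"accuracy: {acc:.2f}")  # the two classes differ in mean, so this
                               # task should be learned easily
```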

- Generate random sequences:
- class A: the first half of the sequence is composed of observations
uniformly sampled between 0 and 1, and the second half between 1 and
2
- class B: the first half between 1 and 2, and the second half between
0 and 1
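A sketch of this position-dependent generator; batch size and sequence length are again arbitrary assumptions:

```python
import torch

def make_positional_batch(batch_size=32, seq_len=10):
    """Class A: [0, 1) then [1, 2); class B: [1, 2) then [0, 1)."""
    half, mid = batch_size // 2, seq_len // 2
    lo = lambda m, t: torch.rand(m, t, 1)        # uniform in [0, 1)
    hi = lambda m, t: torch.rand(m, t, 1) + 1.0  # uniform in [1, 2)
    xa = torch.cat([lo(half, mid), hi(half, seq_len - mid)], dim=1)
    xb = torch.cat([hi(half, mid), lo(half, seq_len - mid)], dim=1)
    x = torch.cat([xa, xb])
    y = torch.cat([torch.zeros(half, dtype=torch.long),
                   torch.ones(half, dtype=torch.long)])
    return x, y

x, y = make_positional_batch()
print(x.shape, y.shape)  # (32, 10, 1) and (32,)
```

Note that the two classes now differ only in *where* the high and low values appear, not in their overall distribution of values.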

- Train your self-attentive classifier
- Analyze: compare with the previous experiment — can the model still
distinguish the two classes, and why?