- Create 3 random PyTorch tensors of dimension n: they represent
your input sequence of length 3: \(X_1,X_2,X_3\)
- Create 3 random matrices for queries, keys and values
- Compute, in one operation, all the key tensors from all input
tensors.
- Do the same for values and queries.
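The steps above can be sketched as follows; the dimension `n = 4` and the seed are arbitrary assumptions:

```python
import torch

torch.manual_seed(0)
n = 4  # arbitrary choice for the tensor dimension

# Input sequence of length 3, stacked into a (3, n) matrix:
# rows are X1, X2, X3
X = torch.stack([torch.randn(n) for _ in range(3)])

# Random projection matrices for queries, keys and values
Wq, Wk, Wv = (torch.randn(n, n) for _ in range(3))

# A single matrix multiplication computes the projection
# for all input tensors at once
Q = X @ Wq  # (3, n) queries
K = X @ Wk  # (3, n) keys
V = X @ Wv  # (3, n) values
print(Q.shape, K.shape, V.shape)
```

Stacking the inputs into one matrix is what makes the "one operation" requirement possible: each row of `K` is the key of the corresponding input row.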

- Compute the dot-product between each
*query* \(Q_1,Q_2,Q_3\) and every
*key* \(K_1,K_2,K_3\)
- You should get a 3x3 matrix of (scalar) attention scores
- Normalize these attention scores with a softmax
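A minimal sketch of this step, assuming `n = 4` and random `Q`, `K` standing in for the projections computed previously:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4
Q = torch.randn(3, n)  # stand-in queries
K = torch.randn(3, n)  # stand-in keys

# Dot-product of every query with every key: a (3, 3) score matrix,
# scores[i, j] = <Q_i, K_j>
scores = Q @ K.T

# Softmax over the key axis, so each row of weights sums to 1
attn = F.softmax(scores, dim=-1)
print(attn.shape)        # (3, 3)
print(attn.sum(dim=-1))  # each row sums to 1
```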

- For each position, compute the sum of the
*value* tensors, weighted by
their corresponding attention scores
- You obtain the output sequence of vectors.
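The weighted sum is again a single matrix multiplication; here random weights and values stand in for the previous steps (`n = 4` is an assumption):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4
attn = F.softmax(torch.randn(3, 3), dim=-1)  # (3, 3) normalized scores
V = torch.randn(3, n)                        # (3, n) value tensors

# Each output vector is the attention-weighted sum of all values:
# out[i] = sum_j attn[i, j] * V[j]
out = attn @ V
print(out.shape)  # (3, n): the output sequence of vectors
```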

- Goal: perform classification after attention
- Compress this variable-length final sequence into a single
fixed-size vector with max-pooling
- Add a final classification layer to predict 2 classes, and put all
of this inside a PyTorch `nn.Module` model

### Limitations

- Generate random scalar sequences:
- class A: the sequence is composed of observations uniformly sampled
between 0 and 1
- class B: the sequence is composed of observations uniformly sampled
between 1 and 2
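A minimal sketch of such a generator; the batch size and sequence length are arbitrary assumptions:

```python
import torch

def make_batch(batch_size=32, seq_len=10):
    """Half the batch from class A (uniform in [0, 1)),
    half from class B (uniform in [1, 2))."""
    half = batch_size // 2
    xa = torch.rand(half, seq_len, 1)        # class A
    xb = torch.rand(half, seq_len, 1) + 1.0  # class B
    x = torch.cat([xa, xb])
    y = torch.cat([torch.zeros(half, dtype=torch.long),
                   torch.ones(half, dtype=torch.long)])
    return x, y

x, y = make_batch()
print(x.shape, y.shape)  # (32, 10, 1) and (32,)
```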

- Train your self-attentive classifier
- Analyze
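An end-to-end training sketch: the model, data generator, optimizer choice (Adam), learning rate, and step count are all illustrative assumptions, restated compactly so the snippet runs on its own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveClassifier(nn.Module):
    def __init__(self, n, n_classes=2):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(n, n, bias=False)
                                  for _ in range(3))
        self.out = nn.Linear(n, n_classes)

    def forward(self, x):                        # x: (batch, seq_len, n)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        h = F.softmax(Q @ K.transpose(-2, -1), dim=-1) @ V
        return self.out(h.max(dim=1).values)     # max-pool, then classify

def make_batch(batch_size=32, seq_len=10):
    half = batch_size // 2
    x = torch.cat([torch.rand(half, seq_len, 1),         # class A: [0, 1)
                   torch.rand(half, seq_len, 1) + 1.0])  # class B: [1, 2)
    y = torch.cat([torch.zeros(half, dtype=torch.long),
                   torch.ones(half, dtype=torch.long)])
    return x, y

torch.manual_seed(0)
model = SelfAttentiveClassifier(n=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(200):
    x, y = make_batch()
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

x, y = make_batch()
acc = (model(x).argmax(dim=-1) == y).float().mean()
print(f"accuracy: {acc:.2f}")  # the two classes differ in mean, so this
                               # task should be learned easily
```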

- Generate random sequences:
- class A: the first half of the sequence is composed of observations
uniformly sampled between 0 and 1, and the second half between 1 and
2
- class B: the first half between 1 and 2, and the second half between
0 and 1
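A sketch of this position-dependent generator; batch size and sequence length are again arbitrary assumptions:

```python
import torch

def make_positional_batch(batch_size=32, seq_len=10):
    """Class A: [0, 1) then [1, 2); class B: [1, 2) then [0, 1)."""
    half, mid = batch_size // 2, seq_len // 2
    lo = lambda m, t: torch.rand(m, t, 1)        # uniform in [0, 1)
    hi = lambda m, t: torch.rand(m, t, 1) + 1.0  # uniform in [1, 2)
    xa = torch.cat([lo(half, mid), hi(half, seq_len - mid)], dim=1)
    xb = torch.cat([hi(half, mid), lo(half, seq_len - mid)], dim=1)
    x = torch.cat([xa, xb])
    y = torch.cat([torch.zeros(half, dtype=torch.long),
                   torch.ones(half, dtype=torch.long)])
    return x, y

x, y = make_positional_batch()
print(x.shape, y.shape)  # (32, 10, 1) and (32,)
```

Note that the two classes now differ only in *where* the high and low values appear, not in their overall distribution of values.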

- Train your self-attentive classifier
- Analyze: compare with the previous experiment — can the model still
distinguish the two classes, and why?