by giving positional encodings the same dimension as word
embeddings, we can sum them together
most positional information is encoded in the first dimensions, so
summing them with word embeddings enables the model to leave the first
dimensions free of semantics and dedicate them to positions.
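To make the addition concrete, here is a minimal sketch of sinusoidal positional encodings summed with word embeddings, in the spirit of the original Transformer; the dimensions and the toy vocabulary below are illustrative only.

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # (d_model/2,)
    angles = pos / (10000 ** (i / d_model))                       # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

d_model, vocab_size, seq_len = 512, 1000, 20                      # illustrative sizes
embed = torch.nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (seq_len,))
# Same dimension for both, so a simple element-wise sum works:
x = embed(tokens) + sinusoidal_positional_encoding(seq_len, d_model)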
RLHF / DPO = finetuned LLM aligned with
human values
pretrained LLM:
has all the knowledge inside
is not able to chat, answer questions…
instruction finetuned LLM:
no more knowledge than before, but can now interact with humans
aligned LLM:
does not say “bad” things
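As an illustration of the alignment step, below is a minimal sketch of the DPO loss computed from per-answer log-probabilities; the variable names and the beta value are illustrative assumptions, not taken from the course material.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO: push the policy to prefer the human-chosen answer over the
    # rejected one, relative to a frozen reference (SFT) model.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref (preferred)
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref (dispreferred)
    # -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up summed log-probabilities of full answers:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))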
Zero-Shot learning
You may query LLM without further training:
“Is this review positive or negative? Review: this
is the best cast iron skillet you will ever buy”
“Positive”
“A is the son of B’s uncle. What is the family
relationship between A and B?”
“cousins”
“A is the son of B’s uncle. What is B for A?”
“brother”
“On a shelf, there are five books: a gray book, a
red book, a purple book, a blue book, and a black book. The red book is
to the right of the gray book. The black book is to the left of the blue
book. The blue book is to the left of the gray book. The purple book is
the second from the right. Which book is the leftmost book?”
“The black book”
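A zero-shot query of this kind can be run, for instance, with the Hugging Face transformers pipeline; the model name below is only a small stand-in, and the actual completion depends on the model used.

from transformers import pipeline

# Any capable causal LM works here; "gpt2" is only a small stand-in.
generator = pipeline("text-generation", model="gpt2")

prompt = ("Is this review positive or negative? "
          "Review: this is the best cast iron skillet you will ever buy\n"
          "Answer:")
out = generator(prompt, max_new_tokens=3, do_sample=False)
print(out[0]["generated_text"])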
Prompt programming
is the art of designing prompts to perform a task
Prompting may be viewed as a way to constrain the generation
You may describe the task
You may give examples (few-shot)
You may give an imaginary context to “style” the result
How to describe the task:
direct task description
proxy task description
Direct task description:
“translate French to English”
Can be contextual:
“French: … English: …”
Direct description can combine tasks the model must know:
“rephrase this paragraph so that a 2nd grader can understand it,
emphasizing real-world applications”
Proxy task description
This is a novel written in the style of J.R.R. Tolkien’s Lord of the
Rings fantasy novel trilogy. It is a parody of the following
passage:
“S. Jane Morland was born in Shoreditch …”
Tolkien rewrote the previous passage in a high-fantasy style, keeping
the same meaning but making it sound like he wrote it as a fantasy; his
parody follows:
Few-shot prompts:
English: Writing about language models is fun.
Roish: Writingro aboutro languagero modelsro isro funro.
English: The weather is lovely!
Roish:
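One possible way to assemble such a few-shot prompt programmatically; the template and variable names are just one illustrative choice.

# Build a few-shot prompt from (input, output) example pairs.
examples = [
    ("Writing about language models is fun.",
     "Writingro aboutro languagero modelsro isro funro."),
]
query = "The weather is lovely!"

prompt = ""
for english, roish in examples:
    prompt += f"English: {english}\nRoish: {roish}\n"
prompt += f"English: {query}\nRoish:"
print(prompt)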
import math
import numpy as np
import torch

def f(x, offset):
    return 0.3 * math.sin(0.1 * x + offset) + 0.5

nex = 100      # number of examples
nsteps = 50    # time steps per curve

input_seqs = []
target_seqs = []
for ex in range(nex):
    offset = np.random.rand()
    input_seq = [f(x, offset) for x in range(nsteps)]
    cl = np.random.randint(2)          # class label: 0 or 1
    target_seqs.append(cl)
    if cl == 0: perturb = 0.05
    else: perturb = -0.05
    pos = np.random.randint(25, 45)    # random position of the perturbation
    for t in range(pos, pos + 5): input_seq[t] += perturb
    input_seqs.append(input_seq)

input_seq = torch.Tensor(input_seqs)
input_seq = input_seq.view(nex, nsteps, 1)
target_seq = torch.LongTensor(target_seqs)
Train an attentive RNN (a possible model and training-loop sketch is given after the questions below):
Use 10000 epochs, LR=0.0001 and the RMSprop optimizer
Use CrossEntropyLoss() instead of MSELoss() to learn the two classes
Does it learn to predict the two classes correctly? Is learning
stable?
After training, plot both the input curve and the attention weights
for the first 5 curves: does attention correctly spot the
perturbation?
Try without the offset: what happens? Does attention spot the
perturbation? Explain.
Try to find better hyper-parameters so that convergence is
faster.
Modify the training loop so that random curves are generated
directly inside the training loop: there are no longer epochs,
only an infinite sequence of random batches: what happens?
Try with longer vs. shorter and smaller vs. bigger perturbations: in
which cases does it work, and in which does it not? How sensitive is
the approach to the perturbation?
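A possible sketch of an attentive RNN and its training loop for this exercise is given below; the exact architecture used in the course may differ, so treat this as one plausible choice (a GRU encoder with additive attention over time steps), reusing input_seq and target_seq from the data-generation code above.

import torch
import torch.nn as nn

class AttentiveRNN(nn.Module):
    # GRU encoder, additive attention over time steps, linear layer to 2 classes.
    def __init__(self, hidden_size=32):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.att = nn.Linear(hidden_size, 1)        # one attention score per time step
        self.out = nn.Linear(hidden_size, 2)

    def forward(self, x):                           # x: (batch, nsteps, 1)
        h, _ = self.rnn(x)                          # (batch, nsteps, hidden)
        scores = self.att(h).squeeze(-1)            # (batch, nsteps)
        alpha = torch.softmax(scores, dim=1)        # attention weights over time
        context = (alpha.unsqueeze(-1) * h).sum(1)  # (batch, hidden)
        return self.out(context), alpha

model = AttentiveRNN()
opt = torch.optim.RMSprop(model.parameters(), lr=0.0001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10000):
    opt.zero_grad()
    logits, alpha = model(input_seq)
    loss = loss_fn(logits, target_seq)
    loss.backward()
    opt.step()
    if epoch % 1000 == 0:
        acc = (logits.argmax(dim=1) == target_seq).float().mean().item()
        print(f"epoch {epoch}  loss {loss.item():.4f}  acc {acc:.2f}")

After training, the returned alpha can be plotted alongside each input curve to check where the model attends.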