\[p_t^{(i)} = \begin{cases} \sin(w_k \cdot t) & \text{if } i=2k \\ \cos(w_k \cdot t) & \text{if } i=2k+1 \end{cases}\]
with \(d\) the encoding dimension and
\[w_k = \frac 1 {10000^{2k/d}}\]
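A minimal NumPy sketch of these two formulas (assuming an even encoding dimension \(d\); the function name is mine):

import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Return the (seq_len, d) matrix whose row t is the encoding p_t."""
    pe = np.zeros((seq_len, d))
    t = np.arange(seq_len)[:, None]                 # positions t
    w = 1.0 / 10000 ** (np.arange(0, d, 2) / d)     # w_k = 1 / 10000^(2k/d)
    pe[:, 0::2] = np.sin(t * w)                     # even dims i=2k: sin(w_k t)
    pe[:, 1::2] = np.cos(t * w)                     # odd dims i=2k+1: cos(w_k t)
    return pe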
Update Oct 2024: Transformers Learn Higher-Order Optimization Methods for In-Context Learning: they learn in-context learning algorithms that converge exponentially faster than SGD, apparently similar to Iterative Newton's method
BloomChat (Together.AI 2023)
Microsoft study (Nov. 2023)
Objectives:
Analyzing an LLM
Baidu paper 2017
OpenAI 2020
Chinchilla paper 2022
\(L=\) pretraining loss
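For reference, Chinchilla fits this pretraining loss with a parametric form (the constants below are the paper's fitted values, quoted from memory, so treat them as approximate):

\[L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}\]

with \(N\) the number of parameters and \(D\) the number of training tokens; roughly \(E \approx 1.69\), \(A \approx 406.4\), \(B \approx 410.7\), \(\alpha \approx 0.34\), \(\beta \approx 0.28\). Minimizing this under a fixed compute budget \(C \approx 6ND\) yields the well-known rule of thumb of about 20 training tokens per parameter.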
Google 2022 (paper1, paper2): scaling plots relating FLOPs, upstream (pretraining) performance, downstream accuracy (on 17 tasks), and number of parameters
GPT3 paper 2020
"eat" becomes "ate"
"draw" becomes "drew"
"vote" becomes
Anthropic paper 2022
Jason Wei has catalogued 137 emergent abilities:
During training, LLMs may abruptly reorganize their latent representation space
Grokking exhibits a structured latent space
First proof that overfitting may be addressed by increasing the number of parameters!
(from huggingface)
\[h_t = Ah_{t-1} + Bx_t\] \[y_t = Ch_t + D x_t\]
Ex: Mamba
(from PMC24)
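A minimal NumPy sketch of this recurrence (the matrices are random placeholders; real SSMs parameterize and discretize \(A\) carefully):

import numpy as np

d_state, d_in = 4, 1
A = 0.1 * np.random.randn(d_state, d_state)   # state transition
B = np.random.randn(d_state, d_in)            # input -> state
C = np.random.randn(d_in, d_state)            # state -> output
D = np.random.randn(d_in, d_in)               # direct skip connection

h = np.zeros((d_state, 1))
for x_t in np.random.randn(10, d_in, 1):      # a length-10 input sequence
    h = A @ h + B @ x_t                       # h_t = A h_{t-1} + B x_t
    y_t = C @ h + D @ x_t                     # y_t = C h_t + D x_t

Mamba's key change is to make \(B\), \(C\) and the discretization step depend on the input (the "selective" part) while keeping the recurrence linear in \(h\), so it can still be computed efficiently as a scan.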
Principle:
Which tasks for training?
Multilingual models:
Key ingredients to success:
Data parallelism, model sharding, tensor parallelism, sequence parallelism, pipeline parallelism…
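As a toy illustration of tensor parallelism alone (single process; the two "devices" are just tensor slices, whereas a real implementation would shard over torch.distributed):

import torch

x = torch.randn(8, 512)                    # activations for a batch of 8
W = torch.randn(512, 1024)                 # full weight matrix of a linear layer
W0, W1 = W.chunk(2, dim=1)                 # column-split across two "devices"
y0, y1 = x @ W0, x @ W1                    # each device computes its output slice
y = torch.cat([y0, y1], dim=1)             # concatenation plays the role of the all-gather
assert torch.allclose(y, x @ W, atol=1e-5)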
Objectives:
Finetuning | Continual pretraining |
---|---|
Adapt to domain/lang/task | Acquire new knowledge |
LLM loses other capacities | Capture language drift |
 | Stays generic and adaptable |
Finetuning | Continual pretraining |
---|---|
overfitting | overfitting |
catastrophic forgetting | overcoming reduced learnability |
 | cost |
- Vanilla prompting
- Chain-of-thought (CoT)
- Self-consistency
- Ensemble refinement
- Automatic chain-of-thought (Auto-CoT)
- Complex CoT
- Program-of-thoughts (PoT)
- Least-to-Most
- Chain-of-Symbols (CoS)
- Structured Chain-of-Thought (SCoT)
- Plan-and-solve (PS)
- MathPrompter
- Contrastive CoT/Contrastive self-consistency
- Federated Same/Different Parameter self-consistency/CoT
- Analogical reasoning
- Synthetic prompting
- Tree-of-Thoughts (ToT)
- Logical Thoughts (LoT)
- Maieutic Prompting
- Verify-and-edit
- Reason + Act (ReACT)
- Active-Prompt
- Thread-of-thought (ThOT)
- Implicit RAG
- System 2 Attention (S2A)
- Instructed prompting
- Chain-of-Verification (CoVe)
- Chain-of-Knowledge (CoK)
- Chain-of-Code (CoC)
- Program-Aided Language Models (PAL)
- Binder
- Dater
- Chain-of-Table
- Decomposed Prompting (DeComp)
- Three-Hop reasoning (THOR)
- Metacognitive Prompting (MP)
- Chain-of-Event (CoE)
- Basic with Term definitions
- Basic + annotation guideline + error-analysis
<OBJECTIVE_AND_PERSONA>
You are a [insert a persona, such as a "math teacher" or "automotive expert"]. Your task is to...
</OBJECTIVE_AND_PERSONA>
<INSTRUCTIONS>
To complete the task, you need to follow these steps:
1.
2.
...
</INSTRUCTIONS>
------------- Optional Components ------------
<CONSTRAINTS>
Dos and don'ts for the following aspects
1. Dos
2. Don'ts
</CONSTRAINTS>
<CONTEXT>
The provided context
</CONTEXT>
<OUTPUT_FORMAT>
The output format must be
1.
2.
...
</OUTPUT_FORMAT>
<FEW_SHOT_EXAMPLES>
Here we provide some examples:
1. Example #1
Input:
Thoughts:
Output:
...
</FEW_SHOT_EXAMPLES>
<RECAP>
Re-emphasize the key aspects of the prompt, especially the constraints, output format, etc.
</RECAP>
TASK:
Classify the OBJECTS.
CLASSES:
- Large
- Small
OBJECTS:
- Rhino
- Mouse
- Snail
- Elephant
What is the most likely interpretation of this sentence? Explain your reasoning. The sentence: “The chef seasoned the chicken and put it in the oven because it looked pale.”
Extract the main issues and sentiments from the customer feedback on our telecom services.
Focus on comments related to service disruptions, billing issues, and customer support interactions.
Please format the output as a list with each issue/sentiment in one sentence, separated by semicolons.
Input: CUSTOMER_FEEDBACK
Classify the extracted issues into categories such as service reliability, pricing concerns, customer support quality, and others.
Please organize the output into JSON format with each issue as the key, and category as the value.
Input: TASK_1_RESPONSE
Generate detailed recommendations for each category of issues identified from the feedback.
Suggest specific actions to address service reliability, improving customer support, and adjusting pricing models, if necessary.
Please organize the output into a JSON format with each category as the key, and recommendation as the value.
Input: TASK_2_RESPONSE
CoT requires large models:
Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
Thought 1: I need to search Colorado orogeny, find the area that the eastern sector of the Colorado orogeny extends into, then find the elevation range of the area.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building (an orogeny) in Colorado and surrounding areas.
Thought 2: It does not mention the eastern sector. So I need to look up eastern sector.
Action 2: Lookup[eastern sector]
Observation 2: (Result 1 / 1) The eastern sector extends into the High Plains and is called the Central Plains orogeny.
Thought 3: The eastern sector of Colorado orogeny extends into the High Plains. So I need to search High Plains and find its elevation range.
Aside from the Apple Remote, what other devices can control the program Apple Remote was originally designed to interact with?
import dspy
# connect to a local Ollama server through LiteLLM
lm = dspy.LM(model="ollama/qwen2.5", api_base="http://localhost:11434")
dspy.configure(lm=lm)
# declare a typed signature and query it
qa = dspy.Predict('question: str -> response: str')
response = qa(question="what are high memory and low memory on linux?")
print(response.response)
dspy.inspect_history(n=1)  # show the raw prompt/completion of the last LLM call
# the same idea with chain-of-thought: DSPy adds an intermediate reasoning field
cot = dspy.ChainOfThought('question -> response')
res = cot(question="should curly braces appear on their own line?")
print(res.response)
dspy.inspect_history(n=1)
from dspy.datasets import MATH
dataset = MATH(subset='algebra')
dev = dataset.dev[0:10]  # a small dev set for a quick evaluation
example = dataset.train[0]
print("Question:", example.question)
print("Answer:", example.answer)
module = dspy.ChainOfThought("question -> answer")
print(module(question=example.question))
# score the module on the dev set with the dataset's own metric
evaluate = dspy.Evaluate(devset=dev, metric=dataset.metric)
evaluate(module)
Implementing a RAG with DSPy; required imports:
import dspy
from dspy.teleprompt import BootstrapFewShot
from dspy.datasets import HotPotQA
from dspy.evaluate import Evaluate
from sentence_transformers import SentenceTransformer
import pandas as pd
passages0 = pd.read_parquet("hf://datasets/rag-datasets/rag-mini-bioasq/data/passages.parquet/part.0.parquet")
test = pd.read_parquet("hf://datasets/rag-datasets/rag-mini-bioasq/data/test.parquet/part.0.parquet")
passages = passages0[0:20]
class RetrievalModel(dspy.Retrieve):
    def __init__(self, passages):
        super().__init__()
        self.passages = passages.copy()
        # keep only passages longer than 20 words
        self.passages["valid"] = self.passages.passage.apply(lambda x: len(x.split(' ')) > 20)
        self.passages = self.passages[self.passages.valid]
        self.passages = self.passages.reset_index()
        for i, x in enumerate(self.passages.passage.tolist()):
            print("DOC", i, x)
        # embed every passage once with a small sentence encoder
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.passage_embeddings = self.model.encode(self.passages.passage.tolist())
    def __call__(self, query, k):
        query_embedding = self.model.encode(query)
        similarities = self.model.similarity(query_embedding, self.passage_embeddings).numpy()  # cosine similarities
        top_indices = similarities[0, :].argsort()[::-1][:k]  # pick the top-k documents with the highest cosine similarity
        response = self.passages.loc[list(top_indices)]
        response = response.passage.tolist()
        return [dspy.Prediction(long_text=psg) for psg in response]
rm = RetrievalModel(passages)
qq = "Which cell may suffer from anemia?"
print(rm(qq,2))
llm = dspy.LM(model="ollama/qwen2.5:0.5b", api_base="http://localhost:11434")
print("llm ok")
dspy.settings.configure(lm=llm,rm=rm)
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")
class RAG(dspy.Module):
    def __init__(self, num_passages=2):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
rag = RAG()
pred = rag(qq)
print(f"Question: {qq}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")
llm.inspect_history(n=1)
dataset = []
for index, row in test.iterrows():
    # each example only has the question as input; the context comes from retrieval
    dataset.append(dspy.Example(question=row.question, answer=row.answer).with_inputs("question"))
trainset, devset = dataset[:4], dataset[17:20]
def validate_context_and_answer(example, pred, trace=None):
    # exact match of the answer, and the answer must appear in a retrieved passage
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM
evaluate_on_devset = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=10)
evalres = evaluate_on_devset(rag, metric=validate_context_and_answer)
print(f"Evaluation Result: {evalres}")
teleprompter = BootstrapFewShot(metric=validate_context_and_answer, max_bootstrapped_demos=2, max_labeled_demos=2)
compiled_rag = teleprompter.compile(rag, trainset=trainset)
evalres = evaluate_on_devset(compiled_rag, metric=validate_context_and_answer)
print(f"Evaluation Result: {evalres}")
llm.inspect_history(n=1)
import dspy
from dsp.utils import deduplicate
from dspy.teleprompt import BootstrapFewShot
from dspy.retrieve.qdrant_rm import QdrantRM
from qdrant_client import QdrantClient
formatted_list = ["Phone Name: HTC Desire 610 8GB Unlocked GSM 4G LTE Quad-Core Android 4.4 Smartphone - Black (No Warranty)\nReview: The phone is very good , takes very sharp pictures but the screen is not bright'",
"Phone Name: Apple iPhone 6, Space Gray, 128 GB (Sprint)\nReview: I am very satisfied with the purchase, i got my iPhone 6 on time and even received a screen protectant with a charger. Thank you so much for the iPhone 6, it was worth the wait.",
]
client = QdrantClient(":memory:")
def add_documents(client, collection_name, formatted_list, batch_size=10):
    for i in range(0, len(formatted_list), batch_size):
        batch = formatted_list[i:i + batch_size]
        batch_ids = list(range(i + 1, i + 1 + len(batch)))
        client.add(
            collection_name=collection_name,
            documents=batch,
            ids=batch_ids
        )
        print(f"Batch {i // batch_size + 1} added with {len(batch)} documents.")
add_documents(client, "phone_collection", formatted_list)
qdrant_retriever_model = QdrantRM("phone_collection", client)
dspy.settings.configure(lm= llm, rm=qdrant_retriever_model)
# signature used by the per-hop query generators below
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

class Multihoprag(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        # one query generator per hop
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            # generate a search query from the question and the context gathered so far
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)
trainset_list = [
{
"Question": "Which phones have the best camera quality and battery life based on recent reviews and specifications?",
"Answer": "Here's a list of phones that meet your criteria:\n\n1. Samsung Galaxy S21 Ultra\n2. Google Pixel 6 Pro\n3. Apple iPhone 13 Pro Max\n4. OnePlus 9 Pro\n5. Xiaomi Mi 11 Ultra\n\nNotes: These phones were picked based on their high ratings for camera quality and long-lasting battery life, as reported by recent reviews and detailed specifications."
},
{
"Question": "What are the top-rated phones with the best display and performance in the market right now?",
"Answer": "Here's a list of phones that meet your criteria:\n\n1. Samsung Galaxy S22\n2. Apple iPhone 14 Pro\n3. OnePlus 10 Pro\n\nNotes: These phones were selected because they have received excellent reviews for their display clarity and performance speed, making them ideal for users seeking high-quality visuals and efficient processing."
},
{
"Question": "Can you recommend phones that have the best user interface and build quality according to recent user reviews?",
"Answer": "Here's a list of phones that meet your criteria:\n\n1. Nokia 8.3 5G\n2. Sony Xperia 1 III\n\nNotes: These phones were chosen due to their outstanding user interface design and robust build quality, which have been highly praised in recent user reviews and expert evaluations."
}
]
trainset = [dspy.Example(question=item["Question"], answer=item["Answer"]).with_inputs('question') for item in trainset_list]
# metric function that prefers short and non-repetitive answers
def validate_answer_and_hops(example, pred, trace=None):
    # if not validate(pred.answer == example.answer): return False
    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]
    if max([len(h) for h in hops]) > 100: return False  # reject over-long hop queries
    # reject a hop query that mostly repeats an earlier one
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False
    return True
teleprompter = BootstrapFewShot(metric=validate_answer_and_hops)
uncompiled_rag = Multihoprag()
compiled_rag = teleprompter.compile(student=uncompiled_rag, trainset= trainset)
print(uncompiled_rag("Which smartphones that are highly rated for their low-light camera performance also have a great front camera?"))
print(compiled_rag("Which smartphones that are highly rated for their low-light camera performance also have a great front camera?"))
If the LLM is good enough, is there any need to finetune?
Is it possible to get a good enough LLM?
Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
---|---|---|---|---|---|---|---|---|
Full | 32 | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
LoRA/GaLore/BAdam | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |
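As a sanity check on the "Full" rows, assuming Adam-style training keeps roughly four copies of the parameters (weights, gradients, and two optimizer moments):

\[7\text{B} \times 2~\text{bytes} \times 4 \approx 56~\text{GB} \quad\text{(vs. 60GB at 16 bits)}, \qquad 7\text{B} \times 4~\text{bytes} \times 4 \approx 112~\text{GB} \quad\text{(vs. 120GB at 32 bits)}\]

with activations and buffers accounting for the remainder.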
Method | Data (tokens) | Notes |
---|---|---|
Pretraining | >10T | Full training |
Cont. pretr. | \(\simeq\)100B | Update: PEFT? |
Finetuning | 1K … 1B | Adapt to task: PEFT |
Few-Shot learning | <1K | Guide, help the LLM |
\[X \leftarrow \mathrm{ReLU}(X\cdot W_{down}) \cdot W_{up} + X\]
with
\[W_{down} \in \mathbb{R}^{d\times k}~~~~W_{up} \in \mathbb{R}^{k\times d}\]
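A minimal PyTorch sketch of this bottleneck adapter (module and variable names are mine):

import torch

class Adapter(torch.nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, d: int, k: int):
        super().__init__()
        self.down = torch.nn.Linear(d, k)   # W_down in R^{d x k}
        self.up = torch.nn.Linear(k, d)     # W_up in R^{k x d}

    def forward(self, x):
        return self.up(torch.relu(self.down(x))) + x   # residual keeps the frozen path

With \(k \ll d\), each adapter adds only about \(2dk\) trainable parameters while the base model stays frozen.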
Performance comparison
model.add_adapter(lora_config, adapter_name="adapter_1")
model.add_adapter(lora_config, adapter_name="adapter_2")
model.set_adapter("adapter_1")
Can be applied to sparse FT (a sketch follows below):
- FT an LLM on a specific task/lang
- extract the mask = the params that changed the most
- rewind the LLM and re-FT with the mask
- sparse finetunes can be combined without overlapping!
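A hedged sketch of the mask/rewind/re-FT steps, assuming `model` has just been finetuned and `pretrained_state` holds the original weights (the 1% threshold is an arbitrary illustration; on a real LLM the quantile would be computed per tensor or on a sample):

import torch

# 1) rank parameters by how much finetuning moved them
deltas = {n: (p.detach() - pretrained_state[n]).abs()
          for n, p in model.named_parameters()}
flat = torch.cat([d.flatten() for d in deltas.values()])
threshold = flat.quantile(0.99)                       # keep the top 1% most-changed params
masks = {n: (d >= threshold).float() for n, d in deltas.items()}

# 2) rewind to the pretrained weights
model.load_state_dict(pretrained_state)

# 3) re-finetune, masking gradients so only the selected params can move
for n, p in model.named_parameters():
    p.register_hook(lambda g, m=masks[n]: g * m)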
Pretraining and FT use the same basic algorithm (SGD), but the differences in data size lead to different training regimes.
Why such a difference in regimes?
Why doesn’t pretraining forget?
1. Chunk the long sequence and encode each chunk independently.
2. For each token to generate:
   - the long-context input is composed of the (long) past KV-cache + the current tokens;
   - the KV-cache is composed of a (small) initial sequence, which is kept, and a (long) evicted sequence;
   - the evicted KV pairs are stored in external memory;
   - at test time, a lookup f() selects KV pairs from external memory to add to the small context (see the sketch below).
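A toy sketch of the lookup f() as dot-product retrieval over the evicted keys (the shapes are assumptions: `evicted_k` and `evicted_v` of shape (N, d), `query` of shape (d,)):

import torch

def lookup(query, evicted_k, evicted_v, k=32):
    """Return the k evicted KV pairs whose keys best match the current query."""
    scores = evicted_k @ query                 # (N,) relevance of each evicted key
    top = scores.topk(min(k, len(scores))).indices
    return evicted_k[top], evicted_v[top]      # re-injected into the small KV-cache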
There’s some hope though…
 | Cost (t CO2) |
---|---|
560-person virtual conf | 10 |
560-person F2F conf | 274 |
18000-person virtual conf | 176 |
18000-person F2F conf | 10348 |
emissions of 1 car/year source | 2.2 |
emissions of all cars in France/year | 65M |
training Bloom | 25 |
Loss: UL2
MAGMAX: Leveraging Model Merging for Seamless Continual Learning
Making an LLM forget by finetuning does not work, because LLMs do not actually forget internally: https://arxiv.org/abs/2409.02228
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import random
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
print(tokenizer.special_tokens_map)
class Traindata(torch.utils.data.Dataset):
    def __init__(self):
        super().__init__()
        # toy corpus: two one-character "documents" with binary labels
        d = ['a','b']
        self.y = [0,1]
        self.x = tokenizer.batch_encode_plus(d, return_tensors='pt')['input_ids'].split(1)
        print("tokenization done", len(self.x))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')
# freeze everything except the last transformer layer
for n, p in model.named_parameters():
    if '.layer.5' not in n: p.requires_grad = False
    print(n, p.shape)
opt = torch.optim.SGD(model.parameters(), lr=1e-5)
traindata = Traindata()
trainloader = torch.utils.data.DataLoader(traindata, batch_size=1, shuffle=True)
for ep in range(100):
    totl = 0.
    for x, y in trainloader:
        opt.zero_grad()
        x = x.view(1, -1)  # one sequence per batch
        yy = torch.LongTensor(y)
        pred = model(x)
        loss = torch.nn.functional.cross_entropy(pred['logits'], yy)
        totl += loss.item()
        loss.backward()
        opt.step()
    print(ep, totl)
    writer.add_scalar("trainloss", totl, ep)  # log the epoch loss to TensorBoard
writer.flush()
exit()
# to view the curves:
# tensorboard --logdir=runs/
A great pedagogical take on LLMs by Sasha Rush: video
Computing FLOPs: https://kipp.ly/transformer-inference-arithmetic/