LLM4ALL Year 1 meeting

  • News:
    • consortium agreement
    • Carbon-ANR, 19 Nov
    • journée TAL (NLP day), Rennes, 13 Dec
  • Deliverables: deadline 31 Oct
    • D3.1 (augmented dialogue dataset)
    • D3.2 (summarization)
    • D4.1 (SimSAMU)
    • D4.2 (ASR)
    • D4.4 (ER calls)
    • D5.3 (exploitation plan)
  • Intermediate report: 1 Apr 2025

LLM4ALL WP1

  • 2023 objectives:
    • updating an LLM in French with continual learning
    • approach proposed in 2023:
      • Sheared-LLaMA: prune + retrain (see the pruning sketch after this list)
  • Work done (Yaya will present at the next PMT)
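
A minimal sketch of the prune-then-retrain idea behind Sheared-LLaMA, reduced to structured pruning of a single feed-forward block followed by continued training. The real method learns pruning masks under target-architecture constraints and uses dynamic batch loading, so everything below (the toy module, the weight-norm importance score, the training loop) is an illustrative assumption, not the project code.

    # Minimal sketch: structured pruning of one FFN block, then continued training.
    # The toy module and the importance score are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class ToyFFN(nn.Module):
        def __init__(self, d_model=512, d_hidden=2048):
            super().__init__()
            self.up = nn.Linear(d_model, d_hidden)
            self.down = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            return self.down(torch.relu(self.up(x)))

    def prune_ffn(ffn: ToyFFN, keep_ratio: float = 0.5) -> ToyFFN:
        """Keep the hidden units with the largest weight norms (a crude importance proxy)."""
        importance = ffn.up.weight.norm(dim=1) * ffn.down.weight.norm(dim=0)
        k = int(keep_ratio * importance.numel())
        keep = importance.topk(k).indices.sort().values
        pruned = ToyFFN(ffn.up.in_features, k)
        with torch.no_grad():
            pruned.up.weight.copy_(ffn.up.weight[keep])
            pruned.up.bias.copy_(ffn.up.bias[keep])
            pruned.down.weight.copy_(ffn.down.weight[:, keep])
            pruned.down.bias.copy_(ffn.down.bias)
        return pruned

    # "Retrain": continue training the smaller network on the target data.
    model = prune_ffn(ToyFFN(), keep_ratio=0.5)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(100):                 # placeholder loop; real training uses LM batches
        x = torch.randn(8, 512)          # stands in for token embeddings
        loss = model(x).pow(2).mean()    # stands in for the language-modeling loss
        opt.zero_grad(); loss.backward(); opt.step()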

  • One year later: do we need to rethink the objectives?
    • Issue 1: data
    • Issue 2: state of the art
  • data:
    • On which data should we update our model?
    • Lessons from the SOTA (cf. “Physics of Language Models…”):
      • A new piece of information needs on the order of 1000 occurrences to be learned
      • Standard finetuning is worth only 1 bit
      • Long finetuning on new data induces large forgetting
      • Pretrained LLMs have low capacity to acquire new knowledge
  • One year of recent French data: 1M words
    • Data scraped over one year from headline news (France Info, Le Monde, …)
    • Could scale (with lots of work!) to 100M or 1B words using Wikipedia, more scraping, …
    • This is not continued pretraining, just finetuning
  • The only “finetuning” option?
    • Mix new (repeated) + old data to get more than 100B words (see the data-mixing sketch after this list)
    • Continue training on this mix
  • Issues:
    • Large cost for each update
    • Finding a good mix is difficult (requires many costly experiments)
    • Isn’t it better to pretrain from scratch? (pretrained LLMs are not good at acquiring new knowledge!)
  • state-of-the-art:
    • Best updating method with “small” data: RAG, LLM agents
      • Tested QA on French news
      • RAG gets nearly perfect results… so keeping LLMs up to date is solved?! (see the RAG sketch after the example question below)
      • Problem: we cannot release the data because of copyright
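
A minimal sketch of the “mix repeated new data with old data” option above, with assumed corpus sizes, exposure target, and per-fact frequency (none of these are project numbers): it estimates how many times the ~1M-word news corpus must be repeated to approach the ~1000-exposures rule of thumb, then builds a shuffled training stream of repeated new data plus replayed old data.

    # Sketch of the data-mixing option; all numbers are illustrative assumptions.
    import random

    NEW_WORDS = 1_000_000          # ~1 year of scraped French news
    OLD_WORDS = 100_000_000_000    # replayed pretraining data (mix target: >100B words)
    TARGET_EXPOSURES = 1000        # rule of thumb: ~1000 occurrences per new fact

    # If a given fact appears only a handful of times in one year of news,
    # the new corpus has to be repeated to approach the exposure target.
    occurrences_per_fact = 3                                 # hypothetical
    repeats = -(-TARGET_EXPOSURES // occurrences_per_fact)   # ceil division -> 334 copies

    new_share = repeats * NEW_WORDS / (repeats * NEW_WORDS + OLD_WORDS)
    print(f"repeat the new corpus {repeats}x -> new data is {new_share:.2%} of the mix")

    def build_mix(new_docs, old_docs, repeats, seed=0):
        """Return a shuffled training stream: `repeats` copies of new docs + old docs."""
        mix = new_docs * repeats + old_docs
        random.Random(seed).shuffle(mix)
        return mix

    # Toy usage with placeholder documents.
    stream = build_mix(["news_doc_1", "news_doc_2"], ["old_doc_1", "old_doc_2"], repeats=3)

Even with ~334 repetitions, the new data ends up well under 1% of a >100B-word mix, which is one way to see why finding a good mix takes so much experimentation.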

Example question from the French-news QA test, 13 September 2024: Which Parisian department store is under threat due to financial problems and payment delays since its takeover by SGM a year ago?
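
A minimal RAG sketch for the French-news QA test above: retrieve dated news snippets for the question, then put them in the prompt so the up-to-date knowledge comes from the retrieved context rather than the model weights. The corpus, the lexical-overlap retriever, and the `ask_llm` placeholder are assumptions for illustration; the actual retriever and LLM used in the test are not specified in these notes.

    # Minimal RAG sketch for QA over dated news snippets.
    # Corpus, retriever, and `ask_llm` are placeholders, not the project pipeline.
    from collections import Counter

    NEWS = [  # hypothetical scraped snippets: (date, text)
        ("2024-09-13", "placeholder news text about the department store bought by SGM ..."),
        ("2024-09-12", "placeholder news text about something else ..."),
    ]

    def score(query: str, doc: str) -> float:
        """Crude lexical overlap between query and document (stand-in for a real retriever)."""
        q, d = Counter(query.lower().split()), Counter(doc.lower().split())
        return sum(min(q[w], d[w]) for w in q)

    def retrieve(query: str, k: int = 3):
        return sorted(NEWS, key=lambda item: score(query, item[1]), reverse=True)[:k]

    def build_prompt(query: str) -> str:
        context = "\n".join(f"[{date}] {text}" for date, text in retrieve(query))
        return (f"Answer using only the context below.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in the LLM of your choice here")

    question = ("Which Parisian department store is under threat due to financial problems "
                "and payment delays since its takeover by SGM a year ago?")
    print(build_prompt(question))  # the up-to-date facts live in the retrieved context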

Cf. the paper “Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition”.
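
The paper tracks how broadly a model’s feed-forward “memories” are activated during pretraining and argues that this breadth decays, hurting later knowledge acquisition. As a rough, simplified proxy (an assumption for illustration, not the paper’s exact definition), the sketch below computes the entropy of the average normalized FFN activation pattern: broad usage gives high entropy, concentrated usage gives low entropy.

    # Rough proxy for "knowledge entropy": how evenly FFN hidden units (the model's
    # "memories") are used across tokens. This simplified definition is an assumption
    # for illustration; the paper's exact formulation differs.
    import torch

    def knowledge_entropy(ffn_acts: torch.Tensor) -> float:
        """ffn_acts: (num_tokens, d_hidden) intermediate FFN activations."""
        usage = ffn_acts.abs().mean(dim=0)            # average usage of each hidden unit
        p = usage / usage.sum()                       # normalize into a distribution
        return float(-(p * (p + 1e-12).log()).sum())  # Shannon entropy (nats)

    # Toy comparison: broad vs. concentrated memory usage.
    broad = torch.rand(1000, 2048)                 # many units active -> high entropy
    concentrated = torch.zeros(1000, 2048)
    concentrated[:, :16] = torch.rand(1000, 16)    # few units active -> low entropy
    print(knowledge_entropy(broad), knowledge_entropy(concentrated))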

  • Proposal: focus on more fundamental training aspects: speed, cost, generalization
    • Growing nets (see the growth sketch below):
      • Good theoretical properties: their minima generalize well
      • Do they limit the knowledge entropy issue?
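
A minimal sketch of one growth operator a growing net could use: Net2Net-style function-preserving widening of a linear pair, after which training simply continues with the larger network. The choice of this particular operator is an assumption for illustration; the notes do not specify which growing-net method is intended.

    # Net2Net-style function-preserving widening: grow the hidden layer, keep training.
    # Using this particular growth operator is an assumption for illustration.
    import torch
    import torch.nn as nn

    def widen_pair(up: nn.Linear, down: nn.Linear, new_hidden: int):
        """Widen the hidden dimension between `up` and `down` without changing the function."""
        old_hidden = up.out_features
        assert new_hidden > old_hidden
        # New units replicate randomly chosen existing units...
        src = torch.randint(0, old_hidden, (new_hidden - old_hidden,))
        idx = torch.cat([torch.arange(old_hidden), src])

        new_up = nn.Linear(up.in_features, new_hidden)
        new_down = nn.Linear(new_hidden, down.out_features)
        with torch.no_grad():
            new_up.weight.copy_(up.weight[idx])
            new_up.bias.copy_(up.bias[idx])
            # ...and each replicated unit's outgoing weights are divided by its
            # replication count, so the widened pair computes the same function.
            counts = torch.bincount(idx, minlength=old_hidden).float()
            new_down.weight.copy_(down.weight[:, idx] / counts[idx])
            new_down.bias.copy_(down.bias)
        return new_up, new_down

    # Sanity check: the widened pair matches the original on random inputs.
    up, down = nn.Linear(64, 128), nn.Linear(128, 64)
    wide_up, wide_down = widen_pair(up, down, new_hidden=192)
    x = torch.randn(4, 64)
    assert torch.allclose(down(torch.relu(up(x))),
                          wide_down(torch.relu(wide_up(x))), atol=1e-5)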

Discussion

  • Issues:
    • French news: not enough data for continual learning (CL)
    • RAG solves LLM updating
  • What is most impactful for WP1?
    • Stick to CL with LSN, but we need to get much more data
    • More theoretical study on growing nets

Future

  • Deliverables! Deadline 31 Oct

  • Topics of discussion:
    • Collaboration among us: which topic? how?
    • Extend our external impact: how?
    • Work on/improve Lucie?
    • Evaluation?
    • What is the impact of the SOTA? Shall we focus more on RAG? Multimodality? Synthetic data?