Learn Before
  • The Pre-training and Fine-tuning Paradigm

Common Data Sources for Pre-training LLMs

The pre-training of large language models relies on vast and varied text corpora. Key sources for these datasets include webpages, books, conversational text, software code, Wikipedia, and news articles, along with other materials such as scientific papers and content from question-and-answer (Q&A) platforms.
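In practice, these sources are not used in equal proportions: pre-training pipelines typically sample from each corpus according to mixture weights. The sketch below illustrates the idea with weighted sampling; the specific weights are illustrative assumptions, not figures from this course.

```python
import random

# Hypothetical mixture weights, for illustration only. Real LLM pipelines
# tune these proportions empirically, often upweighting high-quality
# sources (books, Wikipedia) relative to raw web text.
SOURCE_WEIGHTS = {
    "webpages": 0.60,
    "books": 0.12,
    "conversational_text": 0.08,
    "code": 0.08,
    "wikipedia": 0.05,
    "news": 0.04,
    "scientific_papers": 0.02,
    "qa_platforms": 0.01,
}

def sample_sources(n, weights, seed=0):
    """Draw n source labels according to the mixture weights."""
    rng = random.Random(seed)
    return rng.choices(list(weights), weights=list(weights.values()), k=n)

# Over many draws, each source's empirical share approaches its weight.
counts = {}
for s in sample_sources(100_000, SOURCE_WEIGHTS):
    counts[s] = counts.get(s, 0) + 1
```

A real pipeline would sample documents (or token batches) from each corpus rather than just labels, but the mixture-weight mechanism is the same.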

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Types of Pretrained Language Model

  • Pre-training tasks

  • Extensions of Pre-trained models

  • Foundation Models

  • Historical Context of Pre-training

  • Examples of Pre-trained Transformers by Architecture

  • Paradigm Shift in NLP Driven by Pre-training

  • Future Research Directions in Large-Scale Pre-training

  • Role of Pre-training in Developing Latent Abilities

  • Common Data Sources for Pre-training LLMs

  • Training Auxiliary Parameters with a Fixed Transformer Model

  • Synergy of Transformers and Self-Supervised Learning

  • Core Problem Types in NLP Pre-training

  • Scope of Introductory Discussions on Pre-training

  • Application of Self-Supervised Pre-training Across Model Architectures

  • Scope of Foundational Concepts in Pre-training and Adaptation

  • Tokens vs. Words in NLP

  • Self-supervised Pre-training

  • Data Scale Disparity: Pre-training vs. Fine-tuning

  • A small biotech company wants to build an AI model to classify protein sequences for a very specific function. They have a high-quality, but small, labeled dataset of 10,000 sequences. They have limited computational resources and a tight deadline. Which of the following strategies represents the most effective and efficient approach for them to develop a high-performing model?

  • Diagnosing a Flawed Model Development Strategy

  • The development of large-scale AI models typically involves two distinct stages. Match each characteristic below to the stage it describes.

  • Scope of Introductory Discussion on Pre-training in NLP

Learn After
  • Evaluating Data Sources for LLM Pre-training

  • Data Source Selection for a Specialized LLM

  • A newly developed large language model demonstrates high fluency and generates grammatically perfect, conversational text. However, it frequently provides outdated information, struggles to generate well-structured, long-form content like reports, and often fabricates details when asked about events from the last year. Based on these specific performance characteristics, which of the following descriptions most likely represents the composition of its pre-training dataset?