Learn Before
Evaluating Data Sources for LLM Pre-training
A technology company is developing a new large language model intended for public use. A large portion of its pre-training data comes from a massive crawl of the public internet, including forums and social media. Critically evaluate the decision to rely heavily on this type of data. In your response, weigh the potential benefits for the model's general capabilities against the ethical and performance-related risks.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating Data Sources for LLM Pre-training
Data Source Selection for a Specialized LLM
A newly developed large language model demonstrates high fluency and generates grammatically perfect, conversational text. However, it frequently provides outdated information, struggles to generate well-structured, long-form content like reports, and often fabricates details when asked about events from the last year. Based on these specific performance characteristics, which of the following descriptions most likely represents the composition of its pre-training dataset?
GPT-3
Falcon
LLaMA2
PaLM-540B
Gemma-7B