Multiple Choice

A research team is fine-tuning a language model to be a highly accurate and safe legal assistant. They have two datasets available:

  • Dataset X: 2,000,000 legal question-answer pairs automatically scraped from public internet forums. A spot-check reveals that approximately 30% of the answers contain factual inaccuracies or outdated information.
  • Dataset Y: 75,000 legal question-answer pairs that have been carefully written, reviewed, and verified for accuracy by legal experts.

Which dataset should the team prioritize for fine-tuning to achieve the best performance for their specific goal, and what is the most compelling reason?

0

1

Updated 2025-10-02

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science