1Cademy - A machine learning team is developing a new large-scale text-generating model. They must choose between two potential training datasets. Dataset A contains 5 terabytes of raw, unfiltered text scraped from a wide variety of public websites. Dataset B contains 1 terabyte of text that has been carefully curated, cleaned for errors, and filtered to remove undesirable content. Given that the primary goal is to create a reliable and high-performing model, which of the following is the most justifiable

Learn Before

Data Quality as a Key Issue in LLM Training

Multiple Choice

A machine learning team is developing a new large-scale text-generating model. They must choose between two potential training datasets. Dataset A contains 5 terabytes of raw, unfiltered text scraped from a wide variety of public websites. Dataset B contains 1 terabyte of text that has been carefully curated, cleaned for errors, and filtered to remove undesirable content. Given that the primary goal is to create a reliable and high-performing model, which of the following is the most justifiable

Updated 2025-09-26

Contributors are:

Who are from:

Learn Before

Related