Data Strategy for a Customer Support Chatbot
A startup is developing a specialized chatbot to answer technical support questions for their software product. The development team has a tight deadline and limited budget. They must choose between two potential datasets to train their model. Based on the descriptions below, which dataset represents the better strategic choice for the team? Justify your decision by evaluating the potential risks and benefits of each option in relation to the goal of creating a reliable and helpful chatbot.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A research team is fine-tuning a language model to be a highly accurate and safe legal assistant. They have two datasets available:
- Dataset X: 2,000,000 legal question-answer pairs automatically scraped from public internet forums. A spot-check reveals that approximately 30% of the answers contain factual inaccuracies or outdated information.
- Dataset Y: 75,000 legal question-answer pairs that have been carefully written, reviewed, and verified for accuracy by legal experts.
Which dataset should the team prioritize for fine-tuning to achieve the best performance for their specific goal, and what is the most compelling reason?
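To put the two datasets on a common scale, the rough arithmetic can be sketched as follows. This is only an illustration: it assumes the 30% spot-check rate holds across all of Dataset X, and the function and variable names are hypothetical, not part of the original question.

```python
def estimate_noisy_examples(total_pairs: int, error_rate: float) -> int:
    """Estimate how many examples are flawed, assuming the spot-check
    error rate applies uniformly to the whole dataset (an assumption)."""
    return round(total_pairs * error_rate)


# Figures taken from the dataset descriptions above.
dataset_x_total = 2_000_000   # scraped forum Q&A pairs
dataset_x_error_rate = 0.30   # approximate rate from the spot-check
dataset_y_total = 75_000      # expert-verified Q&A pairs

noisy_in_x = estimate_noisy_examples(dataset_x_total, dataset_x_error_rate)
clean_in_x = dataset_x_total - noisy_in_x

# How many flawed Dataset X answers exist per verified Dataset Y answer.
noisy_per_verified = noisy_in_x / dataset_y_total

print(f"Estimated flawed answers in Dataset X: {noisy_in_x:,}")
print(f"Estimated usable answers in Dataset X: {clean_in_x:,}")
print(f"Flawed X answers per verified Y answer: {noisy_per_verified:.1f}")
```

Under that assumption, the estimated number of flawed answers in Dataset X alone is several times larger than all of Dataset Y, which is one way to frame the quality-versus-quantity trade-off the question asks about. The calculation does not say which flawed answers they are, so filtering them out would still require review.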
Impact of Data Quality on Fine-Tuning Sample Size
When fine-tuning a language model for a specialized task, the most effective strategy is always to maximize the sheer volume of training examples, even if it means including data that is noisy, inconsistent, or only loosely related to the target task.