LIMA Dataset
The LIMA dataset is an instruction-following dataset consisting of only 1,000 highly curated prompt-response pairs, drawn from community Q&A sources and manually authored examples. Fine-tuning a large model, such as the 65-billion-parameter LLaMA, on this small but carefully crafted set has been shown to produce performance competitive with, or superior to, models fine-tuned with substantially more data and effort.
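The core idea is that a small, high-quality set of instruction-response pairs can serve as the entire supervised fine-tuning corpus. A minimal sketch of how such pairs might be formatted into training strings is below; the field names (`instruction`, `response`) and the separator template are illustrative assumptions, not LIMA's actual format.

```python
def format_example(example: dict) -> str:
    """Concatenate one instruction-response pair into a single
    training string, with a simple separator between the turns.
    The template here is a hypothetical choice, not LIMA's own."""
    return f"{example['instruction']}\n\n### Response:\n{example['response']}"

# A curated dataset is just a short list of high-quality pairs;
# these examples are illustrative placeholders, not LIMA data.
curated = [
    {
        "instruction": "Explain overfitting in one sentence.",
        "response": "Overfitting is when a model memorizes training data "
                    "instead of learning patterns that generalize.",
    },
    {
        "instruction": "Write a haiku about autumn.",
        "response": "Red leaves drift and fall\nCool wind carries them away\n"
                    "The year grows quiet",
    },
]

# These strings would then be tokenized and fed to a standard
# supervised fine-tuning loop over the base language model.
train_texts = [format_example(ex) for ex in curated]
```

The point LIMA makes is that quality and diversity of such pairs matter far more than their quantity: a thousand well-chosen examples can align a strong base model.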
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A development team is fine-tuning a large, pre-trained language model to create a general-purpose assistant. One team member argues that to be effective, their fine-tuning dataset must contain examples of every conceivable task the assistant might be asked to perform, such as summarizing legal documents, writing poetry, translating between niche languages, and explaining complex scientific theories. Which of the following statements provides the most accurate critique of this team member's argument?
Evaluating Fine-Tuning Strategies
A large language model, initially trained on a vast and diverse corpus of text from the internet, is subsequently adjusted using a specialized dataset consisting only of 5,000 question-answer pairs about world geography. After this adjustment process, the model will be unable to generate a short poem about a sunset.