Outcome-based Approaches for LLM Fine-Tuning
In outcome-based approaches to LLM fine-tuning, supervision is applied exclusively to the verified end result. The model is optimized to maximize some form of reward based on the final outcome, such as a pass/fail signal on the final answer or a score from a reward model. This is the standard methodology for learning from human feedback, where evaluation covers the complete input-output sequence rather than the intermediate steps.
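As a rough illustration, here is a minimal sketch in plain Python of how an outcome-based reward assigns a single scalar to an entire generated sequence based solely on its final answer. The `extract_final_answer` helper and the last-line answer convention are assumptions for the toy example, not part of any particular framework.

```python
def extract_final_answer(sequence: str) -> str:
    """Assumption for this sketch: the final answer is the last line of the output."""
    return sequence.strip().splitlines()[-1]

def outcome_reward(sequence: str, reference_answer: str) -> float:
    """Score only the end result; intermediate steps are never inspected."""
    return 1.0 if extract_final_answer(sequence) == reference_answer else 0.0

# Toy usage: two responses with different reasoning quality but the same verdict.
good = "Step 1: 2 + 2 = 4\nStep 2: 4 * 3 = 12\n12"
bad = "Step 1: 2 + 2 = 5\nStep 2: nonsense\n12"  # flawed steps, correct answer

print(outcome_reward(good, "12"))  # 1.0
print(outcome_reward(bad, "12"))   # 1.0 -- outcome-based supervision cannot tell them apart
```

Because the signal depends only on the outcome, a response with flawed intermediate reasoning but a correct final answer earns the same reward as a fully sound one.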
A team is fine-tuning a large language model to solve complex, multi-step logic puzzles. They are testing two different supervisory approaches:
- Approach 1: The model generates the full sequence of reasoning steps and a final answer. A human evaluator then checks only whether the final answer is correct. The model receives a positive signal if the answer is correct and a negative signal if it is incorrect, regardless of the reasoning steps.
- Approach 2: The model generates its reasoning one step at a time. After each step, a human evaluator checks whether that individual step is logically sound and follows correctly from the previous ones. The model receives a supervisory signal for each intermediate step in its reasoning chain.
What is the fundamental difference in how supervision is applied in these two approaches?
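For concreteness, the sketch below shows where the supervisory signal attaches in each approach: once per attempt in Approach 1, once per step in Approach 2. The `step_is_sound` helper and the toy "wrong"-marker convention are hypothetical stand-ins for human judgment.

```python
from typing import List

def step_is_sound(step: str, previous_steps: List[str]) -> bool:
    """Stand-in for a human judgment of one reasoning step; in this toy
    example, unsound steps are simply marked with the word 'wrong'."""
    return "wrong" not in step

def approach_1_signal(final_answer: str, reference: str) -> float:
    """Approach 1: a single signal for the whole attempt, keyed to the outcome only."""
    return 1.0 if final_answer == reference else -1.0

def approach_2_signals(steps: List[str]) -> List[float]:
    """Approach 2: one signal per step, keyed to each step's soundness."""
    return [1.0 if step_is_sound(step, steps[:i]) else -1.0
            for i, step in enumerate(steps)]

steps = ["Assume P implies Q",
         "Observe Q",
         "Conclude P (wrong: affirms the consequent)"]
print(approach_1_signal(final_answer="P", reference="P"))  # 1.0, despite the flawed step
print(approach_2_signals(steps))                           # [1.0, 1.0, -1.0]
```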
Recommending a Fine-Tuning Strategy for an AI Algebra Tutor
A team is fine-tuning a large language model for multi-step reasoning tasks. They are considering two general approaches for providing supervision: one that focuses only on the final answer, and one that evaluates each step of the reasoning process. Classify each of the following scenarios or characteristics by matching it to the correct supervisory approach.
A team is fine-tuning a language model to act as a programming assistant that writes code. For each programming problem, the model generates a block of code. The fine-tuning process involves running the generated code against a set of predefined tests. If the code passes all the tests, the model receives a high reward. If it fails any test, it receives a low reward. The structure, style, or efficiency of the code itself is not directly evaluated for the reward signal. Which principle of model fine-tuning does this scenario best exemplify?
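As a loose illustration of the reward scheme this scenario describes, the sketch below assumes the generated code defines a `solve` function and that the predefined tests are (input, expected output) pairs; both the name and the convention are hypothetical.

```python
def unit_test_reward(generated_code, tests):
    """Binary outcome reward: 1.0 iff the code passes every predefined test.
    `tests` is a list of (input, expected_output) pairs (an assumption)."""
    namespace = {}
    try:
        exec(generated_code, namespace)          # run the model-generated code
        solve = namespace["solve"]               # assumed entry point
        passed = all(solve(x) == expected for x, expected in tests)
    except Exception:
        passed = False                           # crashes or a missing `solve` count as failure
    return 1.0 if passed else 0.0

code = "def solve(x):\n    return x * 2"
print(unit_test_reward(code, [(1, 2), (3, 6)]))  # 1.0 -- all tests pass
print(unit_test_reward(code, [(1, 3)]))          # 0.0 -- any failing test yields a low reward
```

Only the pass/fail outcome feeds the reward; the code's structure, style, and efficiency never enter the computation.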