Short Answer

Analyzing the Trade-off in a Policy Optimization Objective

Consider the following composite objective function used in a policy optimization algorithm: Ucomposite=UsurrogateβPenaltyU_{\text{composite}} = U_{\text{surrogate}} - \beta \cdot \text{Penalty} Explain the fundamental trade-off that the hyperparameter β is designed to manage during the training process.

0

1

Updated 2025-10-08

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science