1Cademy

Learn Before

Worth Function in Plackett-Luce for RLHF Reward Modeling

Multiple Choice

A system for ranking text responses first assigns a numerical reward score to each response, and then calculates a 'worth' value for each response using the formula: worth = exp(reward score). Consider two scenarios:

Scenario 1: Response A has a reward score of 3.0, and Response B has a reward score of 1.0.

Scenario 2: Response C has a reward score of 8.0, and Response D has a reward score of 6.0.

How does the ratio of worths (Worth_A / Worth_B) in Scenario 1 compare to the ratio of worths (Wo
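The worth formula can be checked numerically. Below is a minimal Python sketch that takes worth = exp(reward score) exactly as stated in the question. The helper names `worth` and `worth_ratio` are hypothetical and do not belong to any 1Cademy or library API. The sketch computes the worths from the reward scores in the two scenarios and compares each ratio with exp of the score difference.

```python
import math


def worth(reward: float) -> float:
    """Worth of a response: exp(reward score), as defined in the question."""
    return math.exp(reward)


def worth_ratio(reward_x: float, reward_y: float) -> float:
    """Ratio Worth_X / Worth_Y, computed from the two individual worths."""
    return worth(reward_x) / worth(reward_y)


# Reward scores taken from the two scenarios in the question.
scenarios = {
    "Scenario 1 (A/B)": (3.0, 1.0),
    "Scenario 2 (C/D)": (8.0, 6.0),
}

for name, (r_x, r_y) in scenarios.items():
    ratio = worth_ratio(r_x, r_y)
    # exp(r_x) / exp(r_y) equals exp(r_x - r_y), so the ratio of worths
    # depends only on the difference between the two reward scores.
    print(f"{name}: ratio = {ratio:.4f}, exp(difference) = {math.exp(r_x - r_y):.4f}")
```

Calling `math.exp` on each score works at these magnitudes. For very large reward scores, computing exp(r_x - r_y) directly avoids overflowing the individual worths.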

Updated 2025-10-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course