Multiple Choice

A language model is being trained to generate text. At a certain step, it considers generating the next token. The system has the following estimates:

  • The value (expected future rewards) of the current state is 1.2.
  • After generating a specific token, the immediate reward received is +0.5.
  • The value of the new state after generating the token is 1.0.
  • The discount factor for future rewards is 0.9.

Based on the standard temporal difference method for estimating the advantage, what is the advantage of taking this action, and what does it imply?
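Assuming the standard one-step TD definition of the advantage, A(s, a) = r + γ·V(s′) − V(s), the numbers from the question can be plugged in directly; a minimal sketch:

```python
# One-step temporal-difference (TD) advantage estimate:
#   A(s, a) = r + gamma * V(s') - V(s)
# All values below come from the question statement.
V_s = 1.2       # value (expected future rewards) of the current state
r = 0.5         # immediate reward after generating the token
V_s_next = 1.0  # value of the new state after generating the token
gamma = 0.9     # discount factor

advantage = r + gamma * V_s_next - V_s
print(round(advantage, 2))  # prints 0.2
```

A positive advantage (+0.2) implies generating this token is estimated to be better than the average action from the current state.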

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science