Notational Variations in State-Action Sequences (Trajectories)
A state-action sequence, or trajectory (τ), documents the path an agent takes through an environment. While the core concept is consistent, the notation used to represent these sequences can vary. For instance, a trajectory may be denoted as starting from time step 1, such as τ = {(s₁, a₁), (s₂, a₂), ...}, often to align with other notation in a specific context, like sequence prediction. Alternatively, it is common in reinforcement learning literature to see trajectories starting from time step 0, with varying lengths, such as τ = {(s₀, a₀), ..., (s_T, a_T)} or τ = {(s₀, a₀), ..., (s_{T−1}, a_{T−1})}. These notational differences are a matter of convention and do not alter the fundamental principles or models being discussed.
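The point that indexing is purely a labeling convention can be made concrete with a small sketch (state and action names here are illustrative placeholders, not from the text):

```python
# A trajectory is an ordered list of (state, action) pairs; whether the
# time index starts at 0 or 1 only changes the labels, not the pairs.
trajectory = [("s0", "a0"), ("s1", "a1"), ("s2", "a2")]

# 0-indexed convention: (s_0, a_0), ..., (s_T, a_T) with T = len - 1
zero_indexed = {t: pair for t, pair in enumerate(trajectory)}

# 1-indexed convention: (s_1, a_1), ..., (s_T, a_T) with T = len
one_indexed = {t + 1: pair for t, pair in enumerate(trajectory)}

# Same underlying sequence, shifted labels.
assert zero_indexed[0] == one_indexed[1]
```

Either dictionary describes the same path through the environment; only the keys differ.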
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Markov Process
Markov Decision Process
Return
Limitations of Reinforcement Learning
Policy
State-Value and Action-Value Functions
Notational Variations in State-Action Sequences (Trajectories)
Objective Function as Expected Cumulative Reward (Performance Function)
An agent operates in an environment where sequences of events unfold over time. The agent's behavior is described by a policy, denoted π(a|s), which gives the probability of taking action a when in state s. The environment's dynamics are described by a transition function, P(s′|s, a), which gives the probability of moving to the next state s′ after taking action a in state s. The process begins from an initial state s₀, drawn with probability P(s₀).
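The interaction loop described above can be sketched as a rollout in a tiny toy MDP. All names and probabilities here (states, actions, the deterministic dynamics) are illustrative assumptions, not from the text:

```python
import random

P_s0 = {"s0": 1.0}                           # initial-state distribution P(s0)
policy = {                                   # pi(a | s)
    "s0": {"a0": 1.0},
    "s1": {"a1": 1.0},
}
transition = {                               # P(s' | s, a)
    ("s0", "a0"): {"s1": 1.0},
    ("s1", "a1"): {"s2": 1.0},
}

def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point round-off

def rollout(steps, rng):
    """Sample a trajectory tau = [(s_t, a_t), ...] of the given length."""
    s = sample(P_s0, rng)
    tau = []
    for _ in range(steps):
        a = sample(policy[s], rng)               # agent acts: a ~ pi(.|s)
        tau.append((s, a))
        s = sample(transition[(s, a)], rng)      # environment responds
    return tau, s

tau, final_state = rollout(2, random.Random(0))
# With these deterministic toy dynamics: tau is [("s0","a0"), ("s1","a1")]
# and final_state is "s2" - exactly the two-step sequence described below.
```

Because every distribution here puts probability 1 on a single outcome, the rollout always reproduces the s₀ → a₀ → s₁ → a₁ → s₂ sequence; with stochastic distributions, different trajectories would be sampled with the corresponding probabilities.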
Consider the following specific two-step sequence of events (a trajectory):
- The process starts in state s₀.
- The agent takes action a₀.
- The environment transitions to state s₁.
- The agent takes action a₁.
- The environment transitions to state s₂.
Which expression correctly represents the probability of this entire specific trajectory occurring?
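For reference, under the Markov assumption the probability of a specific trajectory factorizes into the initial-state probability times alternating policy and transition terms:

```latex
P(\tau) = P(s_0)\,\pi(a_0 \mid s_0)\,P(s_1 \mid s_0, a_0)\,\pi(a_1 \mid s_1)\,P(s_2 \mid s_1, a_1)
```

Each factor conditions only on the current state (and action), never on the full history.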
Diagnosing a Faulty Sequence Generation Process
When modeling the generation of a sequence of states and actions as a Markov Decision Process, the probability of transitioning to a new state at any given step depends on the complete history of all states and actions that have occurred since the beginning of the sequence.
Notational Variations in State-Action Sequences (Trajectories)
An agent is generating a sequence by interacting with an environment. For a single time step, starting from state s_t, arrange the following events in the correct logical order.
Learn After
Cumulative Reward of a Trajectory
An agent in an environment completes a sequence of two actions. It starts in an initial state s₀, performs action a₀ to reach state s₁, and then performs action a₁ to reach the final state s₂. Which of the following notations correctly represents the full sequence of state-action pairs, often called a trajectory (τ)?
Critiquing Trajectory Notations
An agent interacts with an environment for a total of T time steps, resulting in a sequence of states and actions. Match each mathematical notation for this sequence (trajectory, τ) to the description that accurately characterizes its structure and length.